5.4 Tests of independence
The next chi-square test we will cover is probably the most common - the chi-square test of independence. Here, we move from one categorical variable to two.
Chi-square tests of independence are used when we want to test whether two categorical variables are associated with each other (i.e. show a relationship). Some examples of questions this test can answer:
- Is smoking history (yes/no) associated with lung cancer diagnosis (yes/no)?
- Is there an association between gender and employment status?
5.4.1 Example scenario
We’ll start off with a very basic example. In the dataset below, children from several schools were surveyed about which instrument they played. This dataset focuses on two instruments that have historically been seen as gendered (e.g. see Abeles 2009) - clarinet and drums. The sex of the child playing the instrument was also recorded.
Our research question is: is there an association between sex and instrument choice?
Dataset:
## Rows: 122 Columns: 3
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (2): instrument, sex
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
5.4.2 Contingency tables
The primary way of ‘drawing up’ categorical data, particularly when two variables are involved, is to draw a contingency table. A contingency table is a two-way table that shows how many participants/items/objects fall under each combination of our two variables. Here is a contingency table of our data, with row and column totals added:
# Cross-tabulate instrument (rows) by sex (columns); %>% assumes the tidyverse is loaded
w7_instrument_table <- table(instrument_data$instrument,
                             instrument_data$sex)
w7_instrument_table %>%
  addmargins()
##
##             F   M Sum
##   clarinet 51  48  99
##   drums     6  17  23
##   Sum      57  65 122
5.4.3 Expected frequencies
To calculate expected frequencies in a two-way contingency table (i.e. a test of independence), we use the following formula:
\[ E = \frac{R \times C}{N} \]
Where R = row total, C = column total and N = the total sample size.
Let’s put this into practice with girls who play the clarinet (the top-left cell of the table above). The row total for this cell is 99 (i.e. total number of clarinet players). The column total is 57 (total number of girls). To calculate an expected value for this cell, we would therefore calculate the following:
\(E = \frac{99 \times 57}{122}\)
This works out to be roughly 46.25, which means that if sex and instrument choice were unrelated, we would expect roughly 46 female clarinet players. We then go through and calculate this for each cell, so that we have all of our expected values.
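The same calculation can be done for every cell at once. The following is a minimal sketch, not the code used to produce this page: because the raw survey file is not reproduced here, it rebuilds the table from the counts shown above, and the object names `observed` and `expected` are chosen purely for illustration.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(51, 48,
     6, 17),
  nrow = 2, byrow = TRUE,
  dimnames = list(instrument = c("clarinet", "drums"),
                  sex = c("F", "M"))
)

# E = (row total x column total) / N, computed for all four cells:
# outer() multiplies every row total by every column total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

round(expected, 4)
```

The expected values always add up to the same total as the observed counts (here, 122), which is a quick way to check the arithmetic.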
Once we’ve done that, we can then calculate our chi-square test statistic using the same formula as always:
\[ \chi^2 = \Sigma \frac{(O-E)^2}{E} \]
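As a worked illustration, plugging in the four observed counts and their expected values (46.2541, 52.7459, 10.7459 and 12.2541, from the formula above; each term is rounded to three decimal places) gives:

\[ \chi^2 = \frac{(51-46.2541)^2}{46.2541} + \frac{(48-52.7459)^2}{52.7459} + \frac{(6-10.7459)^2}{10.7459} + \frac{(17-12.2541)^2}{12.2541} \approx 0.487 + 0.427 + 2.096 + 1.838 = 4.848 \]

Most of the total comes from the two drums cells, where the expected counts are smallest. For a table with r rows and c columns, the degrees of freedom are (r - 1)(c - 1), which is 1 for a 2 x 2 table.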
5.4.4 Output
Here’s our output! Firstly, our contingency table of observed counts, followed by the expected frequencies for each cell:
##
##             F  M
##   clarinet 51 48
##   drums     6 17
##
##                  F       M
##   clarinet 46.2541 52.7459
##   drums    10.7459 12.2541
Next is our chi-square test output. As you can see, our test of independence is statistically significant (p = .028). In other words, we reject the null hypothesis that there is no association between instrument and sex.
##
## Pearson's Chi-squared test
##
## data: w7_instrument_table
## X-squared = 4.848, df = 1, p-value = 0.02768
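The call that produced this output is not shown, so the following is a hedged sketch of one way to reproduce it from the counts alone. The argument `correct = FALSE` is an inference rather than something stated on this page: by default, `chisq.test()` applies Yates' continuity correction to 2 x 2 tables, which gives a smaller statistic than the 4.848 reported above. The object names `observed` and `result` are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(51, 48,
     6, 17),
  nrow = 2, byrow = TRUE,
  dimnames = list(instrument = c("clarinet", "drums"),
                  sex = c("F", "M"))
)

# correct = FALSE switches off Yates' continuity correction
# (assumed here because it matches the reported statistic)
result <- chisq.test(observed, correct = FALSE)

result$statistic  # Pearson chi-square statistic
result$parameter  # degrees of freedom
result$p.value    # p-value
result$expected   # expected frequencies under independence
```

The `expected` element of the result holds the same expected frequencies shown in the second table above, so there is no need to calculate them by hand when running the test in R.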
As on the previous page, here’s an example write-up of our results:
A chi-square test of association was conducted to examine whether there was an association between sex and instrument choice. A significant association was observed (\(\chi^2\)(1, N = 122) = 4.85, p = .028; Cramer’s V = .199). There were more male drummers and fewer female drummers than expected.
Note that as part of the write-up above, you would also include a brief interpretation of the effect size - but we will discuss this on the next page.
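For reference, the Cramer’s V value in the write-up can be recovered from the test statistic. This is a minimal sketch assuming the standard formula, V = sqrt(chi-square / (N x min(r - 1, c - 1))); the variable names are illustrative, and interpretation is left to the next page.

```r
chi_sq <- 4.848  # test statistic from the output above
n      <- 122    # total sample size
n_rows <- 2      # clarinet, drums
n_cols <- 2      # F, M

# Cramer's V: chi-square scaled by N and the smaller of (rows - 1) and (columns - 1)
cramers_v <- sqrt(chi_sq / (n * min(n_rows - 1, n_cols - 1)))

round(cramers_v, 3)
```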