10. Categorical Association: Practical 10
Chi-square Test for Association
Objectives
Learn how to do the following in R
- conduct and interpret a chi-square test for association between two categorical variables by using the basic formulas
- conduct and interpret a chi-square test for association between two categorical variables by using chisq.test()
Instructions
- Read through the text below
- Execute the code examples and compare your results with what is explained in the text
- Do the exercises and compare your answers with those in the examples
- Time: 60 minutes
The Chi-square Test for Association
The Chi-Square Test for Association or Independence investigates whether two categorical variables are associated with (dependent on) each other. The starting point for this test is a cross table with the frequencies of the two variables.
The hypotheses for this test are:
H0: The variables are independent
HA: The variables are not independent
For this hypothesis test, an #X^2#-statistic is again calculated with the familiar formula:
\[X^2=\sum_{\text{all categories}}{\dfrac{(\text{Observed}-\text{Expected})^2}{\text{Expected}}}\]
In this case the observed and expected frequencies are calculated for each cell in the cross table. This #X^2#-statistic follows a #\chi^2# distribution with #(r-1)(c-1)# degrees of freedom, where #r# and #c# are the number of rows and columns of the cross table.
Similar to the #\chi^2# distribution in the goodness-of-fit test, the #\chi^2# distribution here is only accurate if the expected cell frequencies are large enough: no more than #20\%# of the cells may have an expected frequency smaller than #5#.
Similar to what we have seen in the previous section, the #p#-value for the #\chi^2# distribution can be calculated using the pchisq() function.
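For example, with a (hypothetical) test statistic of #X^2 = 6.0# and #2# degrees of freedom, the call looks like this:

```r
# Hypothetical values, just to illustrate the call
X2 <- 6.0
df <- 2
# The p-value is the upper-tail probability of the chi-square distribution
pchisq(X2, df, lower.tail = FALSE)
# 0.04978707
```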
The meaning of the codes for wash is: 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree. For butt these are 1 = yes, 2 = no.
Is there a significant association between the variables wash and butt at a significance level of #\alpha = 0.05#?
You can conduct the #X^2#-test for association in several steps.
Step 1: Formulate the hypotheses
- #H_0#: There is no association between the variables wash and butt.
- #H_A#: There is an association between the variables wash and butt.
Step 2: Calculate the observed frequencies
You can find the observed frequencies in a cross table. Visualise them with a mosaic plot.
observed <- table(WD$wash, WD$butt)
mosaicplot(observed, xlab = 'wash', ylab = 'butt')
There seems to be a relationship between the variables butt and wash: people who do not use a water-saving butt agree more often with washing their clothes after one wear. Let's test whether this relationship is significant.
Step 3: Calculate the expected frequencies
Under the null hypothesis the two variables are independent, and in that situation the joint probability of each cell equals the product of the two marginal probabilities. Multiplying this joint probability by the sample size gives the expected frequency for that cell.
Instead of calculating the ten products in our cross table separately, we will go for an easier solution: a function that does this multiplication of two vectors in one step: outer(). This function takes two vectors as input arguments, multiplies each element in the first vector with each element in the second vector, and places the results in a cross table. The following lines of code first create the required vectors (the 'marginal frequencies' for 'wash' and 'butt') and then apply the outer() function:
wash_Fr <- rowSums(observed)
butt_Fr <- colSums(observed)
joint_Fr <- outer(wash_Fr, butt_Fr)
A compact description of the functions in the lines above: 1) rowSums() calculates the marginal values (totals) of the rows in a data matrix or table, and colSums() does this for the columns; 2) outer() multiplies element m of the first vector with element n of the second vector and places this value at row m and column n of the output matrix. These functions are listed on the formula sheet that you can use at the exam.
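A small numeric example (with made-up vectors) shows what outer() returns:

```r
# Made-up marginal totals, just to illustrate outer()
a <- c(10, 20)
b <- c(3, 4, 5)
outer(a, b)
# Row m, column n holds a[m] * b[n]:
#      [,1] [,2] [,3]
# [1,]   30   40   50
# [2,]   60   80  100
```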
As a last step, we divide the joint frequencies by the total sample size to obtain the expected frequencies:
expected <- joint_Fr/sum(observed)
To make things a little bit more compact, we can also integrate the above steps into a 'one-liner':
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
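As a quick sanity check: the expected frequencies reproduce the marginal totals and the grand total of the observed table. A sketch with a made-up cross table (the WD data are not reproduced here):

```r
# Made-up 2x2 cross table, standing in for the observed table
observed <- matrix(c(30, 10, 20, 40), nrow = 2)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
rowSums(expected)  # same as rowSums(observed)
colSums(expected)  # same as colSums(observed)
sum(expected)      # same as sum(observed)
```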
Step 4: Check assumptions
Before the next steps we should check whether each cell has a large enough expected frequency (no more than #20\%# of the cells may have an expected frequency smaller than #5#).
expected
#2# out of #10# cells have an expected frequency lower than #5#. That is exactly #20\%#, so we can proceed, but we should regard the results with some caution.
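This check can also be done in code. With a hypothetical matrix of expected frequencies (made-up numbers, just to illustrate the check):

```r
# Hypothetical expected frequencies for a 5x2 cross table
expected <- matrix(c(12, 8, 4.5, 6, 3.2, 7, 9, 11, 5.5, 10), nrow = 5)
# Proportion of cells with an expected frequency below 5
mean(expected < 5)  # should be at most 0.20
# 0.2
```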
Step 5: Calculate #X^2#-statistic
The #X^2#-statistic is calculated by the following equation:
\[X^2=\sum_{\text{all categories}}{\dfrac{(\text{Observed}-\text{Expected})^2}{\text{Expected}}}\]
X2 <- sum((observed - expected)^2 / expected)
The #X^2#-statistic #= 7.715#.
Step 6: Calculate #p#-value
The #p#-value for the #\chi^2# distribution can be calculated using the pchisq() function. The degrees of freedom are in turn calculated with the equation #df = (r-1)(c-1)#:
df <- (nrow(observed)-1)*(ncol(observed)-1)
pchisq(X2, df, lower.tail = FALSE)
The #p#-value #= 0.1026#.
Step 7: Conclusion
Now you can compare the #p#-value with the significance level and draw the conclusion.
The #p#-value is larger than the significance level of #0.05#. We can therefore not reject #H_0#: there is no significant association between the variables wash and butt.
Fortunately, the #X^2#-test for association can also be conducted with the function chisq.test().
Apply chisq.test() to the variables wash and butt to test whether people who wash their clothes after one wear also tend not to use a water-saving butt. Do you get the same result?
chisq.test(WD$wash, WD$butt)
The #X^2#-statistic #= 7.715# and the #p#-value #= 0.1026#. Hence, this is exactly the same result as with the manual calculation.
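The object returned by chisq.test() also contains the intermediate quantities, so you can cross-check the manual calculation. A sketch with a made-up cross table (the WD data are not reproduced here; for a 2x2 table, correct = FALSE switches off the continuity correction so the result matches the basic formula):

```r
# Made-up cross table, standing in for table(WD$wash, WD$butt)
observed <- matrix(c(30, 10, 20, 40), nrow = 2)
result <- chisq.test(observed, correct = FALSE)
result$statistic  # the X^2-statistic
result$parameter  # the degrees of freedom
result$p.value    # the p-value
result$expected   # the expected frequencies
```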