Chi-square Test for Association

10. Categorical Association: Practical 10

Chi-square Test for Association

Objectives

Learn how to do the following in R

conduct and interpret a chi-square test for association among two categorical variables by using the basic formula's
conduct and interpret a chi-square test for association among two categorical variables by using chisq.test()

Instruction

Read through the text below
Execute code-examples and compare your results with what is explained in the text
Make the exercises and compare your answers with those in the examples
Time: 60 minutes

The Chi-square Test for Association

The Chi-Square Test for Association or Independence investigates whether two categorical variables are associated or dependent on each other. As a starting point for this test a cross table with frequencies for the two variables is used.

The hypotheses for this test are:

H0: The variables are independent
HA: The variables are not independent

For this hypothesis test, a $X^2$ -statistic is again calculated by the familiar formula:

$X^2=\sum_{\text{all categories}}{\dfrac{(\text{Observed}-\text{Expected})^2}{\text{Expected}}}$

In this case the observed and expected frequencies are calculated for each cell in a cross table. This $X^2$ -statistic follows a $\chi^2$ distribution with $(r-1)(c-1)$ degrees of freedom, where r and c are the number of rows and columns of the cross table.

Similar to the $\chi^2$ distribution in the goodness of fit test, the $\chi^2$ distribution here is only accurate if the sample size in each cell is large enough: not more than $20\%$ of the cells may have an expected frequency smaller than $5$ .

Similar to what we have seen in the previous section, the $p$ -value for the $\chi^2$ distribution can be calculated using the pchisq() function.

Let's say that you want to test whether there is an association between the variables unhappy and butt in the WD dataframe (i.e. you expect that people who are unhappy when they use less water tend not to use a water saving butt).

The meaning of the codes for unhappy were: 1 =strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree. For butt these were 1 = yes, 2 = no.

Is there an significant association between the variables unhappy and butt at a significance level of $\alpha = 0.1$ ?

You can conduct the $X^2$ -test for association in several steps.

Step 1: Formulate the hypotheses

$H_0$ : There is no association between the variables unhappy and butt.
$H_A$ : There is an association between the variables unhappy and butt.

Step 2: calculate the observed frequencies
You can find the observed frequencies in a cross table. Visualise them with a mosaic plot.

observed <- table(WD$ unhappy, WD$ butt)
mosaicplot(observed,xlab='unhappy', ylab='butt')

There seems to be a relationship between the variables butt and unhappy. The distribution over the 5 categories of unhappy is different for the two categories of butt. Let's test if this difference is significant.

Step 3: calculate the expected frequencies
The expected probabilities are in fact the joint probabilities of the two variables, that are not related (=not correlated). In that situation the joint probabilities can be calculated as the product of the separate probabilites.

Instead of calculating the six joint probabilities in our cross table separately, we will go for an easier solution: a function that is doing this multiplication of two vectors in one step: outer(). This function takes two vectors as input arguments and multiplies each element in the first vector with all elements in the second vector and places the results in a cross table. The following lines of code first create the required vectors (the 'marginal frequencies' for 'bath' and 'butt'), and then apply the outer() function:

unhappy_Fr <- rowSums(observed)
butt_Fr <- colSums(observed)
joint_Fr <- outer(unhappy_Fr, butt_Fr)

A compact description of the functions in the lines above: 1) rowSums is a function which calculates the marginal values (totals) of the rows in a data matrix or table, and colSums does this for the columns. 2) the outer function multiplies the first element m from the first vector with element n from the second vector and places this value at row m and column n in the output-matrix. These functions are listed at the formula-sheet that you can use at the exam.

As a last step, we turn the joint frequencies into joint probabilities (= the expected values):

expected <- joint_Fr/sum(observed)

To make things a little bit more compact, we can also integrate the above steps into a 'one-liner':

expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

Step 4: Check assumptions
Before the next steps we should check if each category has enough observations (not more than $20\%$ of the cells may have an expected frequency smaller than 5).

expected

$2$ out of $10$ cells have an expected frequency lower than $5$ . That is exactly $20$ %, we can proceed but we should regard the results with some caution.

Step 5: Calculate $X^2$ -statistic
The $X^2$ -statistic is calculated by the following equation:

$X^2=\sum_{\text{all categories}}{\dfrac{(\text{Observed}-\text{Expected})^2}{\text{Expected}}}$

X2 <- sum((observed-expected)^2/expected)

The $X^2$ -statistic $= 6.579$ .

Step 6: Calculate $p$ -value
The p-value for the $\chi^2$ distribution can be calculated using the pchisq() function. The degrees of freedom are in turn calculated with the equation $df = (r-1)(c-1)$ ,

df <- (nrow(observed)-1)*(ncol(observed)-1)
pchisq(X2, df, lower.tail = FALSE)

The $p$ -value $= 0.1599$

Step 7: Conclusion
Now you can compare the $p$ -value with the significance level and draw the conclusion.

The $p$ -value is larger than the significance level of $0.1$ . We can therefore not reject $H_0$ . There is no assocation between the variables unhappy and butt.

New example

Fortunately, also the $X^2$ -test for association can be conducted with the function chisq.test().

Let's apply the chisq.test() to the variables (wash and butt) to test wether people who wash clothes after one wear also tend not to use a water saving butt.

Do you get the same result?

chisq.test(WD$ wash, WD$ butt)

The $X^2$ -statistic $=7.715$ and the $p$ -value $= 0.1026$ . Hence, this is exactly the same result as with the manual calculation.

New example