10. Categorical Association: Chi-Square Test for Independence
Chi-Square Test for Independence: Test Statistic and p-value
Data for the Chi-Square Test for Independence
The observed frequency, denoted by #f_o#, is the number of individuals in the sample that are classified in a particular category.
The expected frequency, denoted by #f_e#, is the number of individuals that would be expected in a particular category if the null hypothesis were true.
The expected frequency of a cell is calculated with the following formula:
\[f_e = \cfrac{f_r \cdot f_c}{n}\]
where #f_r# is the frequency total for the row, #f_c# is the frequency total for the column, and #n# is the total sample size.
Calculating Expected Frequencies
Consider the following frequency distribution table:
Observed Frequencies
| | Apple | Banana | #\blue{\text{Total}}# |
| Extrovert | #\purple{\text{13}}# | #\purple{\text{37}}# | #\blue{\text{50}}# |
| Introvert | #\purple{\text{81}}# | #\purple{\text{97}}# | #\blue{\text{178}}# |
| #\orange{\text{Total}}# | #\orange{\text{94}}# | #\orange{\text{134}}# | 228 |
To calculate the expected frequencies, apply the following formula to each #\purple{\text{cell}}# in the table:
\[f_e = \cfrac{\blue{f_r} \cdot \orange{f_c}}{n}\]
where #\blue{f_r}# is the frequency total for the row, #\orange{f_c}# is the frequency total for the column, and #n# is the total sample size (here #n = 228#).
#\begin{array}{llcl}
\,\,\,\,\scriptsize{\bullet}&\,\,\normalsize{\text{Extrovert - Apple}}&:&\cfrac{\blue{50}\cdot \orange{94}}{228}=20.61\\
\,\,\,\,\scriptsize{\bullet}&\,\,\normalsize{\text{Extrovert - Banana}}&:&\cfrac{\blue{50}\cdot \orange{134}}{228}=29.39\\
\,\,\,\,\scriptsize{\bullet}&\,\,\normalsize{\text{Introvert - Apple}}&:&\cfrac{\blue{178}\cdot \orange{94}}{228}=73.39\\
\,\,\,\,\scriptsize{\bullet}&\,\,\normalsize{\text{Introvert - Banana}}&:&\cfrac{\blue{178}\cdot \orange{134}}{228}=104.61\\
\end{array}#
Expected Frequencies
| | Apple | Banana | Total |
| Extrovert | 20.61 | 29.39 | 50 |
| Introvert | 73.39 | 104.61 | 178 |
| Total | 94 | 134 | 228 |
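The calculation above can also be checked programmatically. The following is a minimal Python sketch, not part of the original material; the function name `expected_frequencies` and the nested-list layout of the table are assumptions made for illustration.

```python
def expected_frequencies(observed):
    """Apply f_e = (f_r * f_c) / n to every cell of a two-way table.

    `observed` is a list of rows, each row a list of observed counts
    (illustrative layout, without the Total row and column).
    """
    row_totals = [sum(row) for row in observed]        # f_r for each row
    col_totals = [sum(col) for col in zip(*observed)]  # f_c for each column
    n = sum(row_totals)                                # total sample size
    return [[f_r * f_c / n for f_c in col_totals] for f_r in row_totals]


# Rows: Extrovert, Introvert; columns: Apple, Banana
observed = [[13, 37], [81, 97]]
expected = expected_frequencies(observed)
for row in expected:
    print([round(f_e, 2) for f_e in row])
# [20.61, 29.39]
# [73.39, 104.61]
```

Note that each row and each column of expected frequencies adds up to the same total as in the observed table, which is a quick sanity check on the arithmetic.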
Once the expected frequencies have been calculated, the next step is to calculate the test statistic, which measures how much the observed frequencies differ from the frequencies expected under the null hypothesis.
Chi-Square Test Statistic and Distribution
The test statistic for the Chi-Square Test for Independence is denoted by #\chi^2# and is calculated with the following formula:
\[\chi^2=\sum_{\text{all cells}}{\dfrac{(\text{Observed}-\text{Expected})^2}{\text{Expected}}}=\sum_{\text{all cells}}{\dfrac{(f_o-f_e)^2}{f_e}}\]
Since each term in the sum is a squared difference divided by a positive expected frequency, the #\chi^2#-statistic can never be negative. It equals zero only when every observed frequency exactly matches its expected frequency.
Assuming the null hypothesis of the Chi-Square Test for Independence is true, the #\chi^2#-statistic will (approximately) follow a #\chi^2#-distribution with #df = (r -1)(c-1)# degrees of freedom, where #r# is the number of rows and #c# the number of columns.
Chi-square distributions are positively skewed, and the critical region is always located entirely in the right tail of the distribution.
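As an illustration of the formula and the degrees of freedom, the following Python sketch computes both for the Extrovert/Introvert fruit table used earlier. It is not part of the original material; the function name `chi_square_statistic` is hypothetical, and the expected frequencies are recomputed inside the function so that the block is self-contained.

```python
def chi_square_statistic(observed):
    """Return (chi_sq, df) for a two-way table of observed frequencies."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi_sq = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected frequency of the cell
            chi_sq += (f_o - f_e) ** 2 / f_e         # contribution of this cell

    df = (len(row_totals) - 1) * (len(col_totals) - 1)  # (r - 1)(c - 1)
    return chi_sq, df


chi_sq, df = chi_square_statistic([[13, 37], [81, 97]])
print(round(chi_sq, 2), df)
# 6.13 1
```

Because the expected frequencies are kept at full precision here, the result can differ slightly in the last decimal from a hand calculation that uses rounded expected frequencies.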
Calculating the p-value of a Chi-Square Test for Independence
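A Chi-Square Test for Independence is always a right-tailed test: the larger the discrepancies between the observed and expected frequencies, the larger the #\chi^2#-statistic, so only large values count as evidence against #H_0#. The #p#-value is therefore the area under the #\chi^2#-distribution to the right of the test statistic.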
To calculate the #p#-value of a Chi-Square Test for Independence in Excel, use the following command:
\[=1\text{ - }\text{CHISQ.DIST}(\chi^2, df, 1)\]
To calculate the #p#-value of a Chi-Square Test for Independence in R, use the following command:
\[\text{pchisq}(\chi^2, df, lower.tail=\text{FALSE})\]
where #df = (r \text{ - }1)(c\text{ - }1)#.
If #p \lt \alpha#, reject #H_0# and conclude #H_a#. Otherwise, do not reject #H_0#.
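For readers without Excel or R at hand, the right-tail probability can be sketched with the Python standard library alone. The function below is an illustrative assumption, not part of the original material: the name `chi2_right_tail` is hypothetical, and it relies on the standard recurrence for the upper regularized gamma function, starting from #e^{-x/2}# for even #df# and from the complementary error function for odd #df#. It is intended for the small integer #df# that typical contingency tables produce.

```python
import math


def chi2_right_tail(chi_sq, df):
    """P(X >= chi_sq) for X following a chi-square distribution with integer df."""
    if chi_sq < 0 or df < 1 or df != int(df):
        raise ValueError("chi_sq must be >= 0 and df a positive integer")
    half = chi_sq / 2
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-half)             # exact right tail for df = 2
    else:
        a, tail = 0.5, math.erfc(math.sqrt(half))  # exact right tail for df = 1
    # Each step raises the degrees of freedom by 2 until df is reached.
    while a < df / 2:
        tail += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1
    return tail


def reject_null(p_value, alpha):
    """Decision rule from the text: reject H_0 when p < alpha."""
    return p_value < alpha


# Sanity check: 3.841 is the familiar 5% critical value for df = 1.
p = chi2_right_tail(3.841, 1)
print(round(p, 3))
# 0.05
```

The result plays the same role as `pchisq(..., lower.tail = FALSE)` in R or `1 - CHISQ.DIST(..., 1)` in Excel; for real analyses, those established functions are the safer choice.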
In an effort to assess the impact of funding cuts on pre-school programs, school administrators in a US school district selected a simple random sample of #164# students in the seventh grade and determined whether or not each student had attended pre-school and whether each student was performing below, at, or above grade level in mathematics.
The distribution was organized in the following two-way frequency table:
| | Below grade level | At grade level | Above grade level | Total |
| Attended pre-school | 11 | 36 | 20 | 67 |
| No pre-school | 38 | 39 | 20 | 97 |
| Total | 49 | 75 | 40 | 164 |
The administrators plan to use a Chi-Square Test for Independence to determine whether pre-school attendance and mathematical ability are related to one another.
Calculate the #p#-value of the test and make a decision regarding #H_0#. Round your answer to #3# decimal places. Use the #\alpha = 0.03# significance level.
#p=0.007#
On the basis of this #p#-value, #H_0# should be rejected, because #p \lt \alpha#.
There are a number of different ways to calculate the #p#-value of the test. Solutions using Excel and R are given in turn.
Calculate the expected frequency of all cells in the table with the following formula:
\[f_e = \cfrac{f_r \cdot f_c}{n}\]
where #f_r# is the frequency total for the row, #f_c# is the frequency total for the column, and #n# is the total sample size.
| | Below grade level | At grade level | Above grade level | Total |
| Attended pre-school | 20.018 | 30.64 | 16.341 | 67 |
| No pre-school | 28.982 | 44.36 | 23.659 | 97 |
| Total | 49 | 75 | 40 | 164 |
Calculate the #\chi^2#-statistic:
\[\begin{array}{rcl}
\chi^2&=&\sum\limits_{\text{all cells}}{\dfrac{(f_o-f_e)^2}{f_e}}\\
&=& \cfrac{(11-20.018)^2}{20.018} +\cfrac{(36-30.64)^2}{30.64} +\cfrac{(20-16.341)^2}{16.341} +\cfrac{(38-28.982)^2}{28.982} +\\&&\cfrac{(39-44.36)^2}{44.36} +\cfrac{(20-23.659)^2}{23.659}\\
&=& 9.839
\end{array}\]
Determine the degrees of freedom:
\[df = (r -1)(c-1) = (2 -1 )(3 - 1)=2\]
To calculate the #p#-value of a #\chi^2#-test, make use of the following Excel function:
CHISQ.DIST(x, deg_freedom, cumulative)
- x: The value at which you wish to evaluate the distribution function.
- deg_freedom: An integer indicating the number of degrees of freedom.
- cumulative: A logical value that determines the form of the function.
  - TRUE: uses the cumulative distribution function, #\mathbb{P}(X \leq x)#
  - FALSE: uses the probability density function
A Chi-Square Test for Independence is always a right-tailed test, so the #p#-value is the area to the right of the test statistic. Thus, to calculate the #p#-value of the test, run the following command:
\[=1\text{ - }\text{CHISQ.DIST}(\chi^2,(r \text{ - }1)(c\text{ - }1), 1)\\
\downarrow\\
=1\text{ - }\text{CHISQ.DIST}(9.839, 2, 1)\]
This gives:
\[p = 0.007\]
Since #p \lt \alpha#, the null hypothesis of independence should be rejected.
To calculate the #p#-value of a #\chi^2#-test, make use of the following R function:
pchisq(q, df, lower.tail)
- q: The value at which you wish to evaluate the distribution function.
- df: An integer indicating the number of degrees of freedom.
- lower.tail: If TRUE (default), probabilities are #\mathbb{P}(X \leq x)#, otherwise, #\mathbb{P}(X \gt x)#.
A Chi-Square Test for Independence is always a right-tailed test, so the #p#-value is the area to the right of the test statistic. Thus, to calculate the #p#-value of the test, run the following command:
\[\text{pchisq}(q = \chi^2, df = (r \text{ - }1)(c\text{ - }1), lower.tail=\text{FALSE})\\
\downarrow\\
\text{pchisq}(q = 9.839, df = 2, lower.tail=\text{FALSE})\]
This gives:
\[p = 0.007\]
Since #p \lt \alpha#, the null hypothesis of independence should be rejected.
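As an independent cross-check of the worked solution, the whole calculation for the pre-school table can be sketched in a few lines of Python. This block is an illustrative addition, not part of the original exercise. It relies on the fact that for #df = 2# the right-tail area of the #\chi^2#-distribution has the closed form #e^{-\chi^2/2}#, so this shortcut applies only to tables with #df = 2#, such as this #2 \times 3# table.

```python
import math

# Rows: attended pre-school, no pre-school
# Columns: below, at, above grade level
observed = [[11, 36, 20], [38, 39, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (f_o - f_e)^2 / f_e over all cells, with f_e = f_r * f_c / n
chi_sq = sum(
    (f_o - f_r * f_c / n) ** 2 / (f_r * f_c / n)
    for row, f_r in zip(observed, row_totals)
    for f_o, f_c in zip(row, col_totals)
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)
assert df == 2  # the closed form below is valid for df = 2 only

p_value = math.exp(-chi_sq / 2)  # right-tail area for df = 2
alpha = 0.03

print(round(chi_sq, 3), df, round(p_value, 3))
# 9.839 2 0.007
print("reject H_0" if p_value < alpha else "do not reject H_0")
# reject H_0
```

The output agrees with the Excel and R results above: #\chi^2 = 9.839#, #df = 2#, and #p = 0.007#.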