Testing for Differences Between Proportions

7. Hypothesis Testing: Practical 7

Hypothesis Testing on proportions in R

Hypothesis tests on proportions can be used to test whether a proportion (the probability of success) is equal to (or higher/lower than) a critical value. The R command to conduct this test is prop.test().

Let's take a look at the documentation for prop.test():

?prop.test

The arguments are similar to those of t.test(), but now you have to specify two parameters for your observations: the number of successes (x) and the total number of observations (n). The most important arguments are:

x: number of successes (total number of TRUEs). This can also be a one-dimensional table with two entries, in which case the first entry is taken as the number of successes.
n: number of trials (total number of observations).
p: critical value you are testing against, i.e. the theoretical probability of success.
alternative: specify whether you want a two-sided, left-tailed or right-tailed proportion test with the values "two.sided", "less" or "greater".
conf.level: confidence level of the interval; the default value is $0.95$.
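
As a quick illustration of how these arguments fit together, here is a minimal sketch with made-up counts (12 successes in 100 trials, tested against a hypothetical proportion of 0.1). The numbers are assumptions for illustration only and are not taken from the air quality data.

```r
# Made-up example: 12 successes out of 100 trials (illustrative numbers only)
res <- prop.test(x = 12, n = 100, p = 0.1,
                 alternative = "greater", conf.level = 0.95)

# The result is a list of class "htest"; its parts can be accessed by name
res$estimate   # sample proportion x / n
res$p.value    # p-value of the test
res$conf.int   # one-sided confidence interval for the proportion
```

Storing the result in an object (here called res) is optional, but it makes it easy to reuse individual numbers later instead of reading them off the printed output.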

One sample proportion test

Let's now apply the prop.test() command to the air quality data of Amsterdam.

Recall that, according to EU regulations, the daily average PM10 concentration is allowed to exceed a threshold value of $50$ μg/m³ on at most $35$ days a year. We can use prop.test() to make this evaluation based on a sample (e.g. if the PM10 concentration is not measured every day, or not in every year at every location).

Use the data from 2018 as a sample to test whether the proportion of peak days at Amsterdam-Stadhouderskade exceeds the allowed proportion in general. The critical value to test against is then the proportion $35/365 = 0.0959$. The hypotheses are:

$H_0$ : $\pi \leq 0.0959$
$H_a$ : $\pi > 0.0959$

We will go through the process step-by-step.

1) Create a dataframe with only the measurements for the PM10 concentration at the Amsterdam-Stadhouderskade in 2018.

The syntax is a little different from what you have seen so far, because the dates are stored in a special class: POSIXct. This class is very handy, because it allows you to select dates based on only the year, even though you did not store the year separately. This means you don't need to create a vector of all the dates in 2018! To extract part of a POSIXct date, you specify which part you want with format(PM10_shk$date, "%Y"). "%Y" indicates that you want the year, "%m" the month and "%d" the day. The result is a character vector. Try this:

format(PM10_shk$date, "%Y")

Now you can use this syntax to select all rows with the year "2018".

PM10_shk_2018 <- PM10_shk[format(PM10_shk$date, "%Y") == "2018", ]
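
To see what this selection does on a small scale, here is a sketch using an invented three-row data frame. The object toy and its values are hypothetical; only the column names date and value mirror PM10_shk.

```r
# Hypothetical mini data set; the values are invented for illustration
toy <- data.frame(
  date  = as.POSIXct(c("2017-12-31 12:00", "2018-01-01 12:00", "2018-06-15 12:00"),
                     tz = "UTC"),
  value = c(20, 35, 55)
)

# format() turns each date-time into a character string
years <- format(toy$date, "%Y")   # "2017" "2018" "2018"

# Comparing with the string "2018" gives a logical vector that selects rows
toy_2018 <- toy[years == "2018", ]
nrow(toy_2018)                    # 2
```

Note that the comparison is made with the string "2018" and not the number 2018, because format() returns text.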

2) Count the number of successes (x) and the total number of observations (n).

We start by counting the total number of observations. This can be done with the command nrow(), as each row in our dataframe is one observation.

n_PM10_shk_2018 <- nrow(PM10_shk_2018)

Now we count the number of successes. In this case we regard the days on which the concentration exceeds $50$ as the successes. You thus have to select all rows with a 'value' larger than $50$ and subsequently count the number of 'successes'.

PM10_shk_2018_success <- PM10_shk_2018[PM10_shk_2018$value > 50,]
x_PM10_shk_2018 <- nrow(PM10_shk_2018_success)
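
One thing to watch out for: the text does not say whether the measurements contain missing values (NA). If they do, selecting rows with a logical condition keeps a row of NAs for every missing value, and nrow() then counts it as a success. The sketch below, with an invented data frame toy, shows the effect and one possible NA-safe way of counting; treat it as an illustration rather than a required change.

```r
# Invented measurements with one missing value (NA), for illustration only
toy <- data.frame(day = 1:5, value = c(12, 60, NA, 48, 75))

# Row selection keeps an all-NA row for the missing value, so nrow() overcounts
nrow(toy[toy$value > 50, ])                  # 3, although only 2 values exceed 50

# Counting the TRUEs directly while ignoring missing values
x_toy <- sum(toy$value > 50, na.rm = TRUE)   # 2 successes
n_toy <- sum(!is.na(toy$value))              # 4 valid observations
```

If the data contain no missing values, both approaches give the same counts.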

3) Perform the proportions test using the prop.test() function.

You should fill in the following arguments:

x: number of successes: x = x_PM10_shk_2018
n: number of observations: n = n_PM10_shk_2018
p: theoretical proportion: p = 35/365
alternative: for a right-sided test: alternative = "greater".
conf.level: confidence level of $0.95$ (significance level of $0.05$): use the default value.

prop.test(x = x_PM10_shk_2018, n = n_PM10_shk_2018, p = 35/365, alternative = "greater")

This gives the following result:

	1-sample proportions test with continuity correction

data:  x_PM10_shk_2018 out of n_PM10_shk_2018, null probability 35/365
X-squared = 22.082, df = 1, p-value = 1
alternative hypothesis: true p is greater than 0.09589041
95 percent confidence interval:
 0.009962525 1.000000000
sample estimates:
         p 
0.02017291

The resulting p-value from this test is reported as $1$ (rounded).
The output is similar to that of t.test(). The proportion of successes in the sample ($0.0202$) is lower than the critical value ($0.0959$), and the p-value is large ($p > \alpha$). Therefore, we cannot reject the null hypothesis that the proportion is smaller than or equal to the critical value. Good news for Amsterdam!
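
To make the numbers in the output less of a black box, the test statistic and p-value can be recomputed by hand. The sketch below assumes counts of x = 7 successes out of n = 347 observations. These counts are not stated in the text; they are inferred because 7/347 matches the sample estimate 0.02017291 in the output.

```r
# Assumed counts: not given in the text, chosen because 7 / 347 = 0.02017291
x  <- 7
n  <- 347
p0 <- 35 / 365

# Chi-squared statistic with continuity correction (subtract 0.5 from |x - n * p0|)
x_squared <- (abs(x - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))

# One-sided p-value: signed square root of the statistic, upper tail of the normal
z <- sign(x / n - p0) * sqrt(x_squared)
p_value <- pnorm(z, lower.tail = FALSE)

round(x_squared, 3)   # 22.082
p_value               # very close to 1, which R prints as "p-value = 1"

# Compare with prop.test() on the same assumed counts
res <- prop.test(x = x, n = n, p = p0, alternative = "greater")
all.equal(unname(res$statistic), x_squared)
all.equal(res$p.value, p_value)
```

Because the sample proportion lies far below the critical value, z is strongly negative, so almost all of the normal distribution lies above it. That is why a right-sided test gives a p-value of nearly 1 here.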

New example