6. Parameter Estimation and Confidence Intervals: Practical 6

Recap: Sample Versus Population

If you have access to data about an entire population, say all the green roofs in Amsterdam, it’s straightforward to answer questions like “What is the typical surface area of a green roof in Amsterdam?” and “How much variation is there in the surface areas of different roofs?” accurately. However, if you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical surface area if you only know the sizes of, for instance, $20$ roofs? This sort of situation requires that you use your sample to estimate (infer) what your population looks like. We explored the properties of samples and sample statistics in the previous practical.

Let's review some of the sampling steps at the start of this lab. We will consider the variable V$capaciteit, which measures the amount of water (in mm) that a roof can retain.

Before we start taking samples of the capacity variable, let's first have a look at the data.

summary(V$capaciteit)
hist(V$capaciteit)

As you can see, the values range from $20$ to $85$ mm, and the distribution of the capacity is (strongly) right-skewed.

Now let's assume that this data set covers all roofs of this type in Amsterdam, so that we have data for the entire population at our disposal.

The municipality of Amsterdam provides subsidies to people who want to install a green roof. It keeps these records, among other things, to evaluate the net effect of the subsidy on rainfall retention and on the reduction of peak discharges. However, visiting individual roofs to measure values like water storage capacity is costly. So, rather than measuring it for the entire population, it is more cost-effective to take a random sample and measure the capacity only for the roofs in that sample.

Let's assume the municipality would take a simple random sample of size $30$ from the population. We can simulate what happens in this case with the sample() function. Recall that the command set.seed() ensures reproducibility of your result (i.e. the sample will be identical if you run the sample() command again with the same seed).

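# Treat the full data set as the population
population <- V$capaciteit
# Fix the random seed so the sample is reproducible
set.seed(1)
# Draw a simple random sample of 30 roofs (without replacement)
samp <- sample(population, 30)
hist(samp)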
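To see what set.seed() does in practice, the following minimal sketch may help. It does not use the real data: toy_population is a hypothetical, simulated stand-in for V$capaciteit (right-skewed, clipped to the $20$ to $85$ mm range), and the names samp_a, samp_b and samp_c are made up for this illustration.

```r
# Hypothetical stand-in for V$capaciteit (NOT the real data):
# 500 right-skewed capacities, clipped to the 20-85 mm range.
set.seed(42)
toy_population <- pmin(20 + rexp(500, rate = 1 / 12), 85)

# Same seed before each call: the two samples are identical
set.seed(1)
samp_a <- sample(toy_population, 30)
set.seed(1)
samp_b <- sample(toy_population, 30)
identical(samp_a, samp_b)

# A different seed gives a different sample of the same population
set.seed(2)
samp_c <- sample(toy_population, 30)
identical(samp_a, samp_c)
```

The first comparison returns TRUE because the random number generator is reset to the same state before each draw. With a different seed, the chance of drawing exactly the same $30$ roofs in the same order is negligible.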

What is the average of your sample? Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

The average of the sample is $36.87$ mm.

You can calculate the average with the following command:

sample_mean <- mean(samp)
sample_mean

Another student's distribution will not be identical, because the random sampling process leads to a different sample each time. (The only exception is when the other student uses the same seed.) It should be similar, though, because both samples are drawn from the same population.
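As a rough illustration of this answer, the sketch below mimics ten students who each draw their own sample of size $30$. It assumes a simulated stand-in population (toy_population, not the real V$capaciteit data), and the seeds $1$ to $10$ are arbitrary choices, one per hypothetical student.

```r
# Hypothetical stand-in population (NOT the real V$capaciteit data)
set.seed(42)
toy_population <- pmin(20 + rexp(500, rate = 1 / 12), 85)

# Ten hypothetical students, each drawing a sample with their own seed
student_means <- sapply(1:10, function(seed) {
  set.seed(seed)
  mean(sample(toy_population, 30))
})

round(student_means, 2)  # one sample mean per student
range(student_means)     # spread of the sample means
mean(toy_population)     # the population mean they all estimate
```

The sample means differ from student to student, yet they all estimate the same population mean, which is why the samples look similar but not identical.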
