5. Sampling: Practical 5b
Construct the Sampling Distribution of any Statistic
In the previous practical, you learned how to draw a simple random sample from a population and how to calculate statistics based on that sample. You saw that each time you drew a sample, the statistic changed because the sample contained different individuals. The larger the sample, the closer the sample statistic tended to be to the corresponding population value.
Look at the following code. You can run these lines 10 times and see how the sample and the sample median differ in every draw.
zs <- sample(z, 50)
hist(zs)
median(zs)
It would be interesting to see what happens if you run the lines 5000 times. Repeating the code 'manually' that often is of course not feasible, but this is where a language like R becomes very useful. One way to repeat such calculations and store the result each time is with a for-loop. The for-loop is briefly introduced here (feel free to skip this section if you are already familiar with for-loops).
For loops
A for-loop is designed to execute code as many times as you want, without having to type out every iteration. The for-loop is built with the for(){ } command. Between the round brackets you specify the variable that will take on different values and which values it should take; between the curly brackets you write what has to be done.
for (i in 1:15){
print(i)
}
In the above loop, the variable i is first assigned the value 1, and this value is printed. Next, i is assigned the value 2, which is printed in turn. This repeats until the last value (here 15) has been used, and then the for-loop ends.
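The values that the loop variable takes do not have to be a sequence such as 1:15; any vector can be used. A minimal sketch, in which the vector values and the variable total are made-up names for illustration and not part of the practical:

```r
# hypothetical example values, not part of the practical's data
values <- c(2, 5, 9)
total <- 0

for (v in values){
  total <- total + v   # add the current value to the running total
  print(total)         # show the running total after each iteration
}
```

Here v takes the values 2, 5 and 9 in turn, and total keeps its value from one iteration to the next because it was created before the loop started.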
You can also make a loop where the variable i is not used within the curly brackets.
for (i in 1:5){
zs <- sample(z, 1)
print(zs)
}
However, this code does not save your samples; it just overwrites zs in every iteration. To avoid this, you can create an empty container before you start the loop. For example, if you want to save 10 single numbers, you can create a vector of NA values using the rep() command. The rep() command simply replicates a value as many times as you need. Check the help for rep() if you have never used it before. In the for-loop you can index the empty vector, as you would do with a data frame or matrix.
# create vector with NAs
n_samples <- 1
n_times <- 8
zs <- rep(NA, n_times)
# run for loop
for (i in 1:n_times){
zs[i] <- sample(z, n_samples)
}
# print result
zs
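The vector above works because every sample consists of a single number. If each sample held several values, one option would be a matrix with one row per sample, indexed in the same way. A minimal sketch, assuming a stand-in population called z_demo (a hypothetical name and distribution; in the practical you would use your own z) and a made-up container name zs_mat:

```r
set.seed(123)                               # makes this sketch reproducible
z_demo <- rnorm(10000, mean = 50, sd = 10)  # hypothetical stand-in for the population z

n_samples <- 3   # number of values in each sample
n_times <- 8     # number of samples to draw

# empty container: one row per sample, one column per sampled value
zs_mat <- matrix(NA, nrow = n_times, ncol = n_samples)

for (i in 1:n_times){
  zs_mat[i, ] <- sample(z_demo, n_samples)  # fill row i with the i-th sample
}

# print result
zs_mat
```

Leaving the column index empty in zs_mat[i, ] selects the whole row, so each iteration fills one complete row of the container.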
Sampling distribution
Let's now use the for-loop to calculate the median of many samples.
Create a for-loop that calculates and saves, 1500 times, the median of a simple random sample from the simulated population z. Take n = 100 for the sample size.
Show the result in a histogram, and compare the mean of the sample medians to the population median.
# create vector with NAs
n_samples <- 100
n_times <- 1500
zs_med <- rep(NA, n_times)
# run for loop
for (i in 1:n_times){
zs_med[i] <- median( sample(z, n_samples) )
}
hist(zs_med)
mean(zs_med) # mean of the sample medians
median(z) # population median
The constructed distribution is an approximation of the sampling distribution of the median. The sampling distribution of a sample statistic is the probability distribution of that statistic. In other words, it is the distribution of the sample statistic that you would obtain if you were to draw samples of a particular size from the population endlessly. The sampling distribution of the median thus gives you insight into the probability that your sample has a certain median. It is important to keep in mind that every statistic has a sampling distribution.
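These last two points can be illustrated with a short sketch. It approximates the probability that a sample median lies within 1 unit of the population median, and it builds the approximate sampling distribution of a second statistic, the standard deviation, in the same loop. The population z_demo is a hypothetical stand-in (use your own z instead), the names zs_sd and p_close are made up for this example, and the width of 1 unit is an arbitrary choice.

```r
set.seed(123)                               # makes this sketch reproducible
z_demo <- rnorm(10000, mean = 50, sd = 10)  # hypothetical stand-in for the population z

n_samples <- 100
n_times <- 1500
zs_med <- rep(NA, n_times)   # container for the sample medians
zs_sd <- rep(NA, n_times)    # container for the sample standard deviations

for (i in 1:n_times){
  zs <- sample(z_demo, n_samples)  # draw one sample and reuse it for both statistics
  zs_med[i] <- median(zs)
  zs_sd[i] <- sd(zs)
}

# proportion of sample medians within 1 unit of the population median
p_close <- mean(abs(zs_med - median(z_demo)) < 1)
p_close

# approximate sampling distribution of the standard deviation
hist(zs_sd)
```

The proportion p_close is only an approximation of the probability, because it is based on a finite number of samples; increasing n_times should make it more stable.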