Cluster Sampling

5. Sampling: Practical 5a

One-stage cluster sampling is the last sampling method we will try on this dataset.

Cluster sampling divides the population into a number of subgroups, called clusters. Next, simple random sampling is used to select one or more of these clusters. In the last step, all observations in the selected clusters are sampled. It is possible to add another step to one-stage cluster sampling: instead of taking every observation in the selected clusters, you draw a simple random sample within each of them. This is called two-stage cluster sampling, the simplest form of multistage cluster sampling.
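The difference between one-stage and two-stage cluster sampling can be sketched on a small made-up population. The data frame pop, its columns cluster and value, and the sample sizes are assumptions chosen for illustration only; they are not part of the BCI data.

```r
# Toy population: 6 clusters ("A".."F") with 5 observations each (hypothetical data).
set.seed(1)
pop <- data.frame(
  cluster = rep(LETTERS[1:6], each = 5),
  value   = rnorm(30),
  stringsAsFactors = FALSE
)

# One-stage: pick 2 clusters at random and keep every observation in them.
chosen    <- sample(unique(pop$cluster), 2)
one_stage <- pop[pop$cluster %in% chosen, ]

# Two-stage: within each chosen cluster, keep a simple random sample of 3 rows.
rows_by_cluster <- split(seq_len(nrow(one_stage)), one_stage$cluster)
keep <- unlist(lapply(rows_by_cluster, function(idx) {
  idx[sample.int(length(idx), 3)]
}))
two_stage <- one_stage[keep, ]

nrow(one_stage)  # 2 clusters x 5 observations
nrow(two_stage)  # 2 clusters x 3 observations
```

The one-stage sample keeps all 10 rows of the two chosen clusters, while the two-stage sample keeps only 6 of them.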

In stratified sampling the strata are internally homogeneous. Clusters, by contrast, are ideally internally heterogeneous, so that each cluster resembles the population as a whole, which is why you can get away with sampling just a few clusters. Spatial units are often used as clusters, for example neighborhoods in a city. For the BCI data we use the quadrats.

You can find an example of the code to perform cluster sampling in R below.
First you need to get the unique values for the quadrats. The names of the quadrats are sampled, because they correspond to the quadrat identifiers in the BCI data frame. For a single quadrat we could select its rows with a logical comparison (e.g. BCI[BCI$quadrat == "2219", ]). However, since we sample multiple clusters, a faster solution is the %in% operator. %in% checks for every row in the data frame whether its quadrat name is one of the sampled clusters and returns TRUE or FALSE. When you use that as the row index in BCI[row, column], only the rows belonging to the sampled quadrats are returned and your sample is complete!
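To see why %in% is the right tool when several clusters are selected, here is a minimal sketch on a made-up data frame. The object toy, its quadrat labels, and its dbh values are invented for illustration and do not come from BCI.

```r
# Hypothetical mini data frame; the quadrat labels and dbh values are made up.
toy <- data.frame(
  quadrat = c("0001", "0001", "0002", "0003", "0003", "0004"),
  dbh     = c(12, 35, 20, 48, 15, 90),
  stringsAsFactors = FALSE
)

# One cluster: a plain equality test is enough.
toy[toy$quadrat == "0003", ]

# Several clusters: %in% returns one TRUE/FALSE per row.
picked  <- c("0001", "0003")
in_pick <- toy$quadrat %in% picked
in_pick
# TRUE TRUE FALSE TRUE TRUE FALSE

# Pitfall: == with a vector recycles `picked` element by element,
# so it silently misses rows instead of testing membership.
wrong <- toy$quadrat == picked
sum(wrong)    # 2 rows matched
sum(in_pick)  # 4 rows matched

# Use the logical vector as the row index; leave the column index empty.
toy_sample <- toy[in_pick, ]
```

Note the trailing comma in toy[in_pick, ]: an empty column index keeps all columns.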

Draw a one-stage cluster sample of $10$ quadrats. Use the seed $49$ to get reproducible results.

What are the mean and standard deviation of the variable dbh? Round your answers to $3$ decimal places.

mean = $48.301$
sd = $80.505$

Cluster sampling is done in a few steps:

Step 1: get the unique identifiers for the quadrats.
You can use the unique() function to get all unique quadrat 'names'.

quadrats <- unique(BCI$quadrat)

Step 2: sample $10$ quadrats
Use simple random sampling to get $10$ random clusters. Use the seed $49$ to get reproducible results.

set.seed(49)
cluster_sample <- sample(quadrats, 10)
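As a quick check of what the seed does, the sketch below draws twice from a set of made-up quadrat labels (the vector ids is hypothetical, not the real BCI identifiers). It illustrates two properties of this step: the same seed gives the same clusters, and sample() draws without replacement by default, so no quadrat is picked twice.

```r
# Hypothetical quadrat labels "0001".."0050", for illustration only.
ids <- sprintf("%04d", 1:50)

set.seed(49)
first_draw <- sample(ids, 10)

set.seed(49)
second_draw <- sample(ids, 10)

identical(first_draw, second_draw)  # TRUE: same seed, same clusters
anyDuplicated(first_draw) == 0      # TRUE: no cluster is drawn twice
```

Keep in mind that the default algorithm behind sample() changed in R 3.6.0, so the same seed can select different clusters on older R versions.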

Step 3: get the observations from the $10$ sampled quadrats

As stated before, you can use the %in% operator to select the rows that belong to the $10$ clusters.

samp_10q <- BCI[BCI$quadrat %in% cluster_sample,]

Step 4: calculate the mean and standard deviation.
Finally, you can calculate the mean and standard deviation of dbh.

mean(samp_10q$dbh)
sd(samp_10q$dbh)
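The four steps can also be wrapped in one small function, sketched below. The function cluster_summary and its argument names are hypothetical, and toy_bci is a randomly generated stand-in with the same two column names, so its output will not match the answers for the real BCI data.

```r
# Hypothetical helper combining steps 1 to 4; all names are illustrative.
# Note that it resets the random seed as a side effect.
cluster_summary <- function(data, cluster_col, value_col, n_clusters, seed) {
  ids <- unique(data[[cluster_col]])                  # step 1: unique cluster ids
  set.seed(seed)
  chosen <- sample(ids, n_clusters)                   # step 2: sample clusters
  samp <- data[data[[cluster_col]] %in% chosen, ]     # step 3: keep their rows
  c(mean = round(mean(samp[[value_col]]), 3),         # step 4: summarise,
    sd   = round(sd(samp[[value_col]]), 3))           #         rounded to 3 decimals
}

# Toy stand-in for BCI: 20 made-up quadrats with 8 trees each.
set.seed(1)
toy_bci <- data.frame(
  quadrat = rep(sprintf("%04d", 1:20), each = 8),
  dbh     = round(rexp(160, rate = 1 / 50), 1),
  stringsAsFactors = FALSE
)

res <- cluster_summary(toy_bci, "quadrat", "dbh", n_clusters = 10, seed = 49)
res
```

Assuming the BCI data frame is loaded, the equivalent call would be cluster_summary(BCI, "quadrat", "dbh", n_clusters = 10, seed = 49).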
