Stratified Sampling

5. Sampling: Practical 5a

In the previous section, you took simple random samples of trees in the BCI dataset. However, you did not have any control over how many trees from each species or each quadrat got sampled. The goal of stratified sampling is to have control over the number of trees sampled from a pre-specified group within your data, a so-called stratum. The individuals within a stratum have at least one characteristic in common.

Suppose we want to stratify the sample by tree size, based on dbh. The following code achieves this: it creates a new column (sizecl) in the BCI data frame, initialises it with NA, and then assigns one of four codes, from 1 for small trees (log10(BCI$dbh) < 1.25) up to 4 for large trees (log10(BCI$dbh) >= 2).

BCI$sizecl <- NA
BCI$sizecl[ log10(BCI$dbh)< 1.25 ] <- 1 
BCI$sizecl[ log10(BCI$dbh)>= 1.25 & log10(BCI$dbh)< 1.5 ] <- 2 
BCI$sizecl[ log10(BCI$dbh)>= 1.5 & log10(BCI$dbh)< 2 ] <- 3
BCI$sizecl[ log10(BCI$dbh)>= 2 ] <- 4
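
As an aside, the same four-way classification can be written in a single call to cut(). The sketch below uses a small made-up vector of dbh values (an assumption for illustration, not the BCI data) and the same class boundaries as above.

```r
# Hypothetical dbh values, used only to illustrate the idea
dbh <- c(10, 15, 20, 40, 50, 120, 250)

# right = FALSE makes each interval closed on the left: [lower, upper)
# labels = FALSE returns the integer codes 1 to 4 instead of a factor
sizecl <- cut(log10(dbh),
              breaks = c(-Inf, 1.25, 1.5, 2, Inf),
              right = FALSE, labels = FALSE)
sizecl
```

Because the intervals are closed on the left, a value exactly on a boundary falls into the higher class, which matches the >= comparisons in the code above.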

A frequency table shows that the distribution over the four strata within the population is approximately $3:3:3:1$.

table(BCI$sizecl)

In stratified sampling, we take a sample from each stratum. Within each stratum the sampling is random, and here the size of each stratum's sample is proportional to the size of that stratum. So for every $10$ individuals in the total sample, we should take $3$ from stratum $1$, $3$ from stratum $2$, $3$ from stratum $3$ and $1$ from stratum $4$.

Apply stratified sampling to the size of the trees (use the variable sizecl you just created) to take an overall sample of $100$ individuals. Calculate the sample mean of the above-ground biomass (BCI$agb) and round your answer to $3$ decimal places.

You can approach this problem in a few steps.

Step 1: divide your population into four strata based on the new variable sizecl

class1_plants <- BCI[BCI$sizecl == 1,] 
class2_plants <- BCI[BCI$sizecl == 2,] 
class3_plants <- BCI[BCI$sizecl == 3,] 
class4_plants <- BCI[BCI$sizecl == 4,]

Step 2: decide the sample size per stratum

The distribution over the four strata in the population is approximately $3:3:3:1$ and you want your final sample of $100$ plants to have this same distribution. This means you have to sample $30$ plants from sizecl = 1, $30$ from sizecl = 2, $30$ from sizecl = 3, and $10$ from sizecl = 4.
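
The arithmetic behind these numbers can be sketched as follows. The stratum counts here are hypothetical round numbers in an exact 3:3:3:1 ratio, not the real BCI counts; with the real data you could start from table(BCI$sizecl) instead.

```r
# Hypothetical stratum counts (assumed for illustration only)
stratum_counts <- c("1" = 3000, "2" = 3000, "3" = 3000, "4" = 1000)
total_n <- 100

# Share of the population that falls in each stratum
stratum_share <- stratum_counts / sum(stratum_counts)

# Proportional allocation, rounded to whole individuals
n_per_stratum <- round(total_n * stratum_share)
n_per_stratum
```

With less convenient proportions, the rounded sample sizes may not add up exactly to the intended total, so it is worth checking sum(n_per_stratum) before sampling.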

Step 3: take a simple random sample from each stratum

Now that you have your four strata, the next step is to take a simple random sample from each of them. We will use a seed of $46$ so that everybody gets the same results.

set.seed(46)
class1_sample_row_numbers <- sample(nrow(class1_plants), 30) 
class1_sample <- class1_plants[class1_sample_row_numbers,]

set.seed(46)
class2_sample_row_numbers <- sample(nrow(class2_plants), 30) 
class2_sample <- class2_plants[class2_sample_row_numbers,]

set.seed(46)
class3_sample_row_numbers <- sample(nrow(class3_plants), 30) 
class3_sample <- class3_plants[class3_sample_row_numbers,]

set.seed(46)
class4_sample_row_numbers <- sample(nrow(class4_plants), 10) 
class4_sample <- class4_plants[class4_sample_row_numbers,]
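
The effect of set.seed() can be checked with a small sketch that is independent of the BCI data: resetting the seed to the same value before a call to sample() reproduces the same draw.

```r
# Same seed, same call: the two draws are identical
set.seed(46)
first_draw <- sample(1000, 5)

set.seed(46)
second_draw <- sample(1000, 5)

identical(first_draw, second_draw)

# Without resetting the seed, the next draw continues the random sequence
third_draw <- sample(1000, 5)
```

This is why the code above sets the seed before each of the four draws: every stratum sample can then be reproduced on its own.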

Step 4: Combine the 4 samples into one

Now that you have your $4$ samples, you have to combine them into one. You can use the function rbind() for this. rbind() combines data frames by rows, so you get one data frame with 100 rows.

final_sample <- rbind(class1_sample, class2_sample, class3_sample, class4_sample)
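
Steps 1, 3 and 4 can also be written more compactly with split(), Map() and do.call(). The sketch below is one possible alternative, shown on a made-up toy data frame (toy, with hypothetical agb values) rather than on BCI. It sets the seed only once, so it would not reproduce the exact samples drawn above.

```r
# Toy population standing in for BCI (hypothetical values)
set.seed(1)
toy <- data.frame(sizecl = rep(1:4, times = c(30, 30, 30, 10)),
                  agb = runif(100))

# Number of individuals to draw from each stratum, in stratum order
n_per_stratum <- c("1" = 3, "2" = 3, "3" = 3, "4" = 1)

# split() returns one data frame per stratum, ordered by stratum code
strata <- split(toy, toy$sizecl)

# Draw a simple random sample of the requested size within one stratum
draw <- function(stratum, n) stratum[sample(nrow(stratum), n), ]
samples <- Map(draw, strata, n_per_stratum)

# Stack the per-stratum samples into a single data frame
toy_sample <- do.call(rbind, samples)
table(toy_sample$sizecl)
```

The final table() call is a quick check that the combined sample has the intended number of individuals per stratum.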

Step 5: Calculate the mean of agb

mean(final_sample$agb)

The mean agb of this stratified sample is thus $0.032$.
