Measures of Central Tendency

1. Descriptive Statistics: Practical 1


R provides the commands mean() and median() to calculate these statistics from data. If you want to visualise the complete distribution, you can plot the frequency distribution with the command hist(). This creates a histogram in the lower-right pane (the Plots tab).

Let's have a look at these functions.
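As a quick warm-up before turning to the gapminder data, the sketch below applies both functions to small made-up vectors (x and y are hypothetical and not part of G). It illustrates that a single extreme value pulls the mean upwards while the median stays where it is.

```r
# Hypothetical toy vector, not taken from the gapminder data
x <- c(2, 4, 4, 7, 13)

mean(x)    # arithmetic mean: (2 + 4 + 4 + 7 + 13) / 5 = 6
median(x)  # middle value of the sorted vector: 4

# Same vector, but with the largest value replaced by an extreme one
y <- c(2, 4, 4, 7, 130)

mean(y)    # 147 / 5 = 29.4
median(y)  # still 4
```

This difference is worth keeping in mind for skewed variables, where the mean and the median can lie far apart.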

Calculate the median of gdpPercap using the gapminder data (dataframe G) and visualise the frequency distribution. Subsequently calculate the median of gdpPercap only for 1972.

The median of gdpPercap for the whole dataframe = 3531.85
The median of gdpPercap for 1972 = 3339.13

Calculating the median of gdpPercap for the whole dataframe is simply done with:

median(G$gdpPercap)

The frequency distribution can be visualised with the command:

hist(G$gdpPercap)
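To see what hist() computes behind the plot, the sketch below bins a small made-up vector. The vector x and the bin edges are assumptions chosen for illustration; with plot = FALSE the function returns the bin counts without drawing anything.

```r
# Hypothetical toy vector, not taken from the gapminder data
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# Bin edges 0, 2, 4, ..., 10; each bin includes its right edge
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

h$breaks  # the bin edges
h$counts  # observations per bin: 3, 5, 1, 0, 1

# The same call with a title and axis label draws the histogram
hist(x, breaks = seq(0, 10, by = 2),
     main = "Toy histogram", xlab = "x")
```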

If you want to calculate the median for only 1972, you can first make a subset of the data with:

G1972 <- G[G$year == 1972,]

after which you apply median() to the gdpPercap column of the subset:

median(G1972$gdpPercap)

Alternatively, you can also do this in one line by selecting all the rows in which the year is 1972 together with the column 'gdpPercap', and applying median() to this selection:

median(G[G$year == 1972, 'gdpPercap'])

Or by selecting the column 'gdpPercap' and subsequently all the rows in which the year is 1972:

median(G$gdpPercap[G$year == 1972])
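If you want the median for every year rather than a single one, the base R function tapply() can apply median() within each group. The sketch below uses a made-up miniature data frame (G_toy and its values are hypothetical stand-ins for G) so that the result is easy to check by hand.

```r
# Hypothetical miniature stand-in for G (values are made up)
G_toy <- data.frame(
  year      = c(1967, 1967, 1967, 1972, 1972, 1972),
  gdpPercap = c(500, 1500, 4000, 700, 2000, 6000)
)

# Median of gdpPercap within each year
med_by_year <- tapply(G_toy$gdpPercap, G_toy$year, median)
med_by_year

# The entry for 1972 matches the subsetting approach
med_by_year["1972"]
median(G_toy$gdpPercap[G_toy$year == 1972])
```

The same call should work on the real data as tapply(G$gdpPercap, G$year, median), assuming G has the columns used in this practical.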


The mode is the value that has the highest number of occurrences in a set of data. Unlike the mean and median, the mode can be determined for both numeric and character data. However, it usually does not give insightful results for continuous numeric data (it is sometimes relevant for discrete data).

R does not contain a separate command to calculate the statistical mode (the existing function mode() returns the storage type of an object, not the most frequent value). However, the more general command table() provides the mode as part of its output. The table() command counts the number of occurrences of each category in a variable; in other words, it makes a frequency table. It should only be applied to a vector with categorical data (which could be stored as character strings, but also as integers).
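A minimal sketch of how the mode can be read off a frequency table is shown below. The vector colours is made up for illustration, and picking the largest count with which.max() is one possible approach rather than a built-in mode function.

```r
# Hypothetical categorical vector, not taken from the gapminder data
colours <- c("red", "blue", "red", "green", "red", "blue")

# Frequency table: one count per category
freq <- table(colours)
freq

# Name of the category with the highest count
mode_value <- names(freq)[which.max(freq)]
mode_value  # "red"

# which.max() reports only the first maximum; to list every
# category that shares the top count, compare against max()
names(freq)[freq == max(freq)]
```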

Use table() to calculate the frequency distribution for the variable continent. Store the results in an object CF.

Think about the values that are shown, what do they mean? Based on this data, what would be the mode for the variable continent?

CF <- table(G$continent)

The table shows the number of observations in the dataframe per continent. As the dataframe contains multiple years, this does not correspond to the number of countries per continent! The mode for the variable continent is Africa. You can also visualise this in a bar chart with the following command:

barplot(CF)

(In case you are wondering how to find the number of countries per continent, remember the functions length() and unique()! For example, length(unique(G[G$continent == 'Europe', 'country'])) gives the number of unique countries in Europe.)
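The difference between counting rows and counting countries can be sketched on a made-up miniature data frame (G_toy and its contents are hypothetical stand-ins for G). The example also shows one possible way to get the country count for all continents at once, using tapply().

```r
# Hypothetical miniature stand-in for G: two years per country
G_toy <- data.frame(
  country   = c("Kenya", "Kenya", "Mali", "Mali", "France", "France"),
  continent = c("Africa", "Africa", "Africa", "Africa", "Europe", "Europe"),
  year      = c(1967, 1972, 1967, 1972, 1967, 1972),
  stringsAsFactors = FALSE
)

# table() counts rows (observations): Africa 4, Europe 2
table(G_toy$continent)

# Count the distinct countries within each continent: Africa 2, Europe 1
n_countries <- tapply(G_toy$country, G_toy$continent,
                      function(x) length(unique(x)))
n_countries
```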


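Quantiles specify where specific fractions of the data are located: the quantile for a given fraction is the value below which roughly that fraction of the observations falls. The 50th quantile (the 0.5 quantile, also called the 50th percentile) is in fact the median. The function quantile() computes quantiles in R.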

Calculate the 90th quantile of pop for the entire dataframe and subsequently separately for Africa. Describe in your own words what these data mean.

The 90th quantile of pop for the entire dataframe = 54801370
The 90th quantile of pop for Africa = 26426121

Use the help page ?quantile if you need help specifying the right input arguments to this function. There you can read that the quantiles are specified with a vector of probabilities in the range 0 to 1. That means that we need to specify 0.9 for the 90th quantile.

quantile(G$pop, 0.9)
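Because the probabilities are passed as a vector, several quantiles can be requested in one call. The sketch below uses a made-up vector (x is hypothetical, not part of G) and relies on the default quantile algorithm of quantile(); it also checks that the 0.5 quantile coincides with the median.

```r
# Hypothetical toy vector, not taken from the gapminder data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

# Several quantiles at once; probabilities are on the 0 to 1 scale
q <- quantile(x, probs = c(0.25, 0.5, 0.9))
q  # named "25%", "50%", "90%": 3.5, 6 and 10 for this vector

# The 0.5 quantile equals the median
unname(q["50%"]) == median(x)
```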

We can calculate the same quantile for only Africa by first making a subset with only the data for this continent.

G_Africa <- G[G$continent == 'Africa', ]

Then we can calculate the 90th quantile:

quantile(G_Africa$pop, 0.9)

Alternatively, you could do this in one line by selecting only the rows for Africa together with the column pop, and then applying the quantile() function to this selection.

quantile(G[G$continent == 'Africa', 'pop'], 0.9)
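To help put the meaning of such a value into words, the sketch below checks which share of the observations lies at or below the 0.9 quantile. The vector x is made up for illustration (the integers 1 to 100, not gapminder data).

```r
# Hypothetical toy vector: the integers 1 to 100
x <- 1:100

# The 0.9 quantile of x
q90 <- quantile(x, 0.9)
q90

# Share of observations at or below that value
mean(x <= q90)  # 0.9, i.e. 90% of the values
```

Read this way, the 90th quantile of pop is the population size that about 90% of the observations in the data do not exceed.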
