Basic skills in R: Working with functions
Commonly used functions for numeric vectors
In this theory page we look at commonly used functions for a numeric vector, for example one coming from measurements of some quantity. The functions mostly involve descriptive statistics.
Sample data vector
In almost all examples below we will work with data vectors v and cleaned_v that have been created in the following way:
> set.seed(123)
> v <- rnorm(10, mean = 5, sd = 2)
> v[sample(1:10, 3)] <- NA  # replace 3 randomly chosen values with NA
> v
[1] NA 4.539645 8.117417 5.141017 5.258575 8.430130 5.921832 2.469878 NA NA
> cleaned_v <- as.numeric(na.omit(v))
> cleaned_v
[1] 4.539645 8.117417 5.141017 5.258575 8.430130 5.921832 2.469878
What we did in the above session is create a random vector of 10 values drawn from a normal distribution with mean 5 and standard deviation 2. In this vector we randomly replaced three values by NA, which marks them as 'Not Available'. The resulting vector v has seven numeric values and NA at three positions, as shown in the session. A cleaned version of this vector, called cleaned_v, is made by removing all not available components of v.
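As a small sketch of what the cleaning step does, the code below rebuilds v from the values printed in the session (typed in by hand, so they are rounded to six decimals) and drops the missing values with is.na() instead of na.omit(). The name cleaned_alt is introduced here only for illustration.

```r
# Values copied from the printed session output (rounded as displayed)
v <- c(NA, 4.539645, 8.117417, 5.141017, 5.258575,
       8.430130, 5.921832, 2.469878, NA, NA)

is.na(v)         # logical vector: TRUE where a value is missing
sum(is.na(v))    # number of missing values
which(is.na(v))  # positions of the missing values

# Logical indexing keeps only the non-missing values
cleaned_alt <- v[!is.na(v)]

# The na.omit() approach used on this page gives the same vector
cleaned_v <- as.numeric(na.omit(v))
identical(cleaned_alt, cleaned_v)
```

Both approaches should give the same seven values; which one to use is largely a matter of taste.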
Finding basic properties of a data vector
The function length() returns the number of elements in a vector. Missing values (NA) are counted as elements too.
> length(v)
[1] 10
> length(cleaned_v)
[1] 7
When there are no missing data in a data vector, the functions min() and max() can be used to find the smallest and largest value in the data vector, respectively. When data are missing, these functions return NA unless you add the argument na.rm = TRUE.
> min(cleaned_v)
[1] 2.469878
> max(cleaned_v)
[1] 8.43013
> min(v)
[1] NA
> max(v)
[1] NA
> min(v, na.rm = TRUE)
[1] 2.469878
> max(v, na.rm = TRUE)
[1] 8.43013
The function range()
returns a vector with the smallest and greatest value:
> range(v, na.rm = TRUE)
[1] 2.469878 8.430130
> range(cleaned_v)
[1] 2.469878 8.430130
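A short sketch to confirm that range() simply combines min() and max(). The vector is retyped from the printed output, so its values are rounded; the variable name r is chosen here for illustration only.

```r
# Values copied from the printed session output (rounded as displayed)
cleaned_v <- c(4.539645, 8.117417, 5.141017, 5.258575,
               8.430130, 5.921832, 2.469878)

r <- range(cleaned_v)

# range() returns the same two numbers as min() and max()
identical(r, c(min(cleaned_v), max(cleaned_v)))

# The difference between the largest and smallest value
diff(r)
```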
Important note
All functions discussed from here on only give meaningful results for data with missing values when you either add the argument na.rm = TRUE or apply them to a cleaned version of the data vector.
Determining measures of central tendency of a data vector
We discuss how you compute measures of central tendency like mean, median, and mode of a data vector.
Mean
The (sample) mean or average of a data vector equals the sum of all values divided by the total number of values; it is also called the arithmetic mean. More formally, the mean \(\bar{x}\) of a numeric data vector \(x=(x_1, x_2, \ldots, x_n)\) is given by \[\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i\] The sum of all values in a vector can be computed via the function sum().
> sum(cleaned_v)
[1] 39.87849
> sum(v, na.rm = TRUE)
[1] 39.87849
Therefore the arithmetic mean can be computed as follows:
> sum(cleaned_v) / length(cleaned_v)
[1] 5.696928
Of course, R has a built-in function for this purpose: it is simply called mean().
> mean(cleaned_v)
[1] 5.696928
> mean(v, na.rm = TRUE)
[1] 5.696928
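One pitfall is worth a sketch: when computing the mean by hand on a vector that still contains NA, the sum must be divided by the number of non-missing values, not by length(), which counts NA elements too. The names wrong_mean and right_mean are hypothetical and used only in this example; v is retyped from the printed output.

```r
# Values copied from the printed session output (rounded as displayed)
v <- c(NA, 4.539645, 8.117417, 5.141017, 5.258575,
       8.430130, 5.921832, 2.469878, NA, NA)

# Dividing by length(v) divides by 10, because NA elements are counted
wrong_mean <- sum(v, na.rm = TRUE) / length(v)

# Dividing by the number of non-missing values divides by 7
right_mean <- sum(v, na.rm = TRUE) / sum(!is.na(v))

wrong_mean
right_mean
mean(v, na.rm = TRUE)  # agrees with right_mean
```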
Median
The median of a data vector equals the 'middle value' of the values in the data vector, i.e. the value at which half the values in the data vector are smaller and half are larger. In other words, it is the 50th percentile or the 0.5 quantile of the data vector. It can be computed as follows:
- Sort the values from smallest to largest (or vice versa).
- Count in to find the middle value. The median is the middle value for an odd number of
elements in the data vector; it is defined as the mean of the two middle
values for an even number of elements.
In R, the above procedure can be performed in the background via the dedicated function median()
:
> sort(cleaned_v)
[1] 2.469878 4.539645 5.141017 5.258575 5.921832 8.117417 8.430130
> median(cleaned_v)
[1] 5.258575
> median(v, na.rm = TRUE)
[1] 5.258575
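The two-step procedure described above can be sketched as a small function. The name manual_median is hypothetical and serves only to illustrate the steps; in practice you would use median().

```r
# A hand-written median, following the sort-and-pick-the-middle procedure
manual_median <- function(x) {
  x <- sort(x)      # sort() drops NA values by default
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                 # odd n: the single middle value
  } else {
    mean(x[c(n / 2, n / 2 + 1)])   # even n: mean of the two middle values
  }
}

# Values copied from the printed session output (rounded as displayed)
cleaned_v <- c(4.539645, 8.117417, 5.141017, 5.258575,
               8.430130, 5.921832, 2.469878)

manual_median(cleaned_v)      # odd number of elements
manual_median(c(1, 2, 3, 4))  # even number of elements
```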
Mode
The mode of a distribution is the most frequently occurring score or category. In a data vector, the mode is the value (or values) that occurs most frequently. A way to remember this: it is the number (or numbers) that is popular (in Dutch: "in de mode") in your data vector.
You can use the function mlv()
from the R package modeest
to compute the mode. A simple example shows you how:
> install.packages("modeest")
> sample1 <- c(1,2,3,4,3)
> modeest::mlv(sample1, method = "mfv")
[1] 3
> sample2 <- c(1,2,3,4,3,2)  # a bimodal sample
> modeest::mlv(sample2, method = "mfv")
[1] 2 3
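If you prefer not to install an extra package, the most frequent value(s) can also be found with base R. The following is a minimal sketch, not the method used by modeest; the function name most_frequent is hypothetical. It relies on table(), which converts the values to text labels, so it is best suited to integer-like or categorical data rather than measurements with many decimals.

```r
# Most frequent value(s) of a numeric vector using base R only
most_frequent <- function(x) {
  counts <- table(x)  # frequency of each distinct value; NA is ignored by default
  as.numeric(names(counts)[counts == max(counts)])
}

sample1 <- c(1, 2, 3, 4, 3)
most_frequent(sample1)

sample2 <- c(1, 2, 3, 4, 3, 2)  # a bimodal sample
most_frequent(sample2)
```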
Computing sample variance and sample standard deviation
The (sample) variance is a measure of the spread of values in a data vector and is defined as follows: the variance \(s^2\) (or \(\mathrm{var}(x)\)) of a numeric data vector \(x=(x_1, x_2, \ldots, x_n)\) is given by \[s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\] The (sample) standard deviation \(s\) is the square root of the variance, i.e. \[s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\] The two R functions that compute these measures of spread are var() and sd(), respectively:
> v
[1] NA 4.539645 8.117417 5.141017 5.258575 8.430130 5.921832 2.469878 NA NA
> cleaned_v
[1] 4.539645 8.117417 5.141017 5.258575 8.430130 5.921832 2.469878
> var(v)
[1] NA
> var(v, na.rm = TRUE)  # variance with omission of not available values
[1] 4.272348
> var(cleaned_v)
[1] 4.272348
> sd(v)
[1] NA
> sd(v, na.rm = TRUE)  # standard deviation with omission of not available values
[1] 2.066966
> sd(cleaned_v)
[1] 2.066966
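To connect the formulas to the built-in functions, the sketch below computes the variance and standard deviation step by step and compares them with var() and sd(). The names n, x_bar, s2 and s mirror the symbols in the formulas and are chosen here for illustration; cleaned_v is retyped from the printed output, so its values are rounded.

```r
# Values copied from the printed session output (rounded as displayed)
cleaned_v <- c(4.539645, 8.117417, 5.141017, 5.258575,
               8.430130, 5.921832, 2.469878)

n     <- length(cleaned_v)                       # number of values
x_bar <- mean(cleaned_v)                         # sample mean
s2    <- sum((cleaned_v - x_bar)^2) / (n - 1)    # sample variance
s     <- sqrt(s2)                                # sample standard deviation

# Compare with the built-in functions
all.equal(s2, var(cleaned_v))
all.equal(s, sd(cleaned_v))
```

Note the divisor n - 1 rather than n: this is what makes it the sample variance, and it is what var() and sd() use.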