Commonly used functions for numeric vectors

Basic skills in R: Working with functions

Content

Sample data vector
Finding basic properties of a data vector
Determining measures of central tendency of a data vector
Computing sample variance and sample standard deviation

In this theory page we look at commonly used functions for a numeric vector, for example one coming from measurements of some quantity. These functions mostly compute descriptive statistics.

Sample data vector

In almost all examples below we work with the data vectors v and cleaned_v, which are created as follows:

> set.seed(123)
> v <- rnorm(10, mean = 5, sd = 2)
> v[sample(1:10, 3)] <- NA # randomly 3 replacements with NA
> v
 [1]       NA 4.539645 8.117417 5.141017 5.258575 8.430130 5.921832 2.469878       NA       NA
> cleaned_v <- as.numeric(na.omit(v))
> cleaned_v
[1] 4.539645 8.117417 5.141017 5.258575 8.430130 5.921832 2.469878

In the above session we create a random vector of 10 values drawn from a normal distribution with mean 5 and standard deviation 2. In this vector we replace three randomly chosen values by NA, which stands for 'Not Available'. The resulting vector v has seven numeric values and NA at three positions, as shown in the session. A cleaned version of this vector, called cleaned_v, is made by removing all not available components of v.
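To inspect where the missing values sit, a small sketch like the following may help. It retypes v from the printed output above instead of regenerating it, so the values are rounded to seven significant digits; this is an assumption made to keep the example independent of the random number generator, whose sampling results can differ between R versions.

```r
# v retyped from the printed session output (rounded values)
v <- c(NA, 4.539645, 8.117417, 5.141017, 5.258575,
       8.430130, 5.921832, 2.469878, NA, NA)

na_positions <- which(is.na(v))  # indices of the missing values
n_missing    <- sum(is.na(v))    # number of missing values

# Logical indexing is an alternative to as.numeric(na.omit(v))
cleaned_v <- v[!is.na(v)]

print(na_positions)
print(n_missing)
print(cleaned_v)
```

Both ways of cleaning give a plain numeric vector with the seven observed values.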

Finding basic properties of a data vector

The function length() is used to find the number of elements in a vector. Note that missing values are counted as elements.

> length(v)
[1] 10
> length(cleaned_v)
[1] 7

When there are no missing data in a data vector, the functions min() and max() return the smallest and largest value in the data vector, respectively. When data are missing, these functions return NA unless you add the argument na.rm = TRUE.

> min(cleaned_v)
[1] 2.469878
> max(cleaned_v)
[1] 8.43013
> min(v)
[1] NA
> max(v) 
[1] NA
> min(v, na.rm = TRUE)
[1] 2.469878
> max(v, na.rm = TRUE)
[1] 8.43013

The function range() returns a vector of length two containing the smallest and the largest value:

> range(v, na.rm = TRUE)
[1] 2.469878 8.430130
> range(cleaned_v)
[1] 2.469878 8.430130
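As a short illustration of how these functions fit together, the sketch below derives the width of the range with diff() and locates the extreme values with which.min() and which.max(). The vector is retyped from the printed output, so its values are rounded; the names r and width are chosen here for illustration only.

```r
# v retyped from the printed session output (rounded values)
v <- c(NA, 4.539645, 8.117417, 5.141017, 5.258575,
       8.430130, 5.921832, 2.469878, NA, NA)

r     <- range(v, na.rm = TRUE)  # c(minimum, maximum)
width <- diff(r)                 # maximum minus minimum

# which.min() and which.max() discard missing values and
# return the position of the extreme value in the original vector
pos_min <- which.min(v)
pos_max <- which.max(v)

print(r)
print(width)
print(c(pos_min, pos_max))
```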

Important note: All functions discussed from here on give a meaningful result for data with missing values only when you either add the argument na.rm = TRUE or apply them to a cleaned version of the data vector.
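The effect of missing values can be checked with a sketch along these lines, which compares a few calls with and without na.rm = TRUE and uses anyNA() as a quick test for missing data. The vector is retyped from the printed output (rounded values), and the names unsafe and safe are illustrative.

```r
# v retyped from the printed session output (rounded values)
v <- c(NA, 4.539645, 8.117417, 5.141017, 5.258575,
       8.430130, 5.921832, 2.469878, NA, NA)

has_missing <- anyNA(v)  # TRUE if at least one value is NA

# Without na.rm = TRUE the missing values propagate into the result
unsafe <- c(sum = sum(v), mean = mean(v), median = median(v))

# With na.rm = TRUE the missing values are dropped first
safe <- c(sum    = sum(v, na.rm = TRUE),
          mean   = mean(v, na.rm = TRUE),
          median = median(v, na.rm = TRUE))

print(has_missing)
print(unsafe)
print(safe)
```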

Determining measures of central tendency of a data vector

We discuss how to compute measures of central tendency, such as the mean, median, and mode of a data vector.

Mean

The (sample) mean or average of a data vector equals the sum of all values divided by the number of values; it is also called the arithmetic mean. More formally, the mean \(\bar{x}\) of a numeric data vector \(x=(x_1, x_2, \ldots, x_n)\) is given by \[\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i\] The sum of all values in a vector can be computed via the function sum().

> sum(cleaned_v)
[1] 39.87849
> sum(v, na.rm = TRUE)
[1] 39.87849

Therefore the arithmetic mean can be computed as follows:

> sum(cleaned_v) / length(cleaned_v)
[1] 5.696928

Of course, R has a built-in function for this purpose, simply called mean():

> mean(cleaned_v)
[1] 5.696928
> mean(v, na.rm = TRUE)
[1] 5.696928
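When computing the mean by hand on a vector that still contains NA, the divisor needs care: length(v) counts the missing values too. The following sketch, with v retyped from the printed output (rounded values) and illustrative variable names, contrasts the correct divisor with the incorrect one.

```r
# v retyped from the printed session output (rounded values)
v <- c(NA, 4.539645, 8.117417, 5.141017, 5.258575,
       8.430130, 5.921832, 2.469878, NA, NA)

total <- sum(v, na.rm = TRUE)  # sum of the observed values
n_obs <- sum(!is.na(v))        # number of observed values (7)

mean_manual <- total / n_obs      # agrees with mean(v, na.rm = TRUE)
mean_wrong  <- total / length(v)  # divides by 10, so the result is too small

print(mean_manual)
print(mean_wrong)
```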

Weighted mean

R provides the function weighted.mean() to compute a weighted arithmetic mean. If needed, you can again use the argument na.rm = TRUE to deal with missing values. In the session below we have a data vector x consisting of the natural numbers \(1, 2, \ldots, 7\). We create two subvectors x1 = (1, 2, 3) and x2 = (4, 5, 6, 7). The means of x, x1, and x2 are \(4\), \(2\), and \(5.5\), respectively. The average of the means of x1 and x2 is \(3.75\), which differs from the mean of x. The reason for this difference is that the plain average ignores that x1 and x2 have different numbers of elements. With a weighted mean we can recover the mean of x from the means of x1 and x2 by using the weights \(\tfrac{3}{7}\) and \(\tfrac{4}{7}\), since \(\tfrac{3}{7}\cdot 2+\tfrac{4}{7}\cdot 5.5=4\).

> x <- 1:7
> x1 <- x[1:3]
> x2 <- x[4:7]
> m1 <- mean(x1); m1
[1] 2
> m2 <- mean(x2); m2
[1] 5.5
> mean_x <- mean(x); mean_x
[1] 4
> mean_m1_m2 <- mean(c(m1,m2)); mean_m1_m2
[1] 3.75
> weighted_mean_m1_m2 <- weighted.mean(c(m1,m2), c(3/7, 4/7))
> weighted_mean_m1_m2
[1] 4
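The weights do not have to sum to one, because weighted.mean() divides by the sum of the weights. The sketch below therefore uses the group sizes directly as weights and compares the result with the defining formula; the names group_means and group_sizes are chosen here for illustration.

```r
x  <- 1:7
x1 <- x[1:3]
x2 <- x[4:7]

group_means <- c(mean(x1), mean(x2))      # 2 and 5.5
group_sizes <- c(length(x1), length(x2))  # 3 and 4

# Group sizes used directly as weights
wm_counts <- weighted.mean(group_means, group_sizes)

# The same computation written out: sum(w * x) / sum(w)
wm_manual <- sum(group_means * group_sizes) / sum(group_sizes)

print(wm_counts)
print(wm_manual)
```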

Trimming

If the argument trim = f, with \(0<f<0.5\), is specified in the call of the function mean(), then R computes a symmetrically trimmed mean: a fraction \(f\) of the values is deleted from each end of the sorted data before the mean is computed.

> cleaned_v
[1] 4.539645 8.117417 5.141017 5.258575 8.430130 5.921832 2.469878
> mean(cleaned_v)
[1] 5.696928
> mean(cleaned_v, trim = 0.15)
[1] 5.795697
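The trimmed mean can be reproduced by hand, as in the sketch below. It assumes that the number of values dropped from each end is floor(n * trim), which gives one value per end for seven values and trim = 0.15 and matches the result above. The vector is retyped from the printed output (rounded values).

```r
# cleaned_v retyped from the printed session output (rounded values)
cleaned_v <- c(4.539645, 8.117417, 5.141017, 5.258575,
               8.430130, 5.921832, 2.469878)

trim <- 0.15
n    <- length(cleaned_v)
k    <- floor(n * trim)  # values dropped from each end (assumed rule)

sorted_v <- sort(cleaned_v)
kept     <- sorted_v[(k + 1):(n - k)]  # drop the k smallest and k largest

trimmed_manual <- mean(kept)

print(k)
print(trimmed_manual)
print(mean(cleaned_v, trim = trim))
```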

Median

The median of a data vector equals the 'middle value' of the values in the data vector, i.e. the value at which half the values in the data vector are smaller and half are larger. In other words, it is the 50th percentile and the 0.5 quantile of the data vector. It can be computed as follows:

Sort the values from smallest to largest (or vice versa).
Count in to find the middle value. The median is the middle value for an odd number of elements in the data vector; it is defined as the mean of the two middle values for an even number of elements.

In R, the above procedure can be performed in the background via the dedicated function median():

> sort(cleaned_v)
[1] 2.469878 4.539645 5.141017 5.258575 5.921832 8.117417 8.430130
> median(cleaned_v)
[1] 5.258575
> median(v, na.rm = TRUE)
[1] 5.258575
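The two-step procedure can be written out as a small function. This is only a sketch for illustration; manual_median is a hypothetical helper, not part of R, and the vector is retyped from the printed output (rounded values).

```r
# Hypothetical helper that follows the two steps described above
manual_median <- function(x) {
  x <- sort(x)  # step 1: sort; note that sort() drops NA by default
  n <- length(x)
  if (n == 0) return(NA_real_)
  mid <- (n + 1) / 2
  # step 2: for odd n both indices coincide with the middle position,
  # for even n they are the two middle positions
  mean(x[c(floor(mid), ceiling(mid))])
}

cleaned_v <- c(4.539645, 8.117417, 5.141017, 5.258575,
               8.430130, 5.921832, 2.469878)

print(manual_median(cleaned_v))     # odd number of elements
print(manual_median(c(1, 2, 3, 4))) # even number of elements
```

Unlike median(), this helper silently ignores missing values, because sort() removes them.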

Other quantiles

The \(p\)th percentile of a data vector is the value below which \(p\%\) of the values lie; this is also the \(p/100\) quantile. In R, the function quantile() computes sample quantiles, which serve as estimates of the quantiles of the underlying distribution. Several estimation methods are available through the argument type. With type = 1 the result is always one of the observed values:

> quantile(cleaned_v, .15, type=1)
     15% 
4.539645 
> quantile(cleaned_v, .85, type=1)
     85% 
8.117417
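To see what the argument type changes, the sketch below compares type = 1 with the default method (type = 7), which interpolates linearly between neighbouring sorted values. It also passes several probabilities in one call and uses names = FALSE to get a plain numeric vector. The data are retyped from the printed output (rounded values).

```r
# cleaned_v retyped from the printed session output (rounded values)
cleaned_v <- c(4.539645, 8.117417, 5.141017, 5.258575,
               8.430130, 5.921832, 2.469878)

probs <- c(0.15, 0.85)

# type = 1: always returns observed values
q_type1 <- quantile(cleaned_v, probs, type = 1, names = FALSE)

# default (type = 7): interpolates between observed values
q_default <- quantile(cleaned_v, probs, names = FALSE)

# the 0.5 quantile of the default method equals the median
q_half <- quantile(cleaned_v, 0.5, names = FALSE)

print(q_type1)
print(q_default)
print(q_half)
```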

Mode

The mode of a distribution is the most frequently occurring score or category. In a data vector, the mode is the value (or values) that occurs most frequently. A way to remember this: it is the number (or numbers) that is popular (in Dutch: "in de mode") in your data vector.

You can use the function mlv() from the R package modeest to compute the mode. A simple example shows you how:

> install.packages("modeest")
> sample1 <- c(1,2,3,4,3)
> modeest::mlv(sample1, method = "mfv") 
[1] 3
> sample2 <- c(1,2,3,4,3,2) # a bimodal sample
> modeest::mlv(sample2, method = "mfv")
[1] 2 3
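If installing an extra package is not an option, the most frequent value(s) can also be found with base R. The function most_frequent below is a hypothetical helper written for this illustration; it counts how often each distinct value occurs and returns all values with the highest count, in order of first appearance.

```r
# Hypothetical base-R helper: returns the most frequent value(s) of x
most_frequent <- function(x) {
  distinct <- unique(x)                    # distinct values, in order of appearance
  counts   <- tabulate(match(x, distinct)) # frequency of each distinct value
  distinct[counts == max(counts)]          # keep all values with the top count
}

sample1 <- c(1, 2, 3, 4, 3)
sample2 <- c(1, 2, 3, 4, 3, 2)  # a bimodal sample

print(most_frequent(sample1))
print(most_frequent(sample2))
```

For measured data in which every value occurs once, such as cleaned_v, this helper returns all values, so the mode is mainly informative for data with repeated values.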

Computing sample variance and sample standard deviation

The (sample) variance is a measure of the spread of the values in a data vector and is defined as follows: the variance \(s^2\) (or \(\mathrm{var}(x)\)) of a numeric data vector \(x=(x_1, x_2, \ldots, x_n)\) is given by \[s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\] The (sample) standard deviation \(s\) is the square root of the variance, i.e. \[s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\] The two R functions that compute these measures of spread are var() and sd(), respectively:

> v
[1]       NA 4.539645 8.117417 5.141017 5.258575 8.430130 5.921832 2.469878       NA       NA
> cleaned_v
[1] 4.539645 8.117417 5.141017 5.258575 8.430130 5.921832 2.469878
> var(v)  
[1] NA
> var(v, na.rm = TRUE) # variance with omission of not available values
[1] 4.272348
> var(cleaned_v)
[1] 4.272348
> sd(v) 
[1] NA
> sd(v, na.rm = TRUE) # standard deviation with omission of not available values
[1] 2.066966
> sd(cleaned_v)
[1] 2.066966
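The definitions of \(s^2\) and \(s\) can be checked against var() and sd() by writing the formulas out, as in the following sketch. The vector is retyped from the printed output (rounded values), and the variable names are illustrative.

```r
# cleaned_v retyped from the printed session output (rounded values)
cleaned_v <- c(4.539645, 8.117417, 5.141017, 5.258575,
               8.430130, 5.921832, 2.469878)

n          <- length(cleaned_v)
deviations <- cleaned_v - mean(cleaned_v)  # x_i minus the sample mean

var_manual <- sum(deviations^2) / (n - 1)  # note the divisor n - 1, not n
sd_manual  <- sqrt(var_manual)

print(var_manual)
print(sd_manual)
```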