Applying functions on data structures

All functions below with "apply" in their name provide a form of implicit repetition: they all apply a function repeatedly to components of an input object and collect the results in a single structure.

apply: The R function apply() applies a function across an array, matrix, or data frame. How it does this depends on the value of the argument MARGIN:

1 for row-wise application,
2 for column-wise application, and
c(1,2) for application over both rows and columns, that is, to each individual element of a matrix (mainly useful for multidimensional arrays).

Depending on the input object type and the function passed in, apply() outputs a vector, a list, a matrix, or an array. The function that is passed in can be a built-in R function or a function that you have defined yourself.

A matrix example will make clear what is meant by all of the above. Have a close look at the instructions and their outputs.

> M <- matrix((1:12), nrow = 3); M
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> mean(M)    # average of all matrix entries
[1] 6.5
> mean(1:12) # average of all natural numbers from 1 to 12
[1] 6.5
> apply(M, MARGIN = 1, FUN = mean)  # averages of row values
[1] 5.5 6.5 7.5
> apply(M, MARGIN = 2, FUN = mean)  # averages of column values
[1]  2  5  8 11
> apply(M[, c(1,3)], MARGIN = 2, FUN = mean) # averages of columns 1 and 3
[1] 2 8
> apply(M, MARGIN = 2, FUN = cumsum) # cumulative sums of columns
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    3    9   15   21
[3,]    6   15   24   33
> apply(M, MARGIN = 1, FUN = range) # ranges (min and max) of rows
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   10   11   12
> F <- function(x) { x + sample(c(-1,0,1), size = 1)}
> set.seed(123)
> F(M)   # the same randomly drawn number is added to every matrix element
     [,1] [,2] [,3] [,4]
[1,]    2    5    8   11
[2,]    3    6    9   12
[3,]    4    7   10   13
> set.seed(123)
> apply(M, MARGIN = c(1,2), FUN = F) - M  # random change of individual matrix elements
     [,1] [,2] [,3] [,4]
[1,]    1    0    0   -1
[2,]    1    1    0    0
[3,]    1    0    1    0
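
The remark that MARGIN = c(1,2) is mainly useful for multidimensional arrays can be illustrated with a small three-dimensional array. The following is a minimal sketch; the array A and the result names s12 and s3 are made up for this illustration.

```r
# A hypothetical 2 x 3 x 4 array filled with the numbers 1 to 24
A <- array(1:24, dim = c(2, 3, 4))

# For every (row, column) pair, sum the values along the third dimension
s12 <- apply(A, MARGIN = c(1, 2), FUN = sum)
dim(s12)      # a 2 x 3 matrix

# For every index of the third dimension, sum the whole 2 x 3 slice
s3 <- apply(A, MARGIN = 3, FUN = sum)
length(s3)    # one value per slice, so 4 values
```

Here the dimensions listed in MARGIN are the ones that are kept in the result; the remaining dimensions are passed to the function.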

lapply: The function lapply() is a variant of apply() that takes a vector, a list, or a data frame as input and always outputs a list (the letter "l" in the function name stands for "list"). The specified function is applied to each element of the input object, hence the length of the resulting list is always equal to the length of the input object. You can turn such a list into a vector via the function unlist().

The syntax of lapply() is similar to that of apply(), except that there is no argument MARGIN: the function is applied element-wise to lists and vectors, and column-wise to data frames.

We present two simple examples; have a close look at the instructions and their outputs.

> movie_titles <- c("ALIENS", "FROZEN", "GRAVITY", "MALEFICENT")
> movie_titles_lowercase <- lapply(movie_titles, FUN = tolower)
> str(movie_titles_lowercase)
List of 4
 $ : chr "aliens"
 $ : chr "frozen"
 $ : chr "gravity"
 $ : chr "maleficent"
> unlist(movie_titles_lowercase)
[1] "aliens"     "frozen"     "gravity"    "maleficent"

> v <- c(1,2,3); w <- c(4,5,6); df <- data.frame(v,w)
> df
  v w
1 1 4
2 2 5
3 3 6
> lapply(df, FUN = mean)   # compute the mean of v and w
$v
[1] 2

$w
[1] 5

> lapply(df, FUN = cumsum) # compute the cumulative sum of v and w
$v
[1] 1 3 6

$w
[1]  4  9 15
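
Because lapply() works element by element, the elements of a list may differ in length, and extra arguments for the function can be passed along after FUN. A minimal sketch, in which the list scores and its contents are invented for illustration:

```r
# A hypothetical list whose elements have different lengths
scores <- list(a = c(1, 2, NA), b = c(4, 5, 6, 7))

# na.rm = TRUE is handed on to mean() for every list element
means <- lapply(scores, FUN = mean, na.rm = TRUE)
str(means)

# Flatten the resulting list into a named numeric vector
unlist(means)
```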

sapply and vapply: The function sapply() is a user-friendly wrapper around lapply(). It takes a list, vector, or data frame as input and, by default, simplifies the result to a vector or matrix where possible (or to an array if the argument simplify = "array" is specified). If the results cannot be simplified, for instance because they differ in length, sapply() returns a list, just like lapply().

The function vapply() is similar to sapply(), but requires the type and length of each return value to be specified in advance via the argument FUN.VALUE, so it can be safer (and sometimes faster) to use.

Let's use the same examples as in the explanation of lapply():

> movie_titles <- c("ALIENS", "FROZEN", "GRAVITY", "MALEFICENT")
> movie_titles_lowercase <- sapply(movie_titles, FUN = tolower)
> str(movie_titles_lowercase)
 Named chr [1:4] "aliens" "frozen" "gravity" "maleficent"
 - attr(*, "names")= chr [1:4] "ALIENS" "FROZEN" "GRAVITY" "MALEFICENT"
> movie_titles_lowercase
      ALIENS       FROZEN      GRAVITY   MALEFICENT 
    "aliens"     "frozen"    "gravity" "maleficent"
> vapply(movie_titles, FUN = tolower, FUN.VALUE = character(1))
      ALIENS       FROZEN      GRAVITY   MALEFICENT 
    "aliens"     "frozen"    "gravity" "maleficent"

> v <- c(1,2,3); w <- c(4,5,6); df <- data.frame(v,w)
> df
  v w
1 1 4
2 2 5
3 3 6
> sapply(df, FUN = mean)   # compute the mean of v and of w
v w 
2 5 
> sapply(df, FUN = cumsum) # compute the cumulative sum of v and of w
     v  w
[1,] 1  4
[2,] 3  9
[3,] 6 15
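
The claim that vapply() can be safer than sapply() is easiest to see when the applied function returns results of varying length. The sketch below uses a made-up list x and a made-up helper keep_large; it is meant as an illustration, not as a recommended pattern.

```r
# Hypothetical input: three integer vectors
x <- list(1:3, 4:6, 7:8)

# Hypothetical helper that returns a varying number of values
keep_large <- function(v) v[v > 2]

# Every call to range() returns two values, so sapply() builds a 2 x 3 matrix
ranges <- sapply(x, FUN = range)

# The results differ in length, so sapply() silently falls back to a list
irregular <- sapply(x, FUN = keep_large)

# vapply() insists on the declared shape (one number per element) and
# raises an error instead; tryCatch() turns the error into its message
checked <- tryCatch(
  vapply(x, FUN = keep_large, FUN.VALUE = numeric(1)),
  error = function(e) conditionMessage(e)
)
```

With sapply() the type of the result depends on the data, whereas vapply() either returns the declared type or fails, which makes mistakes visible earlier.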

tapply: The function tapply(X = ..., INDEX = ..., FUN = ...) splits the data in the argument X into groups based on the levels of the argument INDEX, and then applies the function given in the argument FUN to each group. If INDEX is a list of two factors, a cross table is created; in the case of $n\ge 3$ factors, an $n$-way contingency table is constructed. This explains the "t" as first character in the function name: it can be used to create a table.

A simple example of tapply() uses data on a team of eight university lecturers whose monthly salary and function level are known. We use tapply() to compute the average monthly salary per function level. We also show how both the average salary and its standard deviation can be computed per function level.

> monthly_salary <- c(7305, 3877, 4786, 4494, 6002, 5705, 6305, 5705)
> function_level <- c("D1", "D4", "D4", "D3", "D1", "D2", "D2", "D3")
> tapply(X = monthly_salary, INDEX = function_level, FUN = mean)
    D1     D2     D3     D4 
6653.5 6005.0 5099.5 4331.5

> # Next, the mean monthly salary and standard deviation for each function level.
> tapply(X = monthly_salary, INDEX = function_level, FUN = function(x) {c(mean(x), sd(x))})
$D1
[1] 6653.5000  921.3601

$D2
[1] 6005.0000  424.2641

$D3
[1] 5099.5000  856.3063

$D4
[1] 4331.5000  642.7601
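
The split-then-apply behaviour of tapply() can also be spelled out by hand with split() and sapply(). This is a minimal sketch of an equivalent computation on the same salary data, not a description of how tapply() is implemented; the names by_level and stats are chosen for this illustration.

```r
monthly_salary <- c(7305, 3877, 4786, 4494, 6002, 5705, 6305, 5705)
function_level <- c("D1", "D4", "D4", "D3", "D1", "D2", "D2", "D3")

# Step 1: split the salaries into one vector per function level
by_level <- split(monthly_salary, function_level)

# Step 2: apply a function to each group; because every result has
# length 2, sapply() simplifies the results to a matrix
stats <- sapply(by_level, FUN = function(x) c(mean = mean(x), sd = sd(x)))
stats
```

The result is a matrix with one column per function level and the rows mean and sd, which is often easier to work with than the list returned by tapply() above.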

In the second example we use the built-in dataset mtcars and compute the mean horsepower (hp) of cars with a given number of gears (gear). We also create a cross table of mean horsepower with respect to the number of cylinders (cyl) and the number of gears (gear). The value NA in the output means that there is no car in the dataset with 8 cylinders and 4 gears.

> with(mtcars, tapply(X = hp, INDEX = gear, FUN = mean))
       3        4        5 
176.1333  89.5000 195.6000 
> with(mtcars, tapply(X = hp, INDEX = list(cyl, gear), FUN = mean))
         3     4     5
4  97.0000  76.0 102.0
6 107.5000 116.5 175.0
8 194.1667    NA 299.5
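
To check that the NA really corresponds to an empty group, one can count the cars per combination of cylinders and gears. A short sketch using table(); the name counts is chosen for this illustration.

```r
# Number of cars for each combination of cylinders (rows) and gears (columns)
counts <- with(mtcars, table(cyl, gear))
counts

# The cell for 8 cylinders and 4 gears is empty
counts["8", "4"]
```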