Long versus wide format

Basic skills in R: Getting data into and out of R

Content

Introduction and the dataset of the example
Table in wide format
Table in long format
Conversion from wide to long format
Conversion from long to wide format

Introduction and the dataset of the example

Once you have entered, cleaned, and validated all the data, the dataset may still not have the right structure. By structure, we mean how the independent and dependent variables, the subject IDs, and so on are organized. A distinction is made between datasets in wide format and in long format; we explain the difference using the example below. It is important to know both formats because statistical and visualisation functions in R usually require long format, while the data are often available in wide format. The structure then has to be converted.

Suppose you are conducting research on the memory of elderly individuals and want to test the effect of a memory training program. At the beginning of the study, participants take a memory test; this is usually called the pretest. The participants then follow a memory training program for 4 weeks. At the end of the training period, they take another memory test; this is typically referred to as the posttest. The difference between the posttest and pretest scores gives insight into the short-term effect of the memory training.

We use the fictitious table below in which the scores of four participants on the two memory tests are provided.

participant	pretest	posttest
A	525	780
B	610	790
C	630	840
D	575	800

Table in wide format

The most intuitive way to store the measurements from our example is to give each participant one row and to record that participant's scores in consecutive columns, one column per test. The participant's ID or label goes in the first column, which contains no duplicate values. This is how we tabulated the data above, and it is called a dataset in wide format.

In R, we can quickly enter the data in this manner into a data frame as follows:

> participant <- c("A", "B", "C", "D")
> pretest <- c(525, 610, 630, 575)
> posttest <- c(780, 790, 840, 800)
> df_wide <- data.frame(participant, pretest, posttest)
> df_wide
  participant pretest posttest
1           A     525      780
2           B     610      790
3           C     630      840
4           D     575      800
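
One practical consequence of the wide format is that the difference score mentioned in the introduction can be computed directly from the two score columns. The following is a minimal sketch that rebuilds the example data frame; the names df_wide_diff and difference are chosen here for illustration only and do not appear elsewhere in this text.

```r
# Example data in wide format (same values as in the table above)
df_wide <- data.frame(
  participant = c("A", "B", "C", "D"),
  pretest     = c(525, 610, 630, 575),
  posttest    = c(780, 790, 840, 800)
)

# Add a column with posttest minus pretest; the result is stored under a
# separate (hypothetical) name so that df_wide itself stays unchanged
df_wide_diff <- transform(df_wide, difference = posttest - pretest)
df_wide_diff
```

Because both scores of a participant sit in the same row, no matching of rows is needed to compute the difference.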

Table in long format

The data can also be structured in long format. In our example, all scores from both tests go into a single column, with one additional column indicating which test (pretest or posttest) each score belongs to and another indicating which participant it comes from, as illustrated in the table below.

participant	kind of test	score
A	pretest	525
B	pretest	610
C	pretest	630
D	pretest	575
A	posttest	780
B	posttest	790
C	posttest	840
D	posttest	800

In R, we can enter the data in this manner into a data frame, but it requires more effort, especially when the number of participants is large. Fortunately, there is a convenient method to transform data from wide to long format.
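
To give an impression of the manual route, the long table above could be typed in roughly as follows. This is only a sketch: the name df_long_manual is hypothetical, and the column name testtype is used instead of "kind of test" because column names without spaces are easier to work with in R.

```r
# Each participant appears twice: once for the pretest, once for the posttest
df_long_manual <- data.frame(
  participant = rep(c("A", "B", "C", "D"), times = 2),
  testtype    = rep(c("pretest", "posttest"), each = 4),
  score       = c(525, 610, 630, 575,   # pretest scores of A, B, C, D
                  780, 790, 840, 800)   # posttest scores of A, B, C, D
)
df_long_manual
```

The participant labels and test labels must be repeated in exactly the right order, which is where mistakes tend to creep in as the number of participants or tests grows.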

Conversion from wide to long format

To convert a data frame from wide format to long format, it is most convenient to use the function melt() from the package reshape2:

> install.packages("reshape2")
> library(reshape2)
> df_long <- melt(df_wide, value.name="score")
Using participant as id variables
> df_long
  participant  variable score
1           A   pretest   525
2           B   pretest   610
3           C   pretest   630
4           D   pretest   575
5           A  posttest   780
6           B  posttest   790
7           C  posttest   840
8           D  posttest   800
> names(df_long)[2] <- "testtype" # adjust name
> df_long  
  participant  testtype score
1           A   pretest   525
2           B   pretest   610
3           C   pretest   630
4           D   pretest   575
5           A  posttest   780
6           B  posttest   790
7           C  posttest   840
8           D  posttest   800
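
To illustrate why the long format matters, here is a small sketch of a base R function that expects it: aggregate() with a formula needs the scores in one column and the grouping in another. The long data frame is rebuilt by hand so that the sketch runs without reshape2; the name mean_scores is an assumption made for this example.

```r
# Long-format data with the same values as df_long above
df_long <- data.frame(
  participant = rep(c("A", "B", "C", "D"), times = 2),
  testtype    = factor(rep(c("pretest", "posttest"), each = 4),
                       levels = c("pretest", "posttest")),
  score       = c(525, 610, 630, 575, 780, 790, 840, 800)
)

# Mean score per test type: the formula reads "score grouped by testtype"
mean_scores <- aggregate(score ~ testtype, data = df_long, FUN = mean)
mean_scores
```

With the data in wide format, the same formula could not be written, because there is no single score column and no grouping column.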

Alternative

We recommend melt() from the package reshape2 for converting from wide to long format because it is the most straightforward approach. An alternative is the function reshape() from base R (it is part of the stats package, which is loaded by default). The code is more complex and would look like this:

> df_long2 <- reshape(
+   data = df_wide,                     # name of wide data frame to be converted
+   direction = "long",                 # conversion direction: long format
+   varying = c("pretest", "posttest"), # names of columns to be merged
+   idvar = "participant",              # column that identifies each participant
+   v.names = "score",                  # new column name for measured values
+   timevar = "testtype",               # column name with distinguished cases as values
+   times = c("pretest", "posttest")    # names of distinguished cases (pre- and posttest)
+ )
> df_long2
           participant testtype score
A.pretest            A  pretest   525
B.pretest            B  pretest   610
C.pretest            C  pretest   630
D.pretest            D  pretest   575
A.posttest           A posttest   780
B.posttest           B posttest   790
C.posttest           C posttest   840
D.posttest           D posttest   800

You can see here that R builds the row names itself, by combining each participant's ID with the name of the test.
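
If these generated row names are not wanted, they can be replaced by the default numbering. The sketch below repeats the reshape() call from above on a rebuilt copy of the example data and then resets the row names; it is one possible way to do this, not the only one.

```r
df_wide <- data.frame(
  participant = c("A", "B", "C", "D"),
  pretest     = c(525, 610, 630, 575),
  posttest    = c(780, 790, 840, 800)
)

df_long2 <- reshape(
  data = df_wide,
  direction = "long",
  varying = c("pretest", "posttest"),
  idvar = "participant",
  v.names = "score",
  timevar = "testtype",
  times = c("pretest", "posttest")
)

# Assigning NULL drops the generated names and restores the numbering 1, 2, ...
rownames(df_long2) <- NULL
df_long2
```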

Conversion from long to wide format

To convert a data frame from long format to wide format, it is most convenient to use the function dcast() from the package reshape2. In our example, we can do the following:

> df_wide2 <- dcast(df_long, participant~testtype, value.var="score")
> df_wide2
  participant pretest posttest
1           A     525      780
2           B     610      790
3           C     630      840
4           D     575      800

Alternative

We recommend dcast() from the package reshape2 for converting from long to wide format because it is the most straightforward approach. An alternative is the function reshape() from base R (it is part of the stats package, which is loaded by default). The code is more complex and would look like this:

> df_wide2 <- reshape(
+   data = df_long,       # data frame of long format to be converted
+   direction = "wide",   # conversion direction: wide format
+   v.names = "score",    # name of column with measured values
+   timevar = "testtype", # column name with distinguished cases as values
+   idvar = "participant" # column that identifies each participant
+ )
> df_wide2
  participant score.pretest score.posttest
1           A           525            780
2           B           610            790
3           C           630            840
4           D           575            800

Again, R builds the new column names itself, this time by prefixing each test name with the name of the value column. You can rename the columns if needed:

> names(df_wide2)[2:3] <- c("pretest", "posttest") # adjust column names
> df_wide2
  participant pretest posttest
1           A     525      780
2           B     610      790
3           C     630      840
4           D     575      800
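
As a quick sanity check, you may want to confirm that converting from wide to long and back again returns the original values. The sketch below does this with base R only, using the same reshape() arguments as above; the names df_back and same_values are hypothetical and chosen for this example.

```r
df_wide <- data.frame(
  participant = c("A", "B", "C", "D"),
  pretest     = c(525, 610, 630, 575),
  posttest    = c(780, 790, 840, 800)
)

# Wide to long
df_long <- reshape(
  data = df_wide, direction = "long",
  varying = c("pretest", "posttest"), v.names = "score",
  timevar = "testtype", times = c("pretest", "posttest"),
  idvar = "participant"
)

# Long back to wide, then restore the original column names
df_back <- reshape(
  data = df_long, direction = "wide",
  v.names = "score", timevar = "testtype", idvar = "participant"
)
names(df_back)[2:3] <- c("pretest", "posttest")

# Compare column by column; row names and attributes are ignored on purpose
same_values <- identical(df_back$participant, df_wide$participant) &&
  identical(df_back$pretest, df_wide$pretest) &&
  identical(df_back$posttest, df_wide$posttest)
same_values
```

The comparison is done per column because reshape() attaches its own row names and attributes to the result, so the two data frames are not expected to be identical as whole objects.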