Basic skills in R: Getting data into and out of R
Long versus wide format
Introduction and the dataset of the example
When you have entered, cleaned, and validated all the data, the dataset may still not have the correct structure. By structure, we mean how the independent and dependent variables, the subject IDs, and so on are organized. A distinction is made between a dataset in wide format and one in long format. It is important to know these two formats because statistical and visualisation functions in R usually require a long format, while the data may be available in a wide format. A conversion of the structure is then necessary. We explain the difference using the following example.
Suppose you are conducting research on the memory of elderly individuals and want to test the effect of a memory training program. At the beginning of the study, participants take a memory test; this is usually called the pretest. Subsequently, the participants undergo a memory training program for 4 weeks. At the end of the training period, they take another memory test; this is typically referred to as the posttest. The difference between the posttest and pretest scores gives an indication of the short-term effect of the memory training.
We use the fictitious table below in which the scores of four participants on the two memory tests are provided.
participant | pretest | posttest |
A | 525 | 780 |
B | 610 | 790 |
C | 630 | 840 |
D | 575 | 800 |
Table in wide format
The most intuitive way to store the measurement data from our example in a table is to give each participant a single row and to record that participant's measurements (the dependent variables) in consecutive columns. The participant's ID or label is placed in the first column and does not contain duplicate values. This is how we have tabulated the data above. A dataset organized this way is said to be in wide format.
In R, we can quickly enter the data in this manner into a data frame as follows:
> participant <- c("A", "B", "C", "D")
> pretest <- c(525, 610, 630, 575)
> posttest <- c(780, 790, 840, 800)
> df_wide <- data.frame(participant, pretest, posttest)
> df_wide
participant pretest posttest
1 A 525 780
2 B 610 790
3 C 630 840
4 D 575 800
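The wide format makes the difference score mentioned earlier easy to compute, because the two measurements of a participant sit next to each other in the same row. A minimal sketch in base R follows; the column name `difference` is our own choice for illustration and is not part of the original dataset.

```r
# Rebuild the wide data frame from the example so the block runs on its own
df_wide <- data.frame(
  participant = c("A", "B", "C", "D"),
  pretest     = c(525, 610, 630, 575),
  posttest    = c(780, 790, 840, 800)
)

# Difference score per participant: posttest minus pretest.
# "difference" is a hypothetical column name chosen for this sketch.
df_wide$difference <- df_wide$posttest - df_wide$pretest

df_wide
```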
Table in long format
There is another way to structure the data, which is in long format. In our example, all the scores from the different tests would be in the same column, with an additional column indicating which test, pretest or posttest, each measurement corresponds to, and a column indicating which participant the measurement comes from, as illustrated in the table below.
participant | kind of test | score |
A | pretest | 525 |
B | pretest | 610 |
C | pretest | 630 |
D | pretest | 575 |
A | posttest | 780 |
B | posttest | 790 |
C | posttest | 840 |
D | posttest | 800 |
In R, we can enter the data in this manner into a data frame, but it requires more effort, especially when the number of participants is large. Fortunately, there is a convenient method to transform data from wide to long format.
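For a dataset as small as this one, entering the long format by hand might look like the sketch below. It uses the base R function rep() to repeat the participant labels and the test names; the object name `df_long_manual` is our own and only serves this illustration. With many participants or many tests, this approach quickly becomes tedious and error-prone.

```r
# Each participant appears once per test, so the labels are repeated twice
participant <- rep(c("A", "B", "C", "D"), times = 2)

# First four rows are pretest scores, last four are posttest scores
testtype <- rep(c("pretest", "posttest"), each = 4)

# Scores in the same order as the long table: pretest block, then posttest block
score <- c(525, 610, 630, 575, 780, 790, 840, 800)

# "df_long_manual" is a hypothetical name used for this sketch
df_long_manual <- data.frame(participant, testtype, score)
df_long_manual
```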
Conversion from wide to long format
To convert a data frame from wide format to long format, it is most convenient to use the function melt() from the package reshape2:
> install.packages("reshape2")
> library(reshape2)
> df_long <- melt(df_wide, value.name="score")
Using participant as id variables
> df_long
participant variable score
1 A pretest 525
2 B pretest 610
3 C pretest 630
4 D pretest 575
5 A posttest 780
6 B posttest 790
7 C posttest 840
8 D posttest 800
> names(df_long)[2] <- "testtype" # adjust name
> df_long
participant testtype score
1 A pretest 525
2 B pretest 610
3 C pretest 630
4 D pretest 575
5 A posttest 780
6 B posttest 790
7 C posttest 840
8 D posttest 800
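The call above lets melt() guess the id variable and names the new column afterwards. As a variation, the sketch below states everything explicitly through the melt() arguments id.vars, measure.vars, variable.name, and value.name, which avoids both the message and the separate renaming step. It assumes the reshape2 package is already installed.

```r
library(reshape2)  # assumes reshape2 has been installed beforehand

# Rebuild the wide data frame from the example so the block runs on its own
df_wide <- data.frame(
  participant = c("A", "B", "C", "D"),
  pretest     = c(525, 610, 630, 575),
  posttest    = c(780, 790, 840, 800)
)

# Name the id column, the columns to stack, and the two new columns explicitly
df_long <- melt(
  df_wide,
  id.vars       = "participant",
  measure.vars  = c("pretest", "posttest"),
  variable.name = "testtype",
  value.name    = "score"
)

df_long
```

Being explicit is mainly useful when a data frame contains several text columns, because melt() would otherwise treat all of them as id variables.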
Conversion from long to wide format
To convert a data frame from long format to wide format, it is most convenient to use the function dcast() from the package reshape2. In our example, we can do the following:
> df_wide2 <- dcast(df_long, participant~testtype, value.var="score")
> df_wide2
participant pretest posttest
1 A 525 780
2 B 610 790
3 C 630 840
4 D 575 800
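To check that nothing was lost in the round trip from wide to long and back, the result can be compared with the original column by column. The following is a minimal sketch under the assumption that reshape2 is installed; the helper variable `same_values` is our own name and not part of the package.

```r
library(reshape2)  # assumes reshape2 has been installed beforehand

# Original wide data frame from the example
df_wide <- data.frame(
  participant = c("A", "B", "C", "D"),
  pretest     = c(525, 610, 630, 575),
  posttest    = c(780, 790, 840, 800)
)

# Wide -> long -> wide
df_long  <- melt(df_wide, id.vars = "participant",
                 variable.name = "testtype", value.name = "score")
df_wide2 <- dcast(df_long, participant ~ testtype, value.var = "score")

# Compare the values column by column; "same_values" is a hypothetical helper
same_values <- all(df_wide2$participant == df_wide$participant) &&
  all(df_wide2$pretest == df_wide$pretest) &&
  all(df_wide2$posttest == df_wide$posttest)

same_values
```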