Combining and cleaning data

Basic skills in R: Getting data into and out of R

Combining and cleaning data

Content

Creation of two datasets for the examples
Merging data frames
Removing outliers
Removing duplicated data

Creation of two datasets for the examples We start again with the creation of a fictitious data set of students' exam marks and export the data with the function write.table() to a csv file csv.

> studentID <- c(101, 34, 76, 104, 111, 83)
> firstname <- c("John", "Joe", "Carla", "Daniel", "Robert", "Nadia")
> lastname <- c("Johnson", "Peters", "Johnson", "Smith", "Baker", "Fisher")
> maths <- c(7.5, 4.5, 8, 5.5, 9, 10)
> stats <- c(5, 5, 8.5, 6.5, 99, 9.5)
> df_marks <- data.frame(studentID, firstname, lastname, maths, stats)
> setwd("C:/temp")
> write.table(df_marks, file="marks.csv", sep = ",")

We create another dataset of students' marks for the resit and export these data to a second csv file.

> studentID <- c(101, 34, 76, 104, 111, 83, 55)
> resit <- c(7, 6, NA, NA, NA, NA, 8)
> df_resit <- data.frame(studentID, resit)
> write.table(df_resit, file="marks_resit.csv", sep = ",")

Combining data frames We assume that the two data frames have been imported or taken from the session above.

> df_marks
  studentID firstname lastname maths stats
1       101      John  Johnson   7.5   5.0
2        34       Joe   Peters   4.5   5.0
3        76     Carla  Johnson   8.0   8.5
4       104    Daniel    Smith   5.5   6.5
5       111    Robert    Baker   9.0  99.0
6        83     Nadia   Fisher  10.0   9.5
> df_resit
  studentID resit
1       101     7
2        34     6
3        76    NA
4       104    NA
5       111    NA
6        83    NA
7        55     8

You can combine, or more precisely merge the two data frames on the basis of a common column via the function merge().

> all_marks <- merge(df_marks, df_resit, by="studentID")
> all_marks
  studentID firstname lastname maths stats resit
1        34       Joe   Peters   4.5   5.0     6
2        76     Carla  Johnson   8.0   8.5    NA
3        83     Nadia   Fisher  10.0   9.5    NA
4       101      John  Johnson   7.5   5.0     7
5       104    Daniel    Smith   5.5   6.5    NA
6       111    Robert    Baker   9.0  99.0    NA

Note that we only merge rows that have a common value for studentID. In case we want to merge everything into one data frame, thne we must use the option All=TRUE.

> merge(df_marks, df_resit, by="studentID", all=TRUE)
  studentID firstname lastname maths stats resit
1        34       Joe     Peters      4.5        5.0          6
2        55      <NA>       <NA>       NA         NA          8
3        76     Carla    Johnson      8.0        8.5         NA
4        83     Nadia     Fisher     10.0        9.5         NA
5       101      John    Johnson      7.5        5.0          7
6       104    Daniel      Smith      5.5        6.5         NA
7       111    Robert      Baker      9.0       99.0         NA

Removing outliers We continue with the variabele all_marks from the previous example.

> all_marks
  studentID firstname lastname maths stats resit
1        34       Joe   Peters   4.5   5.0     6
2        76     Carla  Johnson   8.0   8.5    NA
3        83     Nadia   Fisher  10.0   9.5    NA
4       101      John  Johnson   7.5   5.0     7
5       104    Daniel    Smith   5.5   6.5    NA
6       111    Robert    Baker   9.0  99.0    NA

In the column with marks for the stats exam you see a mark of 99; this is impossible on a scale from 1 to 10 and so it is desirable to remove this outlier. This can be done in our example in the following way

> cleaned_marks <- all_marks[all_marks$stats <= 10, ]
> cleaned_marks 
  studentID firstname lastname maths stats resit

1        34       Joe   Peters   4.5   5.0     6
2        76     Carla  Johnson   8.0   8.5    NA
3        83     Nadia   Fisher  10.0   9.5    NA
4       101      John  Johnson   7.5   5.0     7
5       104    Daniel    Smith   5.5   6.5    NA

Removing duplicated data We continue with the variable cleaned.marks from the previous example. We add a former row to it so that some row occurs twice in the dataset.

> N.Fisher <- cleaned_marks[3,] 
> duplicated_data <- rbind(cleaned_marks, N.Fisher)
> duplicated_data

  studentID firstname lastname maths stats resit

1        34       Joe   Peters   4.5   5.0     6
2        76     Carla  Johnson   8.0   8.5    NA
3        83     Nadia   Fisher  10.0   9.5    NA
4       101      John  Johnson   7.5   5.0     7
5       104    Daniel    Smith   5.5   6.5    NA

31       83     Nadia   Fisher  10.0   9.5    NA

The function duplicated() created a logical vector in which the value TRUE indicated elements of a vector of rows in a data frame that occurred before in the object. You can use this function also to remove double data. In our example:

> duplicated(duplicated_data)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
> which(duplicated(duplicated_data))  # which double row(s)?
[1] 6
> cleaned_marks <- duplicated_data[!duplicated(duplicated_data), ] 
> cleaned_marks

  studentID firstname lastname maths stats resit

1        34       Joe   Peters   4.5   5.0     6
2        76     Carla  Johnson   8.0   8.5    NA
3        83     Nadia   Fisher  10.0   9.5    NA
4       101      John  Johnson   7.5   5.0     7
5       104    Daniel    Smith   5.5   6.5    NA