Basic skills in R: Getting data into and out of R
Combining and cleaning data
Creation of two datasets for the examples We start again with the creation of a fictitious data set of students' exam marks and export the data with the function write.table()
to a csv file csv.
> studentID <- c(101, 34, 76, 104, 111, 83) > firstname <- c("John", "Joe", "Carla", "Daniel", "Robert", "Nadia") > lastname <- c("Johnson", "Peters", "Johnson", "Smith", "Baker", "Fisher") > maths <- c(7.5, 4.5, 8, 5.5, 9, 10) > stats <- c(5, 5, 8.5, 6.5, 99, 9.5) > df_marks <- data.frame(studentID, firstname, lastname, maths, stats) > setwd("C:/temp") > write.table(df_marks, file="marks.csv", sep = ",")
We create another dataset of students' marks for the resit and export these data to a second csv file.
> studentID <- c(101, 34, 76, 104, 111, 83, 55)
> resit <- c(7, 6, NA, NA, NA, NA, 8) > df_resit <- data.frame(studentID, resit) > write.table(df_resit, file="marks_resit.csv", sep = ",")
Combining data frames We assume that the two data frames have been imported or taken from the session above.
> df_marks studentID firstname lastname maths stats 1 101 John Johnson 7.5 5.0 2 34 Joe Peters 4.5 5.0 3 76 Carla Johnson 8.0 8.5 4 104 Daniel Smith 5.5 6.5 5 111 Robert Baker 9.0 99.0 6 83 Nadia Fisher 10.0 9.5 > df_resit studentID resit 1 101 7 2 34 6 3 76 NA 4 104 NA 5 111 NA 6 83 NA 7 55 8
You can combine, or more precisely merge the two data frames on the basis of a common column via the function merge()
.
> all_marks <- merge(df_marks, df_resit, by="studentID") > all_marks studentID firstname lastname maths stats resit 1 34 Joe Peters 4.5 5.0 6 2 76 Carla Johnson 8.0 8.5 NA 3 83 Nadia Fisher 10.0 9.5 NA 4 101 John Johnson 7.5 5.0 7 5 104 Daniel Smith 5.5 6.5 NA 6 111 Robert Baker 9.0 99.0 NA
Note that we only merge rows that have a common value for studentID
. In case we want to merge everything into one data frame, thne we must use the option All=TRUE.
> merge(df_marks, df_resit, by="studentID", all=TRUE) studentID firstname lastname maths stats resit 1 34 Joe Peters 4.5 5.0 6 2 55 <NA> <NA> NA NA 8 3 76 Carla Johnson 8.0 8.5 NA 4 83 Nadia Fisher 10.0 9.5 NA 5 101 John Johnson 7.5 5.0 7 6 104 Daniel Smith 5.5 6.5 NA 7 111 Robert Baker 9.0 99.0 NA
Removing outliers We continue with the variabele all_marks
from the previous example.
> all_marks studentID firstname lastname maths stats resit 1 34 Joe Peters 4.5 5.0 6 2 76 Carla Johnson 8.0 8.5 NA 3 83 Nadia Fisher 10.0 9.5 NA 4 101 John Johnson 7.5 5.0 7 5 104 Daniel Smith 5.5 6.5 NA 6 111 Robert Baker 9.0 99.0 NA
In the column with marks for the stats exam you see a mark of 99; this is impossible on a scale from 1 to 10 and so it is desirable to remove this outlier. This can be done in our example in the following way
> cleaned_marks <- all_marks[all_marks$stats <= 10, ] > cleaned_marks studentID firstname lastname maths stats resit
1 34 Joe Peters 4.5 5.0 6
2 76 Carla Johnson 8.0 8.5 NA
3 83 Nadia Fisher 10.0 9.5 NA
4 101 John Johnson 7.5 5.0 7
5 104 Daniel Smith 5.5 6.5 NA
Removing duplicated data We continue with the variable cleaned.marks
from the previous example. We add a former row to it so that some row occurs twice in the dataset.
> N.Fisher <- cleaned_marks[3,] > duplicated_data <- rbind(cleaned_marks, N.Fisher) > duplicated_data
studentID firstname lastname maths stats resit
1 34 Joe Peters 4.5 5.0 6
2 76 Carla Johnson 8.0 8.5 NA
3 83 Nadia Fisher 10.0 9.5 NA
4 101 John Johnson 7.5 5.0 7
5 104 Daniel Smith 5.5 6.5 NA
31 83 Nadia Fisher 10.0 9.5 NA
The function duplicated()
created a logical vector in which the value TRUE indicated elements of a vector of rows in a data frame that occurred before in the object. You can use this function also to remove double data. In our example:
> duplicated(duplicated_data) [1] FALSE FALSE FALSE FALSE FALSE TRUE
> which(duplicated(duplicated_data)) # which double row(s)? [1] 6
> cleaned_marks <- duplicated_data[!duplicated(duplicated_data), ]
> cleaned_marks
studentID firstname lastname maths stats resit
1 34 Joe Peters 4.5 5.0 6
2 76 Carla Johnson 8.0 8.5 NA
3 83 Nadia Fisher 10.0 9.5 NA
4 101 John Johnson 7.5 5.0 7
5 104 Daniel Smith 5.5 6.5 NA