Basic skills in R: Getting data into and out of R
Dealing with missing values (NAs) during and after data import
NA and NaN: how can you deal with this? Until now we have only worked with datasets in which all data were completely present. But this is the exception rather than the rule; much more often values are missing in dataset. Missing values are referred to in R as NA ( Not Available) in numerical datasets and as <NA> in factors. Please note that NA differs from NaN ; the latter is an acronym for Not a Number. Division by zero produces such a value in the following session:
> 0/0 [1] NaN > 1/0-1/0 [1] NaN
Missing values lead to problems when executing R functions. For example, in the session below, adding elements of the vector y
does not produce a number:
> class(y)
[1] "numeric"
> sum(y)
[1] NA
But fortunately, many functions have an option na.rm=TRUE
to indicate that missing values can be removed before the function is applied:
> sum(y, na.rm=TRUE)
[1] 6
Detecting missing values in an object can be done with the function is.na()
. The result of calling this function is a logical object indicating the location(s) where values are missing. Missing values can be removed using the function na.omit()
.
> is.na(y)
[1] FALSE FALSE TRUE FALSE
> which(is.na(y)) # which position has no value?
[1] 3
> na.omit(y)
[1] 1 2 3
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"
The first line of the output above is a vector with no missing values. The next two lines indicate the location where a value is missing. If you just want a vector of values, do the following:
> cleaned_y <- as.numeric(na.omit(y))
> str(cleaned_y)
num [1:3] 1 2 3
Summing the values now works without any problems:
> sum(cleaned_y)
[1] 6
Dataset for examples Let's look at an example of an incomplete data set of exam scores of students. We create a fictitious dataset with missing values and write the data to a csv file with the function write.table()
:
> studentID <- c(101, 34, "", 104, 111, 83) > firstname <- c("John", "Joe", "", "Daniel", "", "Nadia") > lastname <- c("Johnson", "Peters", "Johnson", "Smith", "Baker", "Fisher") > maths <- c(7.5, 4.5, 8, "", 9, 10) > stats <- c(5, 5, 8.5, 6.5, 9, 9.5) > marks <- data.frame(studentID, firstname, lastname, maths, stats) > marks studentID firstname lastname maths stats 1 101 John Johnson 7.5 5.0 2 34 Joe Peters 4.5 5.0 3 Johnson 8 8.5 4 104 Daniel Smith 6.5 5 111 Baker 9 9.0 6 83 Nadia Fisher 10 9.5 > setwd("C:/temp") > write.table(marks, file="marks.csv", sep = ","")
Importing a dataset with missing data When data is imported, missing values can appear in the table in many different ways, for example: "measurement failed", "lost" or simply an empty field as in our example with exam grades. When importing data, you can indicate in the read.table()
function via the option na.strings=...
how missing values in the dataset are marked. The function then turns these missing values into NA
and <NA>
values that can be interpreted by most functions in R. In our example where empty fields indicate missing values, the option is na.strings=""
, but if we had used an underscore - as a descriptor, we would use na.strings="-"
. With na.strings=c("-", "?", "")
you indicate that an underscore, a question mark and an empty field can be used in the file to be imported for missing values.
> df_marks <- read.table(file="marks.csv", header=TRUE, sep = ",", na.strings="") > df_marks studentID firstname lastname maths stats 1 101 John Johnson 7.5 5.0 2 34 Joe Peters 4.5 5.0 3 NA <NA> Johnson 8.0 8.5 4 104 Daniel Smith NA 6.5 5 111 <NA> Baker 9.0 9.0 6 83 Nadia Fisher 10.0 9.5
Dealing with missing data We often want to know whether a dataset contains missing values. The na.fail()
function returns the object if nothing is missing and gives a warning otherwise. Sometimes you can work around the missing data. For example, if we want to calculate the average grade for the math test, we can do this by ignoring the missing values:
> mean(df_marks$maths, na.rm=TRUE)
[1] 7.8
In many cases we want to continue working with a subset of the data without the missing values. You can find the locations where values are missing with the is.na()
function. The complete.cases()
function returns a logic vector indicating which rows are complete, that is, do not contain any missing values. In our example, these are the first two rows and the last row of the imported table. You can remove the rows with missing data using the na.omit()
function.
> na.fail(df_marks) # is there a missing value? Error in na.fail.default(df_marks) : missing values in object
> is.na(df_marks) # tests which positions are missing a value studentID firstname lastname maths stats 1 FALSE FALSE FALSE FALSE FALSE 2 FALSE FALSE FALSE FALSE FALSE 3 TRUE TRUE FALSE FALSE FALSE 4 FALSE FALSE FALSE TRUE FALSE 5 FALSE TRUE FALSE FALSE FALSE 6 FALSE FALSE FALSE FALSE FALSE
> na.omit(df_marks) # remove rows with a missing value
studentID firstname lastname maths stats
1 101 John Johnson 7.5 5.0
2 34 Joe Peters 4.5 5.0
6 83 Nadia Fisher 10.0 9.5
If you want to remove rows with missing values in a certain column, you have to be more specific. Suppose you want to delete the lines in the table in which a studentID value is missing, you can do this as follows:
> df <- df_marks[!(is.na(df_marks$studentID)), ]
> df # row with missing studentID has been removed
studentID firstname lastname maths stats
1 101 John Johnson 7.5 5.0
2 34 Joe Peters 4.5 5.0
4 104 Daniel Smith NA 6.5
5 111 <NA> Baker 9.0 9.0
6 83 Nadia Fisher 10.0 9.5
Suppose you want to remove all columns in the last data frame that are missing values. Then first transpose the table with the function t()
, remove rows with missing values, transpose the intermediate result again, make it a data frame, and ensure that columns have the desired data type instead of Factor. In our example, we can do this for the numeric columns as follows:
> cleaned_df <- as.data.frame(t(na.omit(t(df))))
> with(cleaned_df, {
+ studentID <- as.numeric(as.character(studentID))
+ stats <- as.numeric(as.character(stats))
+ })
> cleaned_df # columns with missing values removed
studentID lastname stats
1 101 Johnson 5.0
2 34 Peters 5.0
4 104 Smith 6.5
5 111 Baker 9.0
6 83 Fisher 9.5
Now we can calculate the average mark for the stats exam from the cleaned table without any problems:
> mean(cleaned_df$stats) # average mark for the stats exam
[1] 7.25