Basic skills in R: Getting data into and out of R
Importing data from a delimited text file
The dataset used in examples In R we can use functions whose names start with read.
and write.
to import and export data for the file types that we want. As main example we will use measurement data from a study (MAS et al, 1999) about the concentration profile of the active substance after oral administration of 125 mg MDMA (better known as Ecstasy or XTC). The plasma concentration \(C\) - time \(t\) data are listed in the table below.
\(t\) (hours) |
\(C\) (µg/L) |
0.25 | 0.0 |
0.50 | 19.2 |
0.75 | 74.6 |
1.00 | 125.0 |
1.50 | 200.0 |
2.00 | 223.1 |
3.00 | 215.4 |
4.00 | 192.3 |
6.00 | 157.7 |
8.00 | 130.8 |
10.00 | 115.4 |
24.00 | 38.5 |
Entering data from the keyboard and writing to a file First we will enter the above data from the keyboard, while omitting on purpose the final row.
First we set the working directory and show the files present here
> setwd("C:/temp") # set working directory > dir() # print working directory with files [1] "age.csv" "age.xlsx"
Next we enter the above dataset except the last row and create a date frame:
> t <- c(0.25, 0.50, 0.75, 1.00, 1.50, 2.00,
+ 3.00, 4.00, 6.00, 8.00, 10.00) > C <- c(0, 19.2, 74.6, 125.0, 200.0, 223.1,
+ 215.4, 192.3, 157.7, 130.8, 115.4) > df <- data.frame(t,C)
We export the current dataset via the function write.table()
to the file MDMAdata.txt, with values separated by a semicolon, and with only column names in the first row.
> write.table(df, file="MDMAdata.txt", sep=";",
+ row.names=FALSE, col.names=TRUE)
Let's check that a new file has occurred in the working directory:
> dir() # extra file in C:/temp [1] "age.csv" "age.xlsx" "MDMAdata.txt"
Now we append more data to the file:
> write.table(data.frame(t=24.00, C=38.5), file="MDMAdata.txt", append=TRUE, + sep=";", row.names = FALSE, col.names = FALSE)
In order to be sure that we can import the exported dataset and get something new, we first remove the data frame df
from the workpace before reading the dataset from the file MDMAdata.txt. In the function read.table()
we only specify the filename and separator, and indicate that the first row is a header containing the column names of the data frame that we create from the imported data.
> rm(df) # remove existing data frame df > df <- read.table(file="MDMAdata.txt", header=TRUE, sep=";") > df t C 1 0.25 0.0 2 0.50 19.2 3 0.75 74.6 4 1.00 125.0 5 1.50 200.0 6 2.00 223.1 7 3.00 215.4 8 4.00 192.3 9 6.00 157.7 10 8.00 130.8 11 10.00 115.4 12 24.00 38.5
Data import via read.table() The above example already shows the main ingredients of the process of importing and exporting data via a delimited text file. Let's focus now on the import data from delimited text files using read.table()
This function reads a file in table format and saves it as a data frame. Each row of the table appears as one line in the file. The syntax is
dataframe <- read.table(file, options)
where file is a delimited ASCII file and options are parameters controlling how data must be processed. The table below lists the most common options.
header |
A logical value indicating whether the file contains the names of the variables in the first line. |
sep |
The delimiter separating data values. The default is sep="", which denotes one or more spaces, tabs, new lines, or carriage returns. Use sep=";" to read a semicolon-delimited file, sep="," to read a comma-delimited file, and sep="\t" to read a tab-delimited file. |
row.names |
An optional parameter specifying one or more variables to represent row identifiers. |
col.names |
If the first row of the data file does not contain the names of the variables (header=FALSE ), you can use col.names to specify a character vector containing the names of the variables. If header=FALSE and the col.names option is omitted, variables will be named V1 , V2 , and so on. |
na.strings |
An optional character vector indicating codes for missing-values. For example, na.strings=c("-999", "?") converts each -999 and ? value to NA when data are read. |
colClasses |
An optional vector of classes to be assigned to the columns. For example, colClasses=c("numeric", "character", "NULL", "numeric") reads the first column as numeric values, reads the second column as character values, skips the third column, and reads the fourth column as numeric values. If there are more than four columns in the data, the values in colClasses are recycled. When you are reading large text files, including the colClasses option can considerably speed up processing. |
skip |
The number of lines in the data file to skip before beginning to read the data. This option is useful for skipping header comments in the file. |
stringAsFactor |
logical value indicating whether character variables should be converted to factors. When you are processing large text files, setting stringsAsFactors=FALSE can speed up processing. |
The base-R environment also provides the functions read.csv()
and read.delim()
for importing rectangular text files. These are simply wrapper functions that call read.table()
with specific defaults. For example, read.csv()
calls read.table()
with header=TRUE
, and sep=","
while read.delim()
calls read.table()
with header=TRUE
and sep="\t"
(a tab). Details are provided in the read.table()
help.
The package readr
The package readr
provides a powerful alternative to basic R functions for reading rectangular text files. The primary function is read_delim()
with helper functions read_csv()
and read_tsv()
for reading comma-delimited and tab-delimited files, respectively. The main advantage of using the import functions of this package is that they are significantly faster than the basic R functions. Additionally, they are very good at guessing the correct data type of each column (numeric, character, date, and date-time). Functions in the this package return data as tibbles (data frames with some specialized features).
In the example below we use the file grades.csv, the comma-separated-values text file created by saving the Excel file studentgrades.xlsx in this format. This Excel sheet is a fictitious list of student grades for maths and stats exams.
Explanation
Carry out the following steps:
- Download the Excel sheet studentgrades.xlsx and save it as the csv file studentgrades.csv in the directory C:/temp
- Install the package
readr
(ignore warnings) - Activate the package via the function
library()
. - Set the working directory.
- Import the dataset with the function
read_csv()
. - View the imported dataset via the command
View()
, explore the internal structure via the functionstr()
, look at the first rows of the imported table via the functionhead()
, or just display the data frame in the console
Sample session
> install.packages("readr")
> library(readr)
> setwd("C:/temp")
> grades <-
+ read_csv(file="studentgrades.csv") -- Column specification ------ cols( studentID = col_double(), first.name = col_character(), last.name = col_character(), maths = col_double(), stats = col_double() )
> View(grades) > grades
# A tibble: 6 x 5 studentID first.name last.name maths stats <dbl> <chr> <chr> <dbl> <dbl> 1 101 John Smith 7.5 5 2 34 John Jones 4.5 5 3 76 Jane Smith 8 8.5 4 104 Daniel Brown 6 6.5 5 111 John Taylor 9 9 6 83 Esther William 10 9.5