Importing data from a delimited text file

Basic skills in R: Getting data into and out of R

Importing data from a delimited text file

Content

The dataset used in examples
Entering data from the keyboard and writing to a file
Data import via read.table()
The package readr

In R we can use functions whose names start with read. and write. to import and export data for the file types that we want. As main example we will use measurement data from a study (MAS et al, 1999) about the concentration profile of the active substance after oral administration of 125 mg MDMA (better known as Ecstasy or XTC). The plasma concentration $C$ - time $t$ data are listed in the table below.

$t$ (hours)	$C$ (µg/L)
0.25	0.0
0.50	19.2
0.75	74.6
1.00	125.0
1.50	200.0
2.00	223.1
3.00	215.4
4.00	192.3
6.00	157.7
8.00	130.8
10.00	115.4
24.00	38.5

Entering data from the keyboard and writing to a file First we will enter the above data from the keyboard, while omitting on purpose the final row.

First we set the working directory and show the files present here

> setwd("C:/temp") # set working directory
> dir() # print working directory with files
[1] "age.csv"  "age.xlsx"

Next we enter the above dataset except the last row and create a date frame:

> t <- c(0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 
+        3.00, 4.00, 6.00, 8.00, 10.00)
> C <- c(0, 19.2, 74.6, 125.0, 200.0, 223.1, 
+        215.4, 192.3, 157.7, 130.8, 115.4)
> df <- data.frame(t,C)

We export the current dataset via the function write.table() to the file MDMAdata.txt, with values separated by a semicolon, and with only column names in the first row.

> write.table(df, file="MDMAdata.txt", sep=";", 
+             row.names=FALSE, col.names=TRUE)

Let's check that a new file has occurred in the working directory:

> dir()  # extra file in C:/temp
[1] "age.csv"      "age.xlsx"     "MDMAdata.txt"

Now we append more data to the file:

> write.table(data.frame(t=24.00, C=38.5), file="MDMAdata.txt", append=TRUE,
+             sep=";", row.names = FALSE, col.names = FALSE)

In order to be sure that we can import the exported dataset and get something new, we first remove the data frame df from the workpace before reading the dataset from the file MDMAdata.txt. In the function read.table() we only specify the filename and separator, and indicate that the first row is a header containing the column names of the data frame that we create from the imported data.

> rm(df)  # remove existing data frame df
> df <- read.table(file="MDMAdata.txt", header=TRUE, sep=";")
> df
       t     C
1   0.25   0.0
2   0.50  19.2
3   0.75  74.6
4   1.00 125.0
5   1.50 200.0
6   2.00 223.1
7   3.00 215.4
8   4.00 192.3
9   6.00 157.7
10  8.00 130.8
11 10.00 115.4
12 24.00  38.5

Data import via read.table() The above example already shows the main ingredients of the process of importing and exporting data via a delimited text file. Let's focus now on the import data from delimited text files using read.table() This function reads a file in table format and saves it as a data frame. Each row of the table appears as one line in the file. The syntax is

dataframe <- read.table(file, options)

where file is a delimited ASCII file and options are parameters controlling how data must be processed. The table below lists the most common options.

`header`	A logical value indicating whether the file contains the names of the variables in the first line.
`sep`	The delimiter separating data values. The default is sep="", which denotes one or more spaces, tabs, new lines, or carriage returns. Use `sep=";"` to read a semicolon-delimited file, `sep=","` to read a comma-delimited file, and `sep="\t"` to read a tab-delimited file.
`row.names`	An optional parameter specifying one or more variables to represent row identifiers.
`col.names`	If the first row of the data file does not contain the names of the variables (`header=FALSE`), you can use `col.names` to specify a character vector containing the names of the variables. If `header=FALSE` and the `col.names` option is omitted, variables will be named `V1`, `V2`, and so on.
`na.strings`	An optional character vector indicating codes for missing-values. For example, `na.strings=c("-999", "?")` converts each -999 and ? value to `NA` when data are read.
`colClasses`	An optional vector of classes to be assigned to the columns. For example, `colClasses=c("numeric", "character", "NULL", "numeric")` reads the first column as numeric values, reads the second column as character values, skips the third column, and reads the fourth column as numeric values. If there are more than four columns in the data, the values in `colClasses` are recycled. When you are reading large text files, including the colClasses option can considerably speed up processing.
`skip`	The number of lines in the data file to skip before beginning to read the data. This option is useful for skipping header comments in the file.
`stringAsFactor`	logical value indicating whether character variables should be converted to factors. When you are processing large text files, setting `stringsAsFactors=FALSE` can speed up processing.

The base-R environment also provides the functions read.csv() and read.delim() for importing rectangular text files. These are simply wrapper functions that call read.table() with specific defaults. For example, read.csv() calls read.table() with header=TRUE, and sep="," while read.delim() calls read.table() with header=TRUE and sep="\t" (a tab). Details are provided in the read.table() help.

Reading a table via an R script You can also include a table in an R script as input for the function read.table(). In our example, you can do it as follows:

df <- read.table(header=TRUE, stringsAsFactors=TRUE, text="
    t     C 
 0.25   0.0 
 0.50  19.2 
 0.75  74.6 
 1.00 125.0 
 1.50 200.0 
 2.00 223.1 
 3.00 215.4 
 4.00 192.3 
 6.00 157.7
 8.00 130.8
10.00 115.4 
24.00  38.5
")

The package readr

The package readr provides a powerful alternative to basic R functions for reading rectangular text files. The primary function is read_delim() with helper functions read_csv() and read_tsv() for reading comma-delimited and tab-delimited files, respectively. The main advantage of using the import functions of this package is that they are significantly faster than the basic R functions. Additionally, they are very good at guessing the correct data type of each column (numeric, character, date, and date-time). Functions in the this package return data as tibbles (data frames with some specialized features).

In the example below we use the file grades.csv, the comma-separated-values text file created by saving the Excel file studentgrades.xlsx in this format. This Excel sheet is a fictitious list of student grades for maths and stats exams.

Explanation

Carry out the following steps:

Download the Excel sheet studentgrades.xlsx and save it as the csv file studentgrades.csv in the directory C:/temp
Install the package readr (ignore warnings)
Activate the package via the function library().
Set the working directory.
Import the dataset with the function read_csv().
View the imported dataset via the command View(), explore the internal structure via the function str(), look at the first rows of the imported table via the function head(), or just display the data frame in the console

Sample session


> install.packages("readr")
> library(readr)
> setwd("C:/temp")
> grades <-
+   read_csv(file="studentgrades.csv")

-- Column specification ------
cols(
  studentID = col_double(),
  first.name = col_character(),
  last.name = col_character(),
  maths = col_double(),
  stats = col_double()
)
> View(grades)
> grades

# A tibble: 6 x 5
  studentID first.name last.name maths stats
      <dbl> <chr>      <chr>     <dbl> <dbl>
1       101 John       Smith       7.5   5  
2        34 John       Jones       4.5   5  
3        76 Jane       Smith       8     8.5
4       104 Daniel     Brown       6     6.5
5       111 John       Taylor      9     9  
6        83 Esther     William    10     9.5