Basic skills in R: Working with data structures
Data frame
A data frame has rows and columns like a matrix, but a matrix must be filled with values of the same data type, whereas a data frame can hold values of different data types. Take a look at the data frame below, a small data set of people that we will use in the sample R sessions. The columns Age and Height hold numerical values, while the columns Name and Sex hold character values. Each row contains one set of values for one person, labelled with an identity number Id.
Id | Name | Age | Sex | Height |
1101 | John | 25 | Male | 180 |
2223 | Emily | 30 | Female | 165 |
1301 | Michael | 28 | Male | 175 |
4001 | Sophia | 22 | Female | 160 |
2262 | Michael | 32 | Male | 174 |
Data frames are used frequently in research, so it is useful to know how to work with them.
Explanation
To create a data frame, you can use the function data.frame(). This function requires vectors as input. Each vector then becomes a column in the data frame. You can write the vectors directly between the parentheses of the function data.frame(), or you can first create the vectors outside this function and insert them afterwards.
You can display the dimensions of a data frame via the function dim(). The number of rows and columns can also be obtained by the functions nrow() and ncol(), respectively. The internal structure can be shown via the function str(). In R versions before 4.0.0, or when the option stringsAsFactors=TRUE is set, it reveals that the function data.frame() has converted the character vectors into so-called factors, in which categorical values are internally stored as numbers.
The option row.names=... specifies the column whose values are used to label the cases in various printouts and graphs. You can ask for the row and column names used in a data frame with the functions rownames() and names(), respectively.
Sample session
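The steps described above can be sketched as follows. This is a minimal illustration, not the original session: the values come from the table at the top of this page, the name Michael is spelled the same way in rows 3 and 5 (as the later text about duplicate names implies), and stringsAsFactors=TRUE is passed explicitly so that the factor conversion shows up in any R version.

```r
# Column vectors created outside data.frame()
Id     <- c(1101, 2223, 1301, 4001, 2262)
Name   <- c("John", "Emily", "Michael", "Sophia", "Michael")
Age    <- c(25, 30, 28, 22, 32)
Sex    <- c("Male", "Female", "Male", "Female", "Male")
Height <- c(180, 165, 175, 160, 174)

# Each vector becomes one column of the data frame.
# stringsAsFactors = TRUE is an explicit choice made for this sketch.
persondata <- data.frame(Id, Name, Age, Sex, Height, stringsAsFactors = TRUE)

dim(persondata)       # number of rows and columns
nrow(persondata)      # number of rows
ncol(persondata)      # number of columns
str(persondata)       # internal structure; Name and Sex are shown as factors
rownames(persondata)  # row names (default labels, since row.names was not given)
names(persondata)     # column names
```

Writing the vectors directly inside the call, as in data.frame(Id = c(...), Name = c(...)), gives the same result.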
Explanation
Like vectors and matrices, data frames are subscriptable and mutable objects. We use the data frame persondata to explain how to select an element of the data frame and change it.
Look at the data frame persondata and you see, for example, that Emily's height is located at row 2, column 5. You can select this element with the positional index [2,5]. Alternatively, you may use row and column names. You can assign a new value to the selected element, and this changes the data frame itself.
Sometimes you work with a data frame that has a very large number of rows. If our data frame were much longer, it would take too long to scroll through it until you found Emily's row. Assuming that there is a unique Emily in the data set, you can use the instruction persondata[persondata$Name=="Emily", "Height"].
Sample session
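A minimal sketch of these selections, assuming persondata holds the values from the table at the top of this page and has the default row names "1" to "5" (the row names used in the original session are not known). The new height 166 is an arbitrary value chosen for illustration.

```r
# Rebuild the example data frame so that this sketch is self-contained
persondata <- data.frame(
  Id     = c(1101, 2223, 1301, 4001, 2262),
  Name   = c("John", "Emily", "Michael", "Sophia", "Michael"),
  Age    = c(25, 30, 28, 22, 32),
  Sex    = c("Male", "Female", "Male", "Female", "Male"),
  Height = c(180, 165, 175, 160, 174)
)

persondata[2, 5]           # Emily's height by position (row 2, column 5)
persondata["2", "Height"]  # the same element by row name and column name

# Assigning to the selected element changes the data frame itself
persondata[2, "Height"] <- 166

# Look up the row by a condition on Name instead of by position
emily_height <- persondata[persondata$Name == "Emily", "Height"]
emily_height
```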
The idea of the instruction persondata[persondata$Name=="Emily", "Height"] is as follows:
- persondata is the data frame to select an element from;
- the 1st argument informs R to look up in which row of persondata the name Emily occurs in the column Name;
- the 2nd argument informs R to use the column Height.
In case a name (like Michael) occurs more than once in the column Name, you get all rows that contain this name. In the sample session we find in this way the Ids of all persons in the data set who are called Michael.
You can also create new data frames by selection methods. For example, in the sample session we get a data frame with the names and ages of all persons having height equal to 160.
The row names of a data frame can be reset to the default names by assigning the empty object NULL to them.
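The three operations just described could look like the sketch below. It assumes the table values from the top of this page, with Michael spelled identically in rows 3 and 5; the variable names michael_ids and short are introduced here for illustration only.

```r
# Rebuild the example data frame so that this sketch is self-contained
persondata <- data.frame(
  Id     = c(1101, 2223, 1301, 4001, 2262),
  Name   = c("John", "Emily", "Michael", "Sophia", "Michael"),
  Age    = c(25, 30, 28, 22, 32),
  Sex    = c("Male", "Female", "Male", "Female", "Male"),
  Height = c(180, 165, 175, 160, 174)
)

# A name that occurs more than once selects all matching rows
michael_ids <- persondata[persondata$Name == "Michael", "Id"]
michael_ids

# A new data frame: names and ages of all persons with height 160
short <- persondata[persondata$Height == 160, c("Name", "Age")]
rownames(short)          # the row name is carried over from persondata

# Reset the row names to the default numbering
rownames(short) <- NULL
rownames(short)
```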
Working with data frames
Explanation
We continue to work with the data frame persondata that we created before.
Let's assume that we want to view only the columns labelled Name and Height. We can select them by a vector of column numbers or by a vector of column names; see the first instructions in the sample session.
You can select a single column, say the Id column, by the instruction persondata[ ,"Id"] (note that there is still a comma to separate the first element from the second element within the square brackets!), but it is easier to use the $ notation, for example persondata$Age for the age variable. The $ notation is used to indicate a particular column variable from a given data frame. For example, if we want to compute the average, minimum and maximum age of all persons in our data set, we can use the following code: c(mean(persondata$Age), min(persondata$Age), max(persondata$Age)). It can get tiresome typing persondata$ at the beginning of every variable name, but the function with() can simplify the code:
Sample session
The instruction with(persondata, c(mean(Age), min(Age), max(Age))) is shorter and more readable. By the way, the function summary() would have been the most convenient way to get descriptive statistics of the age variable.
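The column selections and the age statistics discussed above can be sketched as follows, assuming persondata holds the values from the table at the top of this page. The names age_stats and age_stats_with are introduced here for illustration only.

```r
# Rebuild the example data frame so that this sketch is self-contained
persondata <- data.frame(
  Id     = c(1101, 2223, 1301, 4001, 2262),
  Name   = c("John", "Emily", "Michael", "Sophia", "Michael"),
  Age    = c(25, 30, 28, 22, 32),
  Sex    = c("Male", "Female", "Male", "Female", "Male"),
  Height = c(180, 165, 175, 160, 174)
)

persondata[, c(2, 5)]              # Name and Height by column number
persondata[, c("Name", "Height")]  # the same columns by column name
persondata[, "Id"]                 # a single column, returned as a vector
persondata$Age                     # a single column via the $ notation

# Mean, minimum and maximum age, first with $ and then with with()
age_stats      <- c(mean(persondata$Age), min(persondata$Age), max(persondata$Age))
age_stats_with <- with(persondata, c(mean(Age), min(Age), max(Age)))
age_stats_with

summary(persondata$Age)            # descriptive statistics in one call
```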
Like vectors and matrices, data frames in R allow selection of components using logical expressions; see the last three instructions in the sample session.
Another way to filter data from a data frame is to use the function subset(). This function only needs to know two things from you:
- the name of the data frame;
- the condition to select rows.
subset(persondata, Age>25 & Age<30)
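To compare the two filtering styles, here is a minimal sketch that assumes persondata holds the values from the table at the top of this page. The particular conditions and the names by_index and by_subset are chosen for illustration and are not taken from the original session.

```r
# Rebuild the example data frame so that this sketch is self-contained
persondata <- data.frame(
  Id     = c(1101, 2223, 1301, 4001, 2262),
  Name   = c("John", "Emily", "Michael", "Sophia", "Michael"),
  Age    = c(25, 30, 28, 22, 32),
  Sex    = c("Male", "Female", "Male", "Female", "Male"),
  Height = c(180, 165, 175, 160, 174)
)

# Selection with logical expressions inside the square brackets
persondata[persondata$Age > 25, ]                                       # all columns
persondata[persondata$Sex == "Male" & persondata$Height > 174, "Name"]  # one column

# The same row filter written twice: with indexing and with subset()
by_index  <- persondata[persondata$Age > 25 & persondata$Age < 30, ]
by_subset <- subset(persondata, Age > 25 & Age < 30)
by_subset
```

Inside subset() the column names can be used directly, so the prefix persondata$ is not needed in the condition.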
Practice
1. Logarithmic table
- Create a data frame which you could use as a look-up table for logarithms. The first column contains the values for which you want to calculate the logarithm: let it be the numbers \[1, 2, 3, \ldots, 9, 10, 20, 30, \ldots, 90, 100, 200, 300, \ldots, 1000.\] The next columns contain the function values for the logarithms with base \(2\), \(e\), \(3\), and \(10\), respectively. Show the first few lines of the data frame.
- Delete the column in the data frame of values of the logarithm with base \(3\).
2. Fuel consumption of cars
R has the built-in data frame mtcars, extracted from the 1974 Motor Trend US magazine. It contains fuel consumption data and 10 aspects of automobile design and performance for 32 automobiles.
- Find out what data are collected and which car types are involved.
- Make a summary, in the descriptive statistics sense, of the miles per gallon variable.
- Draw point plots of miles per gallon against car weight and against number of cylinders.
- What are the names of the cars for which the miles per gallon is greater than 25 and the 1/4 mile time is less than 18 seconds?