Exploring the data

0. The Basics of R: Practical 0

Exploring the structure of a dataframe

A dataframe represents data in the form of a table. Each column contains the values of one variable, and each row contains the values for one observation. More information about the contents of a dataframe can be obtained with the commands str(), summary() and head().

str(): Prints the structure of the dataframe in a compact way. Each variable name is given (preceded by a $ sign), followed by an indication of the variable type, and then the first few values. The label 'Factor' can be taken as a synonym for 'Categorical'. The label 'int' refers to integers (numbers without decimals), and the label 'num' refers to numbers with decimals.
summary(): Prints a short overview of the contents of each variable in the dataframe. For the categorical variables, it lists how frequently each category occurs; when there are many categories, only the six most frequent are shown and the rest are pooled under '(Other)'. For the numerical variables, the 5-number summary and the mean are given.
head(): Prints the first 6 rows of the dataframe.
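
The three commands can be tried on a small dataframe before applying them to G. The sketch below uses a made-up dataframe called toy (its name and values are purely illustrative and are not part of the practical's data); stringsAsFactors = TRUE is set so that the text column is stored as a factor.

```r
# Hypothetical toy dataframe, not the practical's G.
# stringsAsFactors = TRUE stores the text column as a factor.
toy <- data.frame(
  country = c("Chile", "Ghana", "Japan", "Kenya", "Nepal", "Peru", "Spain", "Togo"),
  year    = c(1952L, 1957L, 1962L, 1967L, 1972L, 1977L, 1982L, 1987L),
  lifeExp = c(54.7, 45.8, 68.7, 48.6, 44.0, 58.4, 76.3, 55.5),
  stringsAsFactors = TRUE
)

str(toy)      # one line per variable: name, type, first values
summary(toy)  # counts for the factor, 5-number summary and mean for the numbers
head(toy)     # first 6 of the 8 rows
```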

Considering data frame G, what is the data type of the variable year?

integer

In R you could use the following command:

str(G)

'data.frame': 1704 obs. of 6 variables: 
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ... 
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ... 
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ... 
$ lifeExp : num 28.8 30.3 32 34 36.1 ... 
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 ... 
$ gdpPercap: num 779 821 853 836 740 ...

Alternatively, you could use the command class(G$year).
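
As a minimal sketch of the class() approach, the example below builds a hypothetical stand-in for G (called toy, with invented values) and asks for the type of one column. The sapply() call is an extra not mentioned in the practical; it applies class() to every column in one go.

```r
# Hypothetical stand-in for G with three columns of different types
toy <- data.frame(
  continent = c("Asia", "Africa", "Asia"),
  year      = c(1952L, 1957L, 1962L),
  lifeExp   = c(28.8, 30.3, 32.0),
  stringsAsFactors = TRUE
)

class(toy$year)      # type of a single column: "integer"
sapply(toy, class)   # type of every column at once
```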

Other useful commands to inspect the contents of a dataframe are:

Size:
- dim(G) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
- nrow(G) - returns the number of rows
- ncol(G) - returns the number of columns
Names:
- names(G) - returns the column names (synonym of colnames() for dataframes)
- rownames(G) - returns the row names.
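
The size and name commands can be illustrated on a small dataframe. The dataframe toy below is a hypothetical example with 4 rows and 3 columns, so the results are easy to check by eye.

```r
# Hypothetical dataframe with 4 rows and 3 columns
toy <- data.frame(
  year    = c(1952L, 1957L, 1962L, 1967L),
  lifeExp = c(28.8, 30.3, 32.0, 34.0),
  pop     = c(8425333L, 9240934L, 10267083L, 11537966L)
)

dim(toy)        # rows first, then columns: 4 3
nrow(toy)       # 4
ncol(toy)       # 3
names(toy)      # "year" "lifeExp" "pop"
rownames(toy)   # default row names are "1", "2", ... stored as text
```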

Selecting a variable

The different variables make up the different columns of the dataframe. You can select a column from a dataframe by using the $ symbol. The command G$lifeExp means: column lifeExp from dataframe G. So to copy column lifeExp into a new variable, the following notation can be used.

lifeExp <- G$lifeExp

The new object (lifeExp) is no longer a dataframe but a vector holding the data for one variable, so all of its values are of one type (numerical data in this case). The lifeExp variable also shows up in the Environment tab in the upper-right pane (under the section 'Values').
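
The difference between the dataframe and the extracted vector can be checked directly. This is a minimal sketch on a hypothetical dataframe toy (invented values), using is.data.frame(), is.vector() and class() to inspect the result.

```r
# Hypothetical dataframe standing in for G
toy <- data.frame(
  year    = c(1952L, 1957L, 1962L),
  lifeExp = c(28.8, 30.3, 32.0)
)

lifeExp <- toy$lifeExp   # copy one column into its own object

is.data.frame(lifeExp)   # FALSE: no longer a dataframe
is.vector(lifeExp)       # TRUE
class(lifeExp)           # "numeric"
length(lifeExp)          # one value per row of toy: 3
```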

Creating subsets

Dataframes have rows and columns. If you want to extract specific information from a dataframe, you specify which rows and columns you want between square brackets. Row numbers come first, followed by column numbers, separated by a comma. If you leave out the row numbers, all rows are returned; if you leave out the column numbers, all columns are returned. If you want multiple rows or columns, you can combine them with the c() command, or use the : operator for consecutive rows or columns.

# First element in the first column
G[1,1]
# First element in the 3rd column
G[1,3]
# First row
G[1,]
# First column
G[,1]
# First three elements in the 4th column
G[1:3,4]
# Elements from the second row, first and fifth column
G[2,c(1,5)]

Select the first 245 rows from dataframe G for columns 2 to 6 and save the result in a variable called subset.

subset <- G[1:245,2:6]
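
To check that a selection like this returns what you expect, it helps to look at its dimensions. The sketch below does so on a hypothetical 10-row dataframe toy (invented values); the result is stored under the illustrative name part, chosen only to avoid clashing with base R's subset() function.

```r
# Hypothetical dataframe with 10 rows and 4 columns
toy <- data.frame(
  id    = 1:10,
  group = rep(c("a", "b"), times = 5),
  x     = seq(0.5, 5, by = 0.5),
  y     = 10:1
)

part <- toy[1:4, 2:4]    # rows 1 to 4, columns 2 to 4
dim(part)                # 4 rows, 3 columns
names(part)              # "group" "x" "y"

one_col <- toy[1:4, 2]   # selecting a single column returns a vector
is.data.frame(one_col)   # FALSE
```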
