Basic skills in R: Working with data structures
Factor
Explanation
A major difference between a factor created by the function factor() and a plain vector in R is that a factor stores the categorical values as a vector of integers in the range [1, number of unique values], and an internal vector of character strings (the original values) is mapped to these integers. The unique values in a factor are called levels. By default, the levels of a factor built from a character vector are sorted alphabetically. Setting the option ordered=TRUE additionally makes the factor ordered, so that its levels have a rank order. You can override the default level order by specifying the levels option. See the sample session below.
You can also convert a character vector into a factor via the function as.factor().
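To make the integer-plus-levels storage concrete, here is a minimal sketch that inspects a factor with base R functions. The vector name blood and its values are made up for illustration and are not part of the sample session.

```r
# Hypothetical example data (not from the sample session)
blood <- c("B", "A", "O", "A", "B")

# as.factor() converts the character vector; levels are sorted alphabetically
blood_f <- as.factor(blood)

levels(blood_f)      # the unique values, in level order
nlevels(blood_f)     # number of levels
as.integer(blood_f)  # the integer codes stored internally

# Indexing the levels by the codes recovers the original strings
recovered <- levels(blood_f)[as.integer(blood_f)]
identical(recovered, blood)
```

Each integer code is simply the position of the value in levels(blood_f), which is why the last step recovers the original strings.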
Sample session
> diabetes <- c("Type 1", "Type 2",
+               "Type 1", "Type 1")
> diabetes <- factor(diabetes)
> str(diabetes)
 Factor w/ 2 levels "Type 1","Type 2": 1 2 1 1
> status <- c("Improved", "Poor",
+             "Poor", "Excellent")
> status <- factor(status, ordered=TRUE)
> str(status)
 Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 2 3 3 1
> status <- factor(status,
+                  ordered=TRUE, levels=c("Poor",
+                  "Improved", "Excellent"))
> str(status)
 Ord.factor w/ 3 levels "Poor"<"Improved"<..: 2 1 1 3
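As a small illustration of what an ordered factor like status makes possible, the sketch below rebuilds it and compares its values by rank. It uses only base R. The variable name not_excellent is introduced here for illustration.

```r
# Rebuild the ordered factor from the session with an explicit level order
status <- factor(c("Improved", "Poor", "Poor", "Excellent"),
                 ordered = TRUE,
                 levels = c("Poor", "Improved", "Excellent"))

# Comparisons follow the level order Poor < Improved < Excellent
not_excellent <- status < "Excellent"
not_excellent

# The highest-ranked value present in the vector
as.character(max(status))
```

With an unordered factor, comparisons such as < are not meaningful, which is one reason to set ordered=TRUE for ranked categories.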
Four reasons to use factors in R
We list the four major reasons for using factors instead of character vectors:
- Use of some important statistical functions
Some statistical functions in R expect grouping variables to be factors, for example the ANOVA function aov(). ANOVA is one of the most frequently used statistical tests, so it helps to know how to provide its input.
- Automatic naming of axes and legends of plots
Plotting functions in R that create plots such as line graphs, bar plots, and boxplots automatically use the factor levels as labels for axes and legends. That saves you time and effort. How to make plots (with labels and legends) is discussed at the end of this chapter.
- Checking (large) vectors quickly for typos
If you made a typo somewhere in your vector, inspecting the factor levels is a quick way to find it. Is there a category you didn't intend to have? Did you forget to include any data? Are any categories missing?
- Saving computer memory
Internally, factors are stored as a vector of integers (whole numbers), with each integer representing a specific category. Integers take up very little memory. If you have a large vector with only a few distinct values, turning it into a factor can save memory.
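The sketch below illustrates three of these reasons with base R only. All data and names (dose, response, city, big) are made up for illustration. The memory comparison assumes a 64-bit build of R, and exact byte counts vary between versions, so none are shown.

```r
# 1. Statistical functions: aov() treats a numeric grouping variable as a
#    continuous predictor unless it is converted to a factor.
dose     <- rep(c(1, 2, 3), each = 4)   # three groups coded as numbers
response <- c(5.1, 4.8, 5.3, 5.0,
              6.2, 6.0, 6.5, 6.1,
              5.6, 5.9, 5.7, 5.8)

fit_numeric <- aov(response ~ dose)          # dose as a single slope
fit_factor  <- aov(response ~ factor(dose))  # dose as three groups

# Degrees of freedom for the dose term in each fit
df_numeric <- summary(fit_numeric)[[1]]$Df[1]
df_factor  <- summary(fit_factor)[[1]]$Df[1]

# 2. Typo check: unexpected levels stand out immediately.
city <- factor(c("Leiden", "Delft", "Leiden", "Lieden", "Delft"))
levels(city)  # the misspelled "Lieden" shows up as its own level
table(city)   # counts per level help spot rare, misspelled categories

# 3. Memory: compare a large character vector with its factor version.
big      <- rep(c("Type 1", "Type 2"), times = 50000)
size_chr <- as.numeric(object.size(big))
size_fct <- as.numeric(object.size(factor(big)))
size_fct < size_chr
```

In the first part, the factor version uses two degrees of freedom for the three dose groups, while the numeric version fits a single slope with one degree of freedom, which is a different model.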