Cross Tables

10. Categorical Association: Practical 10

Cross Tables

Let's start by making a table and barplot of one of our variables, just like we did in practicum 3.

Make a barplot of the variable "wash".

Which category contains the most observations?

Solution

Category "disagree" occurs most often.

You can first check the table

table(WD$ wash)

and then make a barplot of the table

barplot(table(WD$ wash))

New example

A cross table shows the frequency distribution of two or more variables. This gives you insight into the relationship between two categorical variables. The underlying question is whether one or more categories or combinations of categories have a frequency that deviates from the overall pattern.

Make a cross table of the variables wash and butt.

Solution

We need a few commands to calculate the crosstable:

t <- table(WD$ wash, WD$ butt)
r <- rowSums(t)
c <- colSums(t)
tot <- sum(t)

Combining all this information results in the following table:

wash	butt
		yes	no	sum
	strongly disagree	6	3	9
	disagree	71	26	97
	neutral	24	7	31
	agree	26	21	47
	strongly agree	6	6	12
	sum	133	63	196

New example

In the grey areas at the top and left, the variables and the different categories that occur in these are given. The central (white) part of the table contains the joint frequencies. And the blue areas contain the frequency distributions for the individual variables. With 'deviating from the overall pattern' we mean that the distribution for the joint frequencies for one category (the distribution over one row) is very different from the distribution for the individual variable as a whole (the values in the blue margin at the bottom).

As you see, it is very hard to compare so many numbers to discern these kinds of patterns. Converting the frequencies into proportions helps, but is not ideal either.

The best way to search for relationships or patterns in a cross table is by visualising it via a so-called mosaic plot. We can create a mosaic plot with the command mosaicplot() .

observed <- table(WD$bath, WD$wash)
mosaicplot(observed,xlab='bath', ylab='wash')

Figure 1 gives an example of a mosaic plot for the variables 'wash' and 'bath'.

Figure 1. Mosaic plot for the variables wash and bath in the WD dataframe.

The areas of the rectangles in a mosaic plot correspond to the number of occurrences for a specific combination of two categorical variables. In this specific example the relative height of the top left rectangle shows the share of the observations that have value 1 (strongly disagree) for the variable "wash", and its width shows the share of the observations that have value 1 (regularly) for the variable "bath".

The way to read a mosaic plot is by comparing columns (rows) among each other. If two categorical variables are related, it means that the distribution of cell sizes for one column (row) is quite different from the distribution of the other columns (row).