3. Probability: Practical 3
Contingency Tables
To apply the probability formulas to real data, we have to do quite a bit of counting, and for that purpose the command table() is quite handy. Until now we have only used it to make a frequency distribution for a single variable. But it can also count over two variables and show the result in a contingency table (i.e. a frequency table with two dimensions).
For example, to look at the distribution of the passenger class for the different sexes, you can create a contingency table for the two variables sex and pclass with the command:
table(titanic$sex,titanic$pclass)
Notice that the categories from the first argument in the table() command (titanic$sex) become the rows in the contingency table, and the categories from the second argument (titanic$pclass) become the columns.
If you would like the rows and columns switched, you can simply change the order of the input arguments:
table(titanic$pclass,titanic$sex)
You can also add the variable names to the table to remember which is which and make the interpretation easier:
table(pclass = titanic$pclass, sex = titanic$sex)
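As a minimal sketch of how the argument order maps to rows and columns, the example below uses a small made-up data frame called toy (a hypothetical stand-in, not the titanic data) so that it runs on its own:

```r
# Hypothetical toy data (NOT the titanic data): six made-up passengers
toy <- data.frame(
  sex    = c("female", "male", "male", "female", "male", "male"),
  pclass = c(1, 3, 3, 2, 1, 3)
)

# First argument -> rows, second argument -> columns
counts <- table(sex = toy$sex, pclass = toy$pclass)
print(counts)

# Swapping the arguments swaps the rows and columns
swapped <- table(pclass = toy$pclass, sex = toy$sex)
print(swapped)
```

Because the arguments are named, the printed tables carry the labels sex and pclass, which makes it easier to see which variable ended up where.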
To find the answer, use the contingency table of the two variables (pclass and survived):
table(titanic$pclass, titanic$survived)
This command results in the following table:
pclass | survived = 0 | survived = 1
1 | 123 | 200
2 | 158 | 119
3 | 528 | 181
which shows that 528 of the passengers who traveled in third class died.
To turn a table of frequencies into a table of probabilities, we have to divide each cell in the table by the total number of passengers (i.e. the total number of observations: 1309). This can be achieved via the following commands:
gender_pclass <- table(titanic$sex,titanic$pclass)
gender_pclass/sum(gender_pclass)
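The same division can be tried without loading the data. The sketch below assumes the counts are typed in by hand from the pclass and survived table shown earlier; the object name pclass_survived is made up for this illustration:

```r
# Counts copied by hand from the pclass x survived table in the text
# (rows: pclass 1-3, columns: survived 0/1); the object name is hypothetical
pclass_survived <- as.table(matrix(
  c(123, 200,
    158, 119,
    528, 181),
  nrow = 3, byrow = TRUE,
  dimnames = list(pclass = c("1", "2", "3"), survived = c("0", "1"))
))

total <- sum(pclass_survived)      # total number of passengers
probs <- pclass_survived / total   # each cell divided by the total

print(total)
print(round(probs, 3))
print(sum(probs))                  # all cells together should sum to 1
```

Each cell of probs is a joint probability, for example the probability that a randomly chosen passenger traveled third class and died.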
You can also use the command prop.table() to do the same in a slightly easier way.
prop.table( table(titanic$sex,titanic$pclass) )
To find the answer, use the proportion table of the two variables (sex and survived). As you need to calculate a joint probability, you don't have to specify the margin argument:
prop.table(table(titanic$sex, titanic$survived))
This command results in the following table:
sex | survived = 0 | survived = 1
female | 0.097021 | 0.258976
male | 0.521008 | 0.122995
which shows that the probability that a passenger is female and survived is 0.259.
prop.table() has an additional useful option: it can calculate proportions per row or per column. This is controlled by a second input argument, margin. For example, to calculate the distribution over the three classes within each gender, you should use:
prop.table( table(titanic$sex,titanic$pclass), margin=1 )
And to calculate the distribution over gender within each passenger class, the following command should be used:
prop.table( table(titanic$sex,titanic$pclass), margin=2 )
For the last two examples, you can see that for margin=1 the proportions in each row sum to 1, and for margin=2 the proportions in each column sum to 1.
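One way to check this behaviour is with rowSums() and colSums(). The sketch below is only an illustration: it assumes the counts are typed in by hand from the pclass and survived table shown earlier, and the names counts, by_row and by_col are made up:

```r
# Counts copied by hand from the pclass x survived table in the text
counts <- as.table(matrix(
  c(123, 200,
    158, 119,
    528, 181),
  nrow = 3, byrow = TRUE,
  dimnames = list(pclass = c("1", "2", "3"), survived = c("0", "1"))
))

by_row <- prop.table(counts, margin = 1)  # proportions within each row
by_col <- prop.table(counts, margin = 2)  # proportions within each column

print(rowSums(by_row))  # every row of by_row should sum to 1
print(colSums(by_col))  # every column of by_col should sum to 1
```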
Note that if you change the order of the input arguments (titanic$sex and titanic$pclass in the example above), then the value of margin must also be changed to get the same result. In other words: the proportions in the following two tables are the same (they only have rows and columns interchanged).
prop.table( table(titanic$pclass,titanic$sex), margin=1 )
prop.table( table(titanic$sex,titanic$pclass), margin=2 )
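To see that the two calls really differ only by a transposition, the sketch below repeats them on a small made-up data frame called toy (hypothetical, not the titanic data) and compares the results cell by cell using t():

```r
# Hypothetical toy data (NOT the titanic data): six made-up passengers
toy <- data.frame(
  sex    = c("female", "male", "male", "female", "male", "male"),
  pclass = c(1, 3, 3, 2, 1, 3)
)

# pclass in the rows, proportions per row
a <- prop.table(table(toy$pclass, toy$sex), margin = 1)
# sex in the rows, proportions per column
b <- prop.table(table(toy$sex, toy$pclass), margin = 2)

print(a)
print(b)

# t() swaps the rows and columns of b, so it lines up with a
print(all(abs(a - t(b)) < 1e-12))
```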
To find the answer, use the proportion table of the two variables (sex and pclass). In this exercise, you need to calculate the probability that a passenger traveled in first class, given that the passenger is female. This is a conditional probability, which means you need to specify the margin argument: you need to make sure that the proportions within each sex sum to 1. If you put sex in the rows (listed first), you should specify margin = 1:
prop.table(table(titanic$sex, titanic$pclass), margin = 1)
This command results in the following table:
sex | pclass = 1 | pclass = 2 | pclass = 3
female | 0.309013 | 0.227468 | 0.463519
male | 0.212337 | 0.202847 | 0.584816
which shows that the probability is 0.309.
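The link between the joint, marginal and conditional probabilities in these answers can be sketched with plain arithmetic. The numbers below are copied from the proportion tables printed above; the variable names are made up for this illustration, and the results are only as precise as the rounded values they start from:

```r
# Values copied from the proportion tables in the text (rounded to 6 decimals)
p_female_died        <- 0.097021  # joint: female and survived = 0
p_female_survived    <- 0.258976  # joint: female and survived = 1
p_first_given_female <- 0.309013  # conditional, from the margin = 1 table

# Marginal probability of being female: add up the female row of the joint table
p_female <- p_female_died + p_female_survived

# Multiplication rule: P(first and female) = P(first | female) * P(female)
p_first_and_female <- p_first_given_female * p_female

print(p_female)
print(p_first_and_female)

# Dividing the joint by the marginal gives back the conditional probability
print(p_first_and_female / p_female)
```

This is the same calculation that margin = 1 performs for every cell: each joint proportion in a row is divided by the total of that row.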