4. Probability Distributions: Practical 4
Objectives
Learn how to do the following in R:
- generate random data that follows a uniform, normal or binomial distribution with specified parameters using the commands runif(), rnorm() and rbinom()
- calculate the mean and variance of individual random variables or combinations of two random variables
- describe and interpret the probability distribution of a random variable on the basis of a frequency histogram and an (empirical) cumulative probability plot.
Random variables
R has commands that behave like a random variable: they generate values that are different each time, but on average the generated data follow a prescribed, fixed distribution.
Three commands of this type are runif(), rnorm() and rbinom(). Let's take a look at the help pages for these functions to examine the differences between them.
?runif
?rnorm
?rbinom
As you can see in the Help tab of the bottom-right window, all three commands generate random variables; however, the underlying distributions are different. The runif() command generates random variables from a uniform distribution, the rnorm() command from a normal distribution and the rbinom() command from a binomial distribution.
Let #x# denote a number randomly generated from the interval #[0, 10]# with equal probability. In R you can simulate #x# with the following command:
x <- runif(1, 0, 10)
Using the same command you can easily simulate #n# random numbers, creating a variable #X# that is a vector of length #n#.
Use the runif() command to generate a variable #X# with a vector of #n = 10000# random numbers in the interval #[10, 15]# and calculate the mean and variance.
X <- runif(10000, 10, 15)
mean(X)
var(X)
You will notice that the mean of the #10000# randomly generated numbers is close to #12.5# and thus close to the expected value of #X#.
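For comparison, here is a minimal check of these sample statistics against the theoretical values of a uniform distribution on #[a, b]#, which has mean #(a+b)/2# and variance #(b-a)^2/12#:
a <- 10; b <- 15
(a + b) / 2      # theoretical mean: 12.5
(b - a)^2 / 12   # theoretical variance: 25/12, about 2.08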
First think about how you expect the histogram of #X# to look, and subsequently create a histogram of #X#.
hist(X)
This command shows a uniform frequency distribution. Its shape is roughly a rectangle: approximately equal frequencies across the whole range.
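As a sketch (the number of breaks is an arbitrary choice), you can also draw the histogram on a density scale and compare it with the theoretical density of #1/5# on the interval #[10, 15]#:
hist(X, breaks = 50, freq = FALSE)  # density scale instead of counts
abline(h = 1/5, col = "red")        # theoretical density: 1/(15 - 10)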
Yet another way to explore the generated data is by creating an empirical cumulative probability plot. The empirical cumulative probability is the empirical equivalent of the cumulative distribution function: it uses the sampled points rather than the underlying distribution to calculate the probability that a new sample is smaller than or equal to a given value #x#. It is created with the ecdf() command. The following command creates such a plot (run it yourself).
plot(ecdf(runif(10,0,1)))
The command first generates #10# uniformly distributed random numbers in the interval #[0, 1]#. Next, the ecdf() function assigns a probability to each point and then adds up the probabilities, point by point, from the minimum value to the maximum value. This produces the cumulative probability for each point. In the last step the plot() function plots the cumulative probability on the y-axis against the generated numbers on the x-axis. You can see that every point in the data set is plotted and that a horizontal line extends outward from each point. Think of this line as a "step": each point is one step higher than the previous one. How much higher? By #1/N#, where #N# is the number of points in the sample. In this case every point is thus #1/10 = 0.1# higher.
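As a minimal sketch of what ecdf() does under the hood, you can reconstruct the steps by hand:
x <- runif(10, 0, 1)
x_sorted <- sort(x)                   # order the points from minimum to maximum
cum_prob <- (1:10) / 10               # each point adds a step of 1/N = 0.1
plot(x_sorted, cum_prob, type = "s")  # type = "s" draws a step function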
Now create such a figure in R for #10000# random numbers in the interval #[5, 20]#. Is the shape of the empirical cumulative probability function as you expected, and are e.g. the values at the y-axis for #x=7.5# and #x=12.5# in agreement with what you expected?
X <- runif(10000, 5, 20)
plot(ecdf(X))
The empirical cumulative probability at #x=12.5# (the median of the interval #[5, 20]#) gives a value of #0.5# on the y-axis, and at #x=7.5# a value of #2.5/15 \approx 0.17#.
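A quick way to read off these values: ecdf() returns a function that you can evaluate directly, and punif() gives the theoretical counterpart.
Fx <- ecdf(X)
Fx(7.5)              # empirical P(X <= 7.5), close to 1/6
Fx(12.5)             # empirical P(X <= 12.5), close to 0.5
punif(12.5, 5, 20)   # theoretical value: 0.5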
So far you have only generated random numbers from a uniform distribution; let's now draw samples from a normal and a binomial distribution. At the start of this practical, you saw in the help pages that the commands to do this are rnorm(n, mean, sd) and rbinom(n, size, prob). If you did not take a close look, please return to the respective help pages to read about the arguments.
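As a small illustration of the argument names (the parameter values here are chosen arbitrarily):
rnorm(5, mean = 0, sd = 3)         # 5 draws from a normal distribution
rbinom(5, size = 150, prob = 0.4)  # 5 draws from a binomial distribution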
Generate #10000# random numbers from a normal distribution with #mean = 0# and #sd = 3#, calculate the mean and variance, and plot the histogram and empirical cumulative probability plot. Then create a few empirical cumulative probability plots with a variable containing only #10# random numbers sampled from the same distribution. How does this compare with the plot of the variable containing #10000# numbers?
X <- rnorm(10000, 0, 3)
mean(X)
var(X)
hist(X)
plot(ecdf(X))
You can see in the histogram that the probability that #x# is close to the mean is now much larger than the probability that #x# is far from the mean. As expected, the empirical cumulative probability plot of #10000# random numbers sampled from #N(0, 3)# follows the cumulative distribution function of the same normal distribution very closely.
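To see this yourself, a sketch that overlays the theoretical cumulative distribution function using pnorm():
plot(ecdf(X))
curve(pnorm(x, mean = 0, sd = 3), add = TRUE, col = "red")  # theoretical CDF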
Now run the following lines a few times to generate empirical cumulative probability plots with variables of only #10# numbers.
X <- rnorm(10, 0, 3)
plot(ecdf(X))
The empirical cumulative distribution of only #10# numbers is much more variable and follows the cumulative distribution function less closely. Sampling only #10# numbers is not enough to capture the underlying distribution.
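A sketch that makes this variability visible by overlaying several #10#-number empirical curves on the theoretical one (five repetitions is an arbitrary choice):
plot(NULL, xlim = c(-9, 9), ylim = c(0, 1), xlab = "x", ylab = "cumulative probability")
for (i in 1:5) lines(ecdf(rnorm(10, 0, 3)), col = "grey")   # five small samples
curve(pnorm(x, mean = 0, sd = 3), add = TRUE, col = "red")  # theoretical CDF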
Generate #10000# random numbers from a binomial distribution with parameter values #size = 150# and #p = 0.4#. Calculate the mean and variance and plot the histogram and empirical cumulative probability plot.
X <- rbinom(10000, 150, 0.4)
mean(X)
var(X)
You see from the results that the mean is around #60# (this is no surprise because theoretically, the mean is #size \cdot p = 150 \cdot 0.4 = 60#) and the variance is around #36# (theoretically #size \cdot p \cdot (1-p) = 150 \cdot 0.4 \cdot 0.6 = 36#).
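A minimal check of these sample statistics against the theoretical values:
150 * 0.4         # theoretical mean: size * p = 60
150 * 0.4 * 0.6   # theoretical variance: size * p * (1 - p) = 36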
hist(X)
plot(ecdf(X))
You also see from the frequency histogram that the binomial distribution does not produce values lower than #0#. Furthermore it is slightly right-skewed (because #p < 0.5#, the tail extends a little further to the right than to the left). The empirical cumulative probability plot for this distribution also differs from those for the uniform and normal distributions you have seen before: it makes discrete jumps at every integer on the x-axis. The reason is that the binomial distribution is discrete; it can only take non-negative integers as values (#0#, #1#, #2#, etc.), so the cumulative probability #P(X \leq x)# increases in steps at each integer value of #x#.
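As a final sketch, you can verify one of these discrete jumps by evaluating the empirical curve at an integer and comparing it with the theoretical value from pbinom():
Fx <- ecdf(X)
Fx(60)                 # empirical P(X <= 60)
pbinom(60, 150, 0.4)   # theoretical P(X <= 60), slightly above 0.5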