Prediction and Prediction Errors

9. Simple Linear Regression: Practical 9


Let’s return to the relation between quality_of_life and wish_to_move. The following commands create a scatterplot of the relation and then draw the least squares line on top.

plot(amsterdam$quality_of_life, amsterdam$wish_to_move)
abline(m1, col = 'red')

The function abline() draws a line using the intercept and slope of the linear regression model stored in m1. This only works if m1 still contains the output from lm(wish_to_move ~ quality_of_life, data = amsterdam) and has not been overwritten by another model. Instead of m1, abline() can also take the intercept and slope directly as input arguments:

abline(a = 78, b = -7.08, col = 'green')
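Rather than typing rounded numbers by hand, the intercept and slope can also be read from the fitted model with coef(). The following is a minimal sketch on a small made-up dataset; the names toy and fit are hypothetical stand-ins for amsterdam and m1, and the simulated values have nothing to do with the real data.

```r
# Synthetic stand-in for the amsterdam data (values are made up for illustration)
set.seed(1)
toy <- data.frame(quality_of_life = runif(50, min = 5, max = 9))
toy$wish_to_move <- 78 - 7 * toy$quality_of_life + rnorm(50, sd = 3)

fit <- lm(wish_to_move ~ quality_of_life, data = toy)

# coef() returns a named vector: the intercept first, then the slope
b <- coef(fit)
intercept <- unname(b["(Intercept)"])
slope     <- unname(b["quality_of_life"])

plot(toy$quality_of_life, toy$wish_to_move)
abline(a = intercept, b = slope, col = 'green')  # draws the same line as abline(fit)
```

Using the stored coefficients avoids the small rounding differences that appear when the numbers are copied by hand.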

Once the regression model has been fitted, it can be used to predict $y$ at any value of $x$ . When predictions are made for values of $x$ that are beyond the range of the observed $x$ -data, it is referred to as extrapolation. We should be careful with such predictions because the statistical model may simply not be valid beyond the range of available $x$ -data. However, predictions made within the range of the data are more reliable.
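Whether a prediction would be an extrapolation can be checked before predicting. The sketch below compares new predictor values with the range of the observed ones; x_observed and x_new are hypothetical vectors with made-up values, not taken from the amsterdam data.

```r
# Hypothetical observed predictor values (made up for illustration)
x_observed <- c(5.8, 6.4, 6.9, 7.1, 7.6, 8.0, 8.3)

# Values at which we would like to predict
x_new <- c(4.0, 6.5, 7.5, 9.5)

# TRUE where a new value lies outside the observed range, i.e. extrapolation
x_range <- range(x_observed)
is_extrapolation <- x_new < x_range[1] | x_new > x_range[2]

data.frame(x_new, is_extrapolation)
```

With the real data, range(amsterdam$quality_of_life) would take the place of range(x_observed).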

What wish-to-move percentage would you predict for a neighbourhood whose inhabitants rate the quality of life at $7.5$?

$wish\_to\_move = 77.995 - 7.083 \cdot quality\_of\_life$

$24.9\%$

$wish\_to\_move = 77.995 -7.083 \cdot 7.5 = 24.9$
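The same arithmetic can be done in R. This short sketch uses the rounded coefficients reported above; the variable names intercept, slope and prediction are chosen here for illustration only.

```r
# Coefficients as reported in the text
intercept <- 77.995
slope     <- -7.083

# Predicted wish_to_move for a quality_of_life score of 7.5
prediction <- intercept + slope * 7.5
round(prediction, 1)
```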


Of course, if you want to predict the wish-to-move percentage for many neighbourhoods, it would be much handier to have a function in R that does this in one step. Fortunately, there is one: the predict() function. In addition to predicting the response variable, this function can also calculate the uncertainty of the prediction. This uncertainty is expressed through a confidence interval, which is interpreted in the same way as the confidence interval for a proportion or a mean. For a $95\%$ confidence interval, for example, if you were to sample from the population and fit the linear model repeatedly, you would expect the interval to contain the true regression line at that value of $x$ in about $95$ out of $100$ cases.

Let's see how this works.

Predict the wish_to_move percentage for a quality_of_life of $5.5$, $6.0$, and $7.0$. Use the predict() function.

$wish\_to\_move = 77.995 - 7.083 \cdot quality\_of\_life$

The wish to move is $39.0\%$, $35.5\%$ and $28.4\%$, respectively.

1) Save the predictor values for which you want a prediction in a dataframe

data <- data.frame(quality_of_life=c(5.5, 6.0, 7.0))
data

2) Predict the wish to move percentage

predict(m1, newdata = data, interval = "confidence")

The first line creates a dataframe with three new values for 'quality_of_life' (our predictor): $5.5$, $6.0$ and $7.0$. The predict() function needs this information: through the input argument newdata you specify the values of the predictor variable for which a prediction of the response variable is desired. The column name in this dataframe must match the name of the predictor in the model formula.
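The repeated-sampling interpretation of the confidence interval can be checked with a small simulation. The sketch below assumes a made-up "true" population line and noise level (not estimated from the amsterdam data), refits the model on many fresh samples, and counts how often the interval returned by predict() contains the true mean response at one chosen point.

```r
set.seed(42)

# Assumed "true" population line, made up for this simulation
true_intercept  <- 78
true_slope      <- -7
x0              <- 7.5   # point at which the interval is evaluated
true_mean_at_x0 <- true_intercept + true_slope * x0

n_rep   <- 1000
covered <- logical(n_rep)

for (i in seq_len(n_rep)) {
  # Draw a fresh sample from the population and refit the model
  x <- runif(60, min = 5, max = 9)
  y <- true_intercept + true_slope * x + rnorm(60, sd = 4)
  fit <- lm(y ~ x)

  # 95% confidence interval for the mean response at x0
  ci <- predict(fit, newdata = data.frame(x = x0), interval = "confidence")
  covered[i] <- ci[1, "lwr"] <= true_mean_at_x0 && true_mean_at_x0 <= ci[1, "upr"]
}

# Proportion of intervals that contain the true mean response
mean(covered)
```

The resulting proportion should be close to $0.95$, although it varies slightly with the random seed and the number of repetitions.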


If we study the output from the predict() function, we see that the confidence interval is not equally wide for each value of quality_of_life. The confidence interval for $10$ is, for instance, much wider than those for $6.5$ and $7.5$. This is a general property of confidence intervals around a regression line: the further a value lies from the mean of the predictor variable, the wider the interval.
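The widening follows from the standard formula for the confidence interval of the mean response at a value $x_0$ in simple linear regression. The notation is introduced here and does not appear elsewhere in this practical: $\hat{y}_0$ is the predicted value at $x_0$, $n$ the number of observations, $\bar{x}$ the mean of the observed predictor values, $s$ the residual standard error, and $t_{n-2,\,0.975}$ the quantile of the $t$-distribution for a $95\%$ interval.

$$\hat{y}_0 \pm t_{n-2,\,0.975} \cdot s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

The term $(x_0 - \bar{x})^2$ is zero at the mean of the predictor and grows as $x_0$ moves away from it, which is why the interval is narrowest at $\bar{x}$.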

Let's now try to predict the wish to move percentage for $33$ new neighbourhoods. This data is stored in the data frame amsterdam_test.

Inspect the new data and check specifically if the data contains a column "quality_of_life". This column is necessary to predict the wish to move percentage with the linear regression model you made in the previous steps.

str(amsterdam_test)
summary(amsterdam_test)
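The presence of the predictor column can also be tested directly. The sketch below uses a hypothetical dataframe new_data with made-up values as a stand-in for amsterdam_test; the neighbourhood column is an assumption for illustration.

```r
# Hypothetical stand-in for amsterdam_test (values are made up)
new_data <- data.frame(
  neighbourhood   = c(101, 102, 103),
  quality_of_life = c(6.2, 7.4, 8.1)
)

# The predictor name must match the one used in the model formula
"quality_of_life" %in% names(new_data)

# Missing predictor values lead to NA predictions, so count them as well
sum(is.na(new_data$quality_of_life))
```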

Now you are ready to predict the wish_to_move percentage for these neighbourhoods. You can simply use the predict() function as you did before.

Predict the wish_to_move percentage for the $33$ neighbourhoods in the dataframe amsterdam_test.

Which neighbourhood (number) has the lowest wish_to_move percentage?

Neighbourhood number $3969$ has the lowest predicted wish-to-move percentage, at $20.6\%$.

You can use the predict() function to predict the percentage for every neighbourhood. Simply use the complete amsterdam_test dataframe as input for the newdata argument.

wish_to_move_m1 <- predict(m1, newdata = amsterdam_test) 
wish_to_move_m1

Sort the percentages in increasing order to find the neighbourhood with the lowest wish-to-move percentage:

WtM_m1 <- sort(wish_to_move_m1, decreasing = FALSE)
WtM_m1[1]
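As an alternative to sorting, which.min() returns the position of the smallest value directly. The sketch below uses a made-up named vector called predicted, and assumes that the names of the prediction vector identify the neighbourhoods, as the row labels of the new data would when passed through predict().

```r
# Made-up predictions, named by row label as predict() would name them
predicted <- c("3001" = 31.2, "3002" = 24.7, "3003" = 28.9)

lowest <- which.min(predicted)  # position of the smallest value, keeping its name
names(lowest)                   # row label of that neighbourhood
predicted[lowest]               # the corresponding predicted percentage
```

With the real predictions, which.min(wish_to_move_m1) would play the same role.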

We can plot the data to get some more insight into the predictions:

plot(amsterdam$quality_of_life, amsterdam$wish_to_move)
abline(m1, col = 'red')
points(amsterdam_test$quality_of_life, wish_to_move_m1, col='blue', pch=16)

The plot shows the line fitted to the first dataset, with the predictions for the new neighbourhoods added in blue. As you can see, these predictions lie exactly on the regression line, because predict() computes them from the fitted intercept and slope.
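That the predictions lie on the line can be confirmed numerically by comparing predict() with the intercept-plus-slope calculation done by hand. This is a minimal sketch on synthetic data; toy, toy_test and fit are hypothetical names standing in for amsterdam, amsterdam_test and m1.

```r
# Synthetic training data (values are made up for illustration)
set.seed(7)
toy <- data.frame(quality_of_life = runif(40, min = 5, max = 9))
toy$wish_to_move <- 78 - 7 * toy$quality_of_life + rnorm(40, sd = 3)
fit <- lm(wish_to_move ~ quality_of_life, data = toy)

# New predictor values for which predictions are wanted
toy_test <- data.frame(quality_of_life = c(5.5, 6.0, 7.0))

from_predict <- predict(fit, newdata = toy_test)
by_hand      <- coef(fit)[1] + coef(fit)[2] * toy_test$quality_of_life

# TRUE when both approaches give the same values
all.equal(unname(from_predict), unname(by_hand))
```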


And a final note about the predict() function: if you do not specify newdata, predictions are given for the data values that were used to fit the model.

predict(m1, interval = "confidence")
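Without newdata and without an interval, predict() returns the same values as fitted(). The sketch below illustrates this on synthetic data; toy and fit are hypothetical stand-ins for amsterdam and m1.

```r
# Synthetic data (values are made up for illustration)
set.seed(3)
toy <- data.frame(quality_of_life = runif(30, min = 5, max = 9))
toy$wish_to_move <- 78 - 7 * toy$quality_of_life + rnorm(30, sd = 3)
fit <- lm(wish_to_move ~ quality_of_life, data = toy)

# Predictions for the data used to fit the model equal the fitted values
all.equal(predict(fit), fitted(fit))

# One prediction per observation in the original data
length(predict(fit)) == nrow(toy)
```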