9. Simple Linear Regression: Practical 9
Prediction and Prediction Errors
Let’s return to the relation between quality_of_life and wish_to_move. The following commands create a scatterplot of the relation and then draw the least squares line on top.
plot(amsterdam$quality_of_life, amsterdam$wish_to_move)
abline(m1, col = 'red')
The function abline() plots a line based on the intercept and slope of the linear regression model stored in m1. It is therefore essential that m1 still contains the output of lm(wish_to_move ~ quality_of_life, data = amsterdam) and has not been overwritten by another model. Instead of m1, abline() can also take the intercept and slope as input arguments:
abline(a=78 ,b=-7.08, col = 'green')
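Rather than typing rounded coefficients by hand, the intercept and slope can be read from the model object with coef(). The sketch below assumes simulated stand-in data, because the amsterdam data frame is not reproduced here; the names amsterdam_sim and m_sim are hypothetical.

```r
# Hypothetical stand-in for the amsterdam data (simulated, not the real values)
set.seed(1)
amsterdam_sim <- data.frame(quality_of_life = runif(50, 5, 9))
amsterdam_sim$wish_to_move <- 78 - 7 * amsterdam_sim$quality_of_life + rnorm(50, sd = 3)

m_sim <- lm(wish_to_move ~ quality_of_life, data = amsterdam_sim)

# coef() returns a named vector: the intercept first, then the slope
cf <- coef(m_sim)
intercept <- unname(cf["(Intercept)"])
slope <- unname(cf["quality_of_life"])

plot(amsterdam_sim$quality_of_life, amsterdam_sim$wish_to_move)
abline(m_sim, col = "red")                                # line taken from the model object
abline(a = intercept, b = slope, col = "green", lty = 2)  # same line, built from a and b
```

Both calls draw the same line; the dashed green one simply lies on top of the red one.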
Once the regression model has been fitted, it can be used to predict y at any value of x. When predictions are made for values of x that lie beyond the range of the observed x-data, this is referred to as extrapolation. We should be careful with such predictions because the statistical model may simply not be valid beyond the range of the available x-data. Predictions made within the range of the data are more reliable.
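One way to guard against accidental extrapolation is to compare the new predictor values with the range of the observed ones before predicting. This is a minimal sketch with made-up numbers; quality_obs and new_quality are hypothetical names, not objects from the practical.

```r
# Hypothetical observed predictor values (simulated, all between 5 and 9)
set.seed(1)
quality_obs <- runif(50, 5, 9)

# Hypothetical values at which we would like to predict
new_quality <- c(4.0, 6.0, 7.5, 9.8)

# A value is an extrapolation if it falls outside the observed range
obs_range <- range(quality_obs)
is_extrapolation <- new_quality < obs_range[1] | new_quality > obs_range[2]

data.frame(quality_of_life = new_quality, extrapolation = is_extrapolation)
```

Here 4.0 and 9.8 are flagged because the simulated observations all lie between 5 and 9.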
Of course, if you want to predict the wish to move percentage for many neighbourhoods, it would be much handier if there were a function in R that can do this in one step. Fortunately, there is! We can use the predict() function for this task. In addition to the prediction of the response variable, this function can also calculate the uncertainty of this prediction. This uncertainty is expressed through a confidence interval and is interpreted in exactly the same way as the confidence interval for a proportion or a mean. For a 95% confidence interval, for example, you expect the true regression line to lie within the computed interval in 95 out of 100 cases if you were to sample from the population and fit the linear model repeatedly.
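The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below assumes a hypothetical population line (intercept 78, slope -7) and hypothetical noise; it is not based on the amsterdam data. It counts how often the 95% confidence interval at one predictor value contains the true mean response.

```r
set.seed(42)
true_intercept <- 78   # hypothetical population intercept
true_slope <- -7       # hypothetical population slope
x0 <- 6                # predictor value at which the interval is checked
true_mean <- true_intercept + true_slope * x0

n_rep <- 1000
covered <- logical(n_rep)
for (i in seq_len(n_rep)) {
  # Draw a fresh sample from the population and refit the model
  x <- runif(40, 5, 9)
  y <- true_intercept + true_slope * x + rnorm(40, sd = 4)
  fit <- lm(y ~ x)
  ci <- predict(fit, newdata = data.frame(x = x0),
                interval = "confidence", level = 0.95)
  # Does this sample's interval contain the true mean response at x0?
  covered[i] <- ci[1, "lwr"] <= true_mean && true_mean <= ci[1, "upr"]
}

mean(covered)  # proportion of intervals that cover the truth; expected to be near 0.95
```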
Let's see how this works.
The predict() function.
wish_to_move = 77.995 − 7.083 ⋅ quality_of_life
1) Save all the values you want to predict in a data frame
data <- data.frame(quality_of_life=c(5.5, 6.0, 7.0))
data
2) Predict the wish to move percentage
predict(m1, newdata = data, interval ="confidence")
The first line creates a data frame with three new values for quality_of_life (our predictor): 5.5, 6.0 and 7.0. We need this information in the predict() function: via the input argument newdata you specify the values of the predictor variable for which a prediction of the response variable is desired.
If we study the output from the predict() function, it appears that the confidence interval is not equally wide for each value of quality_of_life: the interval for the value furthest from the mean of quality_of_life is much wider than the intervals for the other two values. This is a general property of confidence intervals around a regression line: the further away from the mean of the predictor variable, the wider the interval.
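The widening follows from the standard error of the fitted mean, which is proportional to sqrt(1/n + (x0 - mean(x))^2 / Sxx), where Sxx is the sum of squared deviations of the predictor from its mean. The sketch below shows the effect on simulated stand-in data; amsterdam_sim and m_sim are hypothetical names and the numbers are not the real amsterdam values.

```r
# Hypothetical stand-in for the amsterdam data (simulated)
set.seed(1)
amsterdam_sim <- data.frame(quality_of_life = runif(50, 5, 9))
amsterdam_sim$wish_to_move <- 78 - 7 * amsterdam_sim$quality_of_life + rnorm(50, sd = 3)
m_sim <- lm(wish_to_move ~ quality_of_life, data = amsterdam_sim)

# Predict at the mean of the predictor and at 1 and 2 units on either side
x_bar <- mean(amsterdam_sim$quality_of_life)
new_values <- data.frame(quality_of_life = x_bar + c(-2, -1, 0, 1, 2))

ci <- predict(m_sim, newdata = new_values, interval = "confidence")
width <- ci[, "upr"] - ci[, "lwr"]

# The interval is narrowest at the mean and widens symmetrically away from it
data.frame(distance_from_mean = c(-2, -1, 0, 1, 2), width = width)
```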
Let's now try to predict the wish to move percentage for new neighbourhoods. These data are stored in the data frame amsterdam_test.
Inspect the new data and check specifically if the data contains a column "quality_of_life". This column is necessary to predict the wish to move percentage with the linear regression model you made in the previous steps.
str(amsterdam_test)
summary(amsterdam_test)
Now you are ready to predict the wish_to_move percentage for these neighbourhoods. You can simply use the predict() function as you did before.
Which neighbourhood (number) has the lowest wish_to_move percentage?
You can use the predict() function to predict the percentage for every neighbourhood. Simply use the complete amsterdam_test data frame as input for the newdata argument.
wish_to_move_m1 <- predict(m1, newdata = amsterdam_test)
wish_to_move_m1
Sort the percentages to find the neighbourhood with the lowest wish to move percentage:
WtM_m1 <- sort(wish_to_move_m1, decreasing = FALSE)
WtM_m1[1]
We can plot the data to get some more insight into the prediction:
plot(amsterdam$quality_of_life, amsterdam$wish_to_move)
abline(m1, col = 'red')
points(amsterdam_test$quality_of_life, wish_to_move_m1, col='blue', pch=16)
The plot shows the fitted line from the first dataset and linear model and adds the predicted points from the above code in blue. As you see, the predictions of the wish to move percentage of the new neighbourhoods are exactly on the regression line.
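The predictions fall on the line because predict() simply evaluates intercept + slope * quality_of_life for each new value. A minimal check on simulated stand-in data follows; amsterdam_sim, test_sim and m_sim are hypothetical names, not the objects from the practical. The last line shows which.min() as an alternative to sorting when only the lowest prediction is needed.

```r
# Hypothetical stand-ins for amsterdam and amsterdam_test (simulated)
set.seed(1)
amsterdam_sim <- data.frame(quality_of_life = runif(50, 5, 9))
amsterdam_sim$wish_to_move <- 78 - 7 * amsterdam_sim$quality_of_life + rnorm(50, sd = 3)
m_sim <- lm(wish_to_move ~ quality_of_life, data = amsterdam_sim)

test_sim <- data.frame(quality_of_life = c(6.2, 7.9, 5.4, 8.3))
pred <- predict(m_sim, newdata = test_sim)

# Recompute the predictions by hand from the fitted coefficients
by_hand <- coef(m_sim)[1] + coef(m_sim)[2] * test_sim$quality_of_life
all.equal(unname(pred), unname(by_hand))  # TRUE: predictions lie on the fitted line

# Row of the new data with the lowest predicted value
which.min(pred)
```

With a negative slope, the lowest prediction belongs to the row with the highest quality_of_life.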
And a final note about the predict() function: if you do not specify newdata, predictions are given for the data values that were used to fit the model.
predict(m1, interval ="confidence")
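Without newdata, the point predictions are the fitted values of the model, so they match what fitted() returns. The sketch below checks this on simulated stand-in data; amsterdam_sim and m_sim are hypothetical names.

```r
# Hypothetical stand-in for the amsterdam data (simulated)
set.seed(1)
amsterdam_sim <- data.frame(quality_of_life = runif(50, 5, 9))
amsterdam_sim$wish_to_move <- 78 - 7 * amsterdam_sim$quality_of_life + rnorm(50, sd = 3)
m_sim <- lm(wish_to_move ~ quality_of_life, data = amsterdam_sim)

# Predictions for the rows used to fit the model
pred_default <- predict(m_sim)
all.equal(unname(pred_default), unname(fitted(m_sim)))  # TRUE

# With interval = "confidence" there is one row (fit, lwr, upr) per observation
ci_default <- predict(m_sim, interval = "confidence")
dim(ci_default)
```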