9. Simple Linear Regression: Practical 9
Regression Line
It is inconvenient to find the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Fortunately, there is a function in R that can do this directly: lm() (note: lm stands for 'linear model').
Use the lm() function to find the line that minimizes the sum of squared residuals for the relation between wish_to_move and quality_of_life.
lm(wish_to_move ~ quality_of_life, data = amsterdam)
Let's explain the input to the lm() function.
The first argument is a formula that takes the form y ~ x. You should read this formula as 'the response variable y is a function of the predictor x'. So the R command above states that you want to make a linear model of 'wish_to_move' as a function of 'quality_of_life'.
The second argument specifies that R should look in the 'amsterdam' dataframe for the two variables ('wish_to_move' and 'quality_of_life').
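As a side note: because the data argument only tells R where to find the variables, you could fit the same line by referring to the columns directly with the $ operator (a minimal sketch, assuming the amsterdam data frame is loaded in your workspace):
# same fitted line, written without the data argument
lm(amsterdam$wish_to_move ~ amsterdam$quality_of_life)
In practice the data argument is preferred, because it keeps the formula short and readable.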
The output of the original command (with the data argument) is as follows.
Call:
lm(formula = wish_to_move ~ quality_of_life, data = amsterdam)

Coefficients:
    (Intercept)  quality_of_life
         77.995           -7.083
As you see, the output gives the coefficients of the regression equation. Based on this information, we can write down the equation for the regression line in its common form:
#wish\_to\_move = 77.995 - 7.083 \cdot quality\_of\_life#
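To see what this equation does, plug in a value for quality_of_life. The score of 7 below is just an illustrative number, not a value taken from the data:
# prediction by hand for a (made-up) quality-of-life score of 7
77.995 - 7.083 * 7   # gives about 28.4
The negative slope means that a higher quality-of-life score goes together with a lower predicted wish to move.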
Inference about the regression line
We can get much more information from the lm() command, though. For this we need to save its results as a separate object; later on, we can extract all sorts of information from this object. Here, we store the results in an object that we call m1.
m1 <- lm(wish_to_move ~ quality_of_life, data = amsterdam)
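A few standard extractor functions can already be applied to this saved object (a quick sketch; the full picture follows from summary() below):
coef(m1)        # the intercept and slope of the fitted line
fitted(m1)      # the predicted wish_to_move for every observation
residuals(m1)   # the residuals: observed minus predicted values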
We can subsequently get more elaborate information about the model through the summary() function.
summary(m1)
Call:
lm(formula = wish_to_move ~ quality_of_life, data = amsterdam)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.4151  -3.5820  -0.1444   4.4492  14.8346 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       77.995      7.802   9.997  < 2e-16 ***
quality_of_life   -7.083      1.059  -6.685 8.07e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.385 on 118 degrees of freedom
Multiple R-squared:  0.2747,  Adjusted R-squared:  0.2686 
F-statistic: 44.69 on 1 and 118 DF,  p-value: 8.073e-10
And that's quite a bit of information. Let's go over this output line-by-line.
Call:
lm(formula = wish_to_move ~ quality_of_life, data = amsterdam)
Similar to what you have seen with functions like t.test() and prop.test(), the output starts with a repetition of the function call. This is not very interesting when you have just typed the command yourself, but quite crucial if the result is saved in a report and read back later.
The next part of the output gives a summary of the model residuals.
Residuals:
     Min       1Q   Median       3Q      Max 
-15.4151  -3.5820  -0.1444   4.4492  14.8346 
This gives a five-number summary of the residuals. The average of the residuals is zero by definition, so the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
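You can check this directly by summarizing the residuals of the saved model yourself:
summary(residuals(m1))   # five-number summary plus the mean, which is essentially zero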
Next, a table with information on the regression coefficients is given:
Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       77.995      7.802   9.997  < 2e-16 ***
quality_of_life   -7.083      1.059  -6.685 8.07e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The column at the left lists the components of the regression equation: (Intercept) and the name of the predictor variable. In the second column ('Estimate'), the values for the intercept and the predictor coefficient are given.
Columns three to five give information to conduct hypothesis tests on the coefficient values: standard errors for the estimated coefficients, t values, and p-values (= Pr(>|t|) ).
The standard error describes the variability of the estimated coefficient (i.e. how much variation you would expect in the coefficient value if you were to fit the regression line again on a new sample). The t value is the ratio Estimate/(standard error), and the p-value gives the probability of finding a t value at least this extreme if the true coefficient were zero. The stars at the right provide a qualitative indication of the size of the p-values and add no further information.
Each row in this coefficient table in fact presents a hypothesis test. The null hypothesis in these tests is that the coefficient value equals zero and the alternative is that it differs from zero. For the intercept coefficient, this hypothesis test is rarely interesting, but the hypothesis test for the predictor coefficient (the slope of the regression line) is! If the null hypothesis were true, you would expect a horizontal regression line (a slope of zero, i.e. no relation between predictor and response), and if the null hypothesis is rejected, you conclude that there is a relation between predictor and response variable.
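As a check, the t value and p-value for quality_of_life can be reproduced from the rounded numbers in the table, using the 118 degrees of freedom that summary() reports further down:
estimate  <- -7.083
std_error <- 1.059
t_value   <- estimate / std_error   # about -6.69, matching the table up to rounding
2 * pt(-abs(t_value), df = 118)     # two-sided p-value, about 8e-10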
The last three lines of the output provide a number of statistics on overall model performance. We will go over these, one by one.
Residual standard error: 6.385 on 118 degrees of freedom
The residual standard error (RSE) describes the accuracy of the model. Roughly speaking, it is the typical amount by which the individual values of the response variable deviate from the regression line. It relates to the residual sum of squares (RSS) that we calculated earlier in this practical through the following equation:
#RSE = \sqrt{\frac{RSS}{n-2}}#.
Here #n# is the number of observations that were used for the regression and #n-2# is the degrees of freedom: the number of observations minus the two estimated coefficients (the intercept and the slope).
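The reported RSE can be reproduced from the residuals of the saved model (a small sketch; sigma() extracts the same value directly):
rss <- sum(residuals(m1)^2)    # residual sum of squares
n   <- length(residuals(m1))   # number of observations
sqrt(rss / (n - 2))            # residual standard error from the formula above, about 6.385
sigma(m1)                      # the same value, extracted directly from the model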
Multiple R-squared: 0.2747, Adjusted R-squared: 0.2686
The Multiple R-squared, or more simply #R^2#, represents the proportion of variability in the response variable that is explained by the predictor. For this model, #27.47\%# of the variability in 'wish to move' is explained by 'quality of life'. We will ignore the Adjusted R-squared here - it is a value that is slightly lower than the multiple R-squared.
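The #R^2# can be computed from the residual and total sums of squares, or extracted from the summary object (a small sketch):
tss <- sum((amsterdam$wish_to_move - mean(amsterdam$wish_to_move))^2)   # total sum of squares
rss <- sum(residuals(m1)^2)                                             # residual sum of squares
1 - rss / tss            # proportion of variability explained, about 0.2747
summary(m1)$r.squared    # the same value, taken from the summary object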
Lastly,
F-statistic: 44.69 on 1 and 118 DF, p-value: 8.073e-10
This is an F test for the hypothesis that the regression equation as a whole explains any variance in Y (it tests the null hypothesis that #R^2 = 0#). This last p-value is the same as the p-value for the variable quality_of_life. That is always the case in linear regression with only one predictor variable, so here the last line does not contain any extra information. Regression can, however, be extended with more predictor variables, and then this overall hypothesis test becomes different from the tests on the individual predictor coefficients.
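With a single predictor, the F statistic is simply the square of the t value for that predictor, which you can verify from the output above (a small sketch):
(-6.685)^2                    # about 44.69, the F statistic reported above
f <- summary(m1)$fstatistic   # named vector with the F value and its degrees of freedom
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)   # p-value of the F test, about 8.07e-10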