9. Simple Linear Regression: Practical 9
Sum of Squared Residuals
Imagine that you want to check whether the wish to move can be predicted from the quality of life. Wish to move is then the response variable and quality of life the predictor. Let's first visually inspect the relationship with a scatterplot.
plot(amsterdam$quality_of_life, amsterdam$wish_to_move)
Just as you can use the mean and standard deviation to summarize a single variable, you can summarize the relationship between these two variables by finding the line that best follows their association. Use the following interactive function to select the line that you think does the best job of going through the cloud of points.
plot_ss(amsterdam$quality_of_life, amsterdam$wish_to_move)
After running this command, you’ll be prompted to click two points on the plot to define a line. Once you’ve done that, the line you specified will be shown in black and the difference between each point and the line in blue. The line specifies the predicted percentage of people that wish to move given the quality of life. The difference between each observed point and the predicted value is the residual (or error). The residual represents variation in your data that remains unexplained by the regression model.
The most common way to perform linear regression is to select the line that minimizes the sum of squared residuals. The distance from each point to the line should be as small as possible for the best fit. To accomplish this, the distance from each point to the line is squared and then these values are summed.
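The calculation above can be sketched directly in R. This is a minimal illustration with an invented toy data set and a hypothetical candidate line (intercept `b0`, slope `b1`); in the practical you would use the columns of the `amsterdam` data frame instead.

```r
# Toy data standing in for quality of life and wish to move (invented values)
quality_of_life <- c(5, 6, 7, 8, 9)
wish_to_move    <- c(40, 35, 28, 20, 15)

b0 <- 72    # hypothetical intercept of a candidate line
b1 <- -6.4  # hypothetical slope of a candidate line

predicted <- b0 + b1 * quality_of_life  # predicted value for each point
residuals <- wish_to_move - predicted   # observed minus predicted
ssr <- sum(residuals^2)                 # sum of squared residuals
ssr
```

Trying different values of `b0` and `b1` and comparing the resulting sums of squares mimics what you do interactively with `plot_ss`: the best-fitting line is the one with the smallest value of `ssr`.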
Note that the function plot_ss not only produces a graph, but also prints the coefficients of your line and the sum of squares in the R console.
Mathematically, a line is given by the equation
#y = \beta_0 + \beta_1 \cdot x#
#\beta_0# is called the intercept. This is the value of #y# when #x = 0# (or the point where the regression line crosses the Y-axis). In our case it indicates the percentage of people that wish to move if the quality of life is #0#.
#\beta_1# is called the slope. The slope indicates the change in #y# for a one-unit change in #x#. Notice that the slope in our example is negative; this means that the relation between quality of life and the wish to move is negative: the higher the quality of life, the lower the wish to move. This is of course what you would expect.
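Rather than searching for the best line by eye, R's built-in lm() function computes the least-squares intercept and slope directly. This is a minimal sketch with invented toy data; with the practical's data you would write lm(wish_to_move ~ quality_of_life, data = amsterdam).

```r
# Toy data (invented values) with a clearly negative association
quality_of_life <- c(5, 6, 7, 8, 9)
wish_to_move    <- c(40, 35, 28, 20, 15)

# Fit the least-squares line: response ~ predictor
fit <- lm(wish_to_move ~ quality_of_life)
coef(fit)  # first value is the intercept b0, second the slope b1
```

The slope returned by coef(fit) is negative here, matching the interpretation above: as quality of life goes up, the predicted wish to move goes down.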
At this point, it is important to note that linear regression is a statistical method for building a model in which one or more predictor variables explain a response variable (in this course we only study the situation with one predictor variable). Even though a causal relation between predictor and response may be plausible, the statistical method itself only gives evidence of correlation, not of a causal relationship.