9. Simple Linear Regression
Assessing the Quality of a Regression Model
The quality of a regression model depends on the accuracy of the predictions the model makes. A well-fitting model yields accurate predictions of the outcome variable, whereas a poorly fitting model yields inaccurate ones.
To assess the predictive power of a regression model, we divide the total variation in the outcome variable #Y# into two parts:
- The amount of variance in #Y# that can be explained by the regression model
- The amount of variance in #Y# that cannot be explained by the regression model
The greater the amount of the variance in the outcome variable #Y# we are able to explain with the help of the regression model, the greater the predictive power of the model will be.
#\phantom{0}#
Division of Variation
A key property of linear regression models is that they divide the total amount of variation in the outcome variable #Y# into two parts: the variation that can be explained by the regression model and that which cannot.
Three Sums of Squares
To get a measure of the total variation in the outcome variable #Y#, we calculate the total sum of squares, denoted #SS_{total}#. This measure represents all the variation in #Y# that could possibly be explained by our regression model and is calculated with the following formula:
\[SS_{total}=\sum_{i=1}^{n} (Y_i - \bar{Y})^2\]
To get a measure of the amount of variation in the outcome variable #Y# that can be explained by the regression model, we calculate the model sum of squares, denoted #SS_{model}#. To calculate this measure, use the following formula:
\[SS_{model}=\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2\]
To get a measure of the amount of variation in the outcome variable #Y# that cannot be explained by the regression model, we calculate the residual sum of squares, denoted #SS_{residual}#. To calculate this measure, use the following formula:
\[SS_{residual}=\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2=\sum_{i=1}^{n} (e_i)^2\]
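The three sums of squares can be computed directly from data. The sketch below, assuming NumPy is available and using made-up example values for #X# and #Y#, fits a least-squares line and evaluates each formula:

```python
import numpy as np

# Hypothetical example data: predictor X and outcome Y (values are made up)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([52.0, 58.0, 61.0, 68.0, 70.0, 79.0])

# Fit the simple linear regression Y = b0 + b1*X by least squares
b1, b0 = np.polyfit(X, Y, 1)
Y_hat = b0 + b1 * X          # predicted values
Y_bar = Y.mean()             # mean of the outcome variable

SS_total = np.sum((Y - Y_bar) ** 2)       # total variation in Y
SS_model = np.sum((Y_hat - Y_bar) ** 2)   # variation explained by the model
SS_residual = np.sum((Y - Y_hat) ** 2)    # variation left unexplained

print(SS_total, SS_model, SS_residual)
```

For a least-squares fit, the decomposition is exact: #SS_{total} = SS_{model} + SS_{residual}#, which the printed values confirm up to rounding.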
#\phantom{0}#
Once we have divided the total variation in the outcome variable #Y# into the part that can be explained by the regression model and that which cannot, we can calculate the coefficient of determination to get a single measure of the predictive power of the regression model.
#\phantom{0}#
Coefficient of Determination
The coefficient of determination, denoted #R^2#, is the proportion of the total variation in the outcome variable #Y# that can be explained by the regression model.
There are two ways to calculate the coefficient of determination:
\[R^2=\cfrac{SS_{model}}{SS_{total}}\,\,\,\,\,\,\,\,\,\, \text{or} \,\,\,\,\,\,\,\,\,\, R^2=1-\cfrac{SS_{residual}}{SS_{total}} \]
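Both formulas give the same value, as a quick check shows. The sketch below, again assuming NumPy and made-up example data, computes #R^2# both ways and also verifies a useful fact: in simple linear regression, #R^2# equals the squared Pearson correlation between #X# and #Y#:

```python
import numpy as np

# Hypothetical example data: predictor X and outcome Y (values are made up)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([52.0, 58.0, 61.0, 68.0, 70.0, 79.0])

b1, b0 = np.polyfit(X, Y, 1)   # least-squares fit Y = b0 + b1*X
Y_hat = b0 + b1 * X
Y_bar = Y.mean()

SS_total = np.sum((Y - Y_bar) ** 2)
SS_model = np.sum((Y_hat - Y_bar) ** 2)
SS_residual = np.sum((Y - Y_hat) ** 2)

# The two formulas for the coefficient of determination
R2_from_model = SS_model / SS_total
R2_from_residual = 1 - SS_residual / SS_total

# In simple linear regression, R^2 also equals the squared
# Pearson correlation between X and Y
r = np.corrcoef(X, Y)[0, 1]
print(R2_from_model, R2_from_residual, r ** 2)
```

The three printed numbers agree up to floating-point rounding.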
Interpreting the Coefficient of Determination
The coefficient of determination #R^2# always takes on a value between #0# and #1#:
- An #R^2# of #0# indicates that none of the variation in the outcome variable #Y# can be explained by the regression model.
- An #R^2# of #1# indicates that the variation in the outcome variable #Y# can be perfectly explained by the regression model.
- An #R^2# between #0# and #1# indicates that some of the variation in the outcome variable #Y# can be explained by the regression model.
If, for example, we find a coefficient of determination of #R^2=0.72#, this means that #72\%# of the total variation in the outcome variable #Y# can be explained by the regression model.