9. Simple Linear Regression
Assessing the Quality of a Regression Model
The quality of a regression model depends on the accuracy of the predictions the model makes. A well-fitting model yields accurate predictions of the outcome variable, whereas a poorly fitting model yields inaccurate ones.
To assess the predictive power of a regression model, we divide the total variation in the outcome variable #Y# into two parts:
- The amount of variance in #Y# that can be explained by the regression model
- The amount of variance in #Y# that cannot be explained by the regression model
The greater the amount of the variance in the outcome variable #Y# we are able to explain with the help of the regression model, the greater the predictive power of the model will be.
#\phantom{0}#
Division of Variation
A key property of linear regression models is that they divide the total amount of variation in the outcome variable #Y# into two parts: the variation that can be explained by the regression model and that which cannot.
Three Sums of Squares
To get a measure of the total variation in the outcome variable #Y#, we calculate the total sum of squares, denoted #SS_{total}#. This measure represents all the variation in #Y# that could possibly be explained by our regression model and is calculated with the following formula:
\[SS_{total}=\sum_{i=1}^{n} (Y_i - \bar{Y})^2\]
To get a measure of the amount of variation in the outcome variable #Y# that can be explained by the regression model, we calculate the model sum of squares, denoted #SS_{model}#. To calculate this measure, use the following formula:
\[SS_{model}=\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2\]
To get a measure of the amount of variation in the outcome variable #Y# that cannot be explained by the regression model, we calculate the residual sum of squares, denoted #SS_{residual}#. To calculate this measure, use the following formula:
\[SS_{residual}=\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2=\sum_{i=1}^{n} (e_i)^2\]
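The three sums of squares can be computed directly from data. The sketch below, assuming NumPy is available and using made-up example values for #X# and #Y#, fits a least-squares line and evaluates each formula:

```python
import numpy as np

# Hypothetical example data: predictor X and outcome Y (values are made up)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([52.0, 58.0, 61.0, 68.0, 70.0, 79.0])

# Fit the simple linear regression Y = b0 + b1*X by least squares
b1, b0 = np.polyfit(X, Y, 1)
Y_hat = b0 + b1 * X          # predicted values
Y_bar = Y.mean()             # mean of the outcome variable

SS_total = np.sum((Y - Y_bar) ** 2)       # total variation in Y
SS_model = np.sum((Y_hat - Y_bar) ** 2)   # variation explained by the model
SS_residual = np.sum((Y - Y_hat) ** 2)    # variation left unexplained

print(SS_total, SS_model, SS_residual)
```

For a least-squares fit, the decomposition is exact: #SS_{total} = SS_{model} + SS_{residual}#, which the printed values confirm up to rounding.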
#\phantom{0}#
Once we have divided the total variation in the outcome variable #Y# into the part that can be explained by the regression model and that which cannot, we can calculate the coefficient of determination to get a single measure of the predictive power of the regression model.
#\phantom{0}#
Coefficient of Determination
The coefficient of determination, denoted #R^2#, is the proportion of the total variation in the outcome variable #Y# that can be explained by the regression model.
There are two ways to calculate the coefficient of determination:
\[R^2=\cfrac{SS_{model}}{SS_{total}}\,\,\,\,\,\,\,\,\,\, \text{or} \,\,\,\,\,\,\,\,\,\, R^2=1-\cfrac{SS_{residual}}{SS_{total}} \]
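Both formulas give the same value, as a quick check shows. The sketch below, again assuming NumPy and made-up example data, computes #R^2# both ways and also verifies a useful fact: in simple linear regression, #R^2# equals the squared Pearson correlation between #X# and #Y#:

```python
import numpy as np

# Hypothetical example data: predictor X and outcome Y (values are made up)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([52.0, 58.0, 61.0, 68.0, 70.0, 79.0])

b1, b0 = np.polyfit(X, Y, 1)   # least-squares fit Y = b0 + b1*X
Y_hat = b0 + b1 * X
Y_bar = Y.mean()

SS_total = np.sum((Y - Y_bar) ** 2)
SS_model = np.sum((Y_hat - Y_bar) ** 2)
SS_residual = np.sum((Y - Y_hat) ** 2)

# The two formulas for the coefficient of determination
R2_from_model = SS_model / SS_total
R2_from_residual = 1 - SS_residual / SS_total

# In simple linear regression, R^2 also equals the squared
# Pearson correlation between X and Y
r = np.corrcoef(X, Y)[0, 1]
print(R2_from_model, R2_from_residual, r ** 2)
```

The three printed numbers agree up to floating-point rounding.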
Interpreting the Coefficient of Determination
The coefficient of determination #R^2# always takes on a value between #0# and #1#:
- An #R^2# of #0# indicates that none of the variation in the outcome variable #Y# can be explained by the regression model.
- An #R^2# of #1# indicates that the variation in the outcome variable #Y# can be perfectly explained by the regression model.
- An #R^2# between #0# and #1# indicates that some of the variation in the outcome variable #Y# can be explained by the regression model.
If, for example, we find a coefficient of determination of #R^2=0.72#, this means that #72\%# of the total variation in the outcome variable #Y# can be explained by the regression model.