9. Simple Linear Regression
Residuals
Simple linear regression analysis finds the best-fitting straight line through a set of data points. The fact that a line is best-fitting, however, does not guarantee that the regression model will be useful for making predictions.
#\phantom{0}#
Predictive Power of the Model
If the outcome variable #Y# and the predictor variable #X# are only weakly related to one another, then the predictive power of the model will also be weak. Visually, this is represented by the data points lying quite far from the regression line.
#\phantom{0}#
As the relationship between #X# and #Y# becomes stronger, the data will be clustered much more around the regression line and the predictive power of the model increases.
#\phantom{0}#
If all data points fall exactly along the regression line, we speak of a perfect linear relationship and we are able to perfectly predict the values of #Y# on the basis of the values of #X#. This will never happen in practice, however.
#\phantom{0}#
A good way to assess the predictive power of a linear regression model is to investigate the residuals of the model.
#\phantom{0}#
Residuals
A residual is the difference between the observed and predicted value of #Y#. Taken together, the residuals give a good indication of the magnitude of the prediction error of the regression equation.
The residuals in the population are denoted by the Greek letter epsilon #\epsilon#; sample residuals are denoted by #e#.
The residual of the #i^{th}# observation is calculated as follows:
\[e_i = Y_i - \hat{Y}_i\]
Residuals are visually represented by the vertical distance between the regression line and a data point:
- Data points that lie above the regression line produce positive residuals.
- Data points that lie below the regression line produce negative residuals.
Calculation of Residuals
Consider the regression equation #\hat{Y}=1+2X# and the data points #\blue{(1,4)}#, #\blue{(2,8)}#, and #\blue{(4,5)}#. The residuals of these three data points are calculated as follows:
- For the first point #(1,4)#:
- #\purple{\hat{Y}_1}=1 +2\cdot 1=3#
- #\blue{Y_1} = 4#
- #\orange{e_1}= Y_1-\hat{Y}_1=4-3=1#
- For the second point #(2,8)#:
- #\purple{\hat{Y}_2}=1+2\cdot 2=5#
- #\blue{Y_2}=8#
- #\orange{e_2}= Y_2-\hat{Y}_2 = 8-5 =3#
- For the last point #(4,5)#:
- #\purple{\hat{Y}_3} =1 + 2\cdot 4=9#
- #\blue{Y_3}=5#
- #\orange{e_3}= Y_3-\hat{Y}_3 = 5-9=-4#
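The three calculations above can be sketched in a few lines of Python (the data points and regression equation are the ones from this example):

```python
# Residuals for the regression equation Y_hat = 1 + 2*X
# at the three example data points (1,4), (2,8), and (4,5).
points = [(1, 4), (2, 8), (4, 5)]

residuals = []
for x, y in points:
    y_hat = 1 + 2 * x            # predicted value Y_hat_i
    residuals.append(y - y_hat)  # residual e_i = Y_i - Y_hat_i

print(residuals)  # [1, 3, -4]
```

The output matches the residuals computed by hand: #e_1 = 1#, #e_2 = 3#, and #e_3 = -4#.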
#\phantom{0}#
Now that we have introduced the concept of residuals, we can finally introduce the full form of the simple linear regression model.
#\phantom{0}#
Simple Linear Regression Model
A simple linear regression model is described by the following equation:
\[\begin{array}{rcccl}
Y_i &=& \hat{Y}_i + e_i &=& b_0 + b_1X_i + e_i
\end{array}\]
If the fit of the regression model is perfect, then all residuals will be zero. If the fit of the model is poor, then the observed and predicted values of #Y# will lie far apart and the residuals will be large.
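As a brief illustration of the full model, the sketch below fits a least-squares line to a small made-up sample (the data are hypothetical, not from the text) and verifies that each observation decomposes exactly into a predicted value plus a residual, #Y_i = \hat{Y}_i + e_i#:

```python
import numpy as np

# Hypothetical sample data: Y roughly follows 2*X with some noise.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates of the slope b1 and intercept b0.
b1, b0 = np.polyfit(X, Y, 1)

Y_hat = b0 + b1 * X  # predicted values
e = Y - Y_hat        # residuals

# Each observation is exactly the prediction plus its residual.
assert np.allclose(Y, Y_hat + e)

# For a least-squares fit, the residuals sum to (approximately) zero.
print(abs(e.sum()) < 1e-10)  # True
```

Small residuals here indicate a good fit; with noisier data the residuals, and hence the prediction error, would grow.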