Formulas, Statistical Tables and R Commands: Formulas
Regression formulas
1. Simple linear regression
Regression equation simple linear regression
\begin{equation*} \widehat{y}_i=a+bx_i,
\end{equation*}where the regression coefficient is estimated by
\begin{equation*} b=r_{xy} \left(\frac{s_y}{s_x}\right)
\end{equation*} and the intercept is estimated by
\begin{equation*} a=\overline{y}-b\overline{x}.
\end{equation*}
Residual
The residual (or prediction error) is\begin{equation*} (y_i -\widehat{y}_i),
\end{equation*}where #y_i# is the observed value and #\widehat{y}_i# the predicted value for person #i#.
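As an illustration (not part of the formula sheet itself), a minimal R sketch with made-up data showing that lm() reproduces these estimates of #a# and #b#, and that residuals() returns #y_i-\widehat{y}_i#:

x <- c(1, 2, 3, 4, 5)                # made-up predictor values
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)      # made-up outcome values
b <- cor(x, y) * sd(y) / sd(x)       # b = r_xy * (s_y / s_x)
a <- mean(y) - b * mean(x)           # a = ybar - b * xbar
fit <- lm(y ~ x)                     # least-squares fit
coef(fit)                            # same values as c(a, b)
residuals(fit)                       # the residuals y_i - yhat_i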
Sums of squares for y
\begin{eqnarray*} \sum_{i=1}^n\limits (y_i-\overline{y})^2
= & \sum_{i=1}^n\limits (\widehat{y}_i-\overline{y})^2 & +
\sum_{i=1}^n\limits (y_i-\widehat{y}_i)^2, \\ \text{also referred to as:} \\
SS_y = & SS_{\widehat{y}-\overline{y}} & + \text{ } SS_{y-\widehat{y}},\\ \text{or as:} \\
SS_{tot} = & SS_{reg} & + \text{ } SS_{res}
\end{eqnarray*}where #SS_{tot}# is the total sum of squares of #y#, #SS_{reg}# is the regression sum of squares 'explained' by the model and #SS_{res}# is the residual sum of squares.
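A short R check (illustrative only, with made-up data) that the three sums of squares satisfy #SS_{tot}=SS_{reg}+SS_{res}#:

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)
yhat <- fitted(fit)                      # predicted values
SS_tot <- sum((y - mean(y))^2)           # total sum of squares
SS_reg <- sum((yhat - mean(y))^2)        # regression sum of squares
SS_res <- sum((y - yhat)^2)              # residual sum of squares
all.equal(SS_tot, SS_reg + SS_res)       # TRUE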
Proportion explained variation
The proportion explained variation (also called the proportional reduction in prediction error) is
\begin{eqnarray*}
r^2_{xy} &=& \frac{\sum_{i=1}^n\limits (y_i-\overline{y})^2 - \sum_{i=1}^n\limits (y_i-\widehat{y}_i)^2}{\sum_{i=1}^n\limits (y_i-\overline{y})^2},\\
&=&\frac{SS_{tot}-SS_{res}}{SS_{tot}},\\
&=& \frac{SS_{reg}}{SS_{tot}}
\end{eqnarray*}
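In R the proportion explained variation can be obtained in several equivalent ways; a minimal sketch with made-up data:

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)
SS_tot <- sum((y - mean(y))^2)
SS_res <- sum(residuals(fit)^2)
(SS_tot - SS_res) / SS_tot        # proportional reduction in prediction error
cor(x, y)^2                       # equals r_xy^2
summary(fit)$r.squared            # the value reported by lm()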
#t#-test for regression
The test statistic for regression coefficient #b# assuming #H_0#: #\beta=\beta_0=0# is
\begin{equation*}
t=\frac{(b-\beta_0)}{se_b},
\end{equation*}where #se_b# is calculated by software. The statistic follows a #t# distribution with #n-2# degrees of freedom (#df=n-2#), when the assumptions hold.
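For illustration (made-up data), #b#, #se_b# and the #t# statistic can be read off, or recomputed, from the lm() summary:

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)
summary(fit)$coefficients                      # Estimate, Std. Error, t value, Pr(>|t|)
b  <- coef(summary(fit))["x", "Estimate"]
se <- coef(summary(fit))["x", "Std. Error"]
b / se                                         # the t statistic, df = n - 2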
Standardized residual
The standardized residual equals
\begin{equation*}
\frac{y_i-\widehat{y_i}}{se_{y_i-\widehat{y_i}}},
\end{equation*}where #se_{y_i-\widehat{y_i}}#, the standard error for the residual (also referred to as #se_{res}#) is calculated by software.
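In R, rstandard() returns the residuals divided by their estimated standard errors; a minimal sketch with made-up data:

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)
rstandard(fit)     # standardized residuals (y_i - yhat_i) / se_res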
Residual standard deviation
The residual standard deviation based on #n# observations equals
\begin{equation*}
s_{res} = \sqrt\frac{\sum_{i=1}^n\limits{(y_i-\widehat{y_i})^2}}{n-k},
\end{equation*}in other words: #s_{res}=\sqrt{\frac{SS_{res}}{n-k}}#, where #k# equals the number of parameters in the regression equation (#k=2# for simple regression).
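An illustrative R sketch (made-up data): the residual standard deviation computed from the formula above and as reported by sigma():

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)
n <- length(y); k <- 2                     # k = 2 parameters in simple regression
sqrt(sum(residuals(fit)^2) / (n - k))      # s_res = sqrt(SS_res / (n - k))
sigma(fit)                                 # same value from the fitted model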
#95\%# - prediction interval for #y#
\begin{equation*}
\widehat{y}_i-2s\le y_i \le\widehat{y}_i+2s,
\end{equation*} where #s# is the residual standard deviation and 2 is an approximation of #t_{\alpha/2}#.
#95\%# - confidence interval for #\mu_y#
\begin{equation*}
\widehat{y}-2(s/\sqrt{n})\le \mu_y \le\widehat{y}+2(s/\sqrt{n}),
\end{equation*} where #s# is the residual standard deviation, 2 is an approximation of #t_{\alpha/2}# and #n# is the number of observations.
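Both intervals are available from predict() in R, which uses the exact #t_{\alpha/2}# value rather than the approximation 2. A minimal sketch with made-up data:

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)
new <- data.frame(x = 3)
predict(fit, new, interval = "prediction", level = 0.95)   # interval for an individual y
predict(fit, new, interval = "confidence", level = 0.95)   # interval for the mean mu_y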
2. Multiple linear regression
Regression equation multiple linear regression
For the independent variables (predictors) #x_{1},x_{2},x_{3},\ldots,x_{j},\ldots# the regression equation is
\begin{equation*} \widehat{y}_i=a+b_1x_{i1}+b_2x_{i2}+b_3x_{i3}+\ldots+b_jx_{ij}+\ldots
\end{equation*}
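In R, extra predictors are simply added to the lm() formula; a minimal sketch with a made-up data frame:

d <- data.frame(y  = c(3.1, 4.0, 5.2, 5.9, 7.1, 8.3),
                x1 = c(1, 2, 3, 4, 5, 6),
                x2 = c(2, 1, 4, 3, 6, 5))
fit <- lm(y ~ x1 + x2, data = d)   # yhat = a + b1*x1 + b2*x2
coef(fit)                          # a, b1, b2
fitted(fit)                        # the predicted values yhat_i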
Proportion explained variation
The proportion explained variation (proportional reduction in prediction error), or squared multiple correlation coefficient is
\begin{eqnarray*}
R^2&=& \frac{\sum_{i=1}^n\limits (y_i-\overline{y})^2 - \sum_{i=1}^n\limits (y_i-\widehat{y}_i)^2}
{\sum_{i=1}^n\limits (y_i-\overline{y})^2},
\end{eqnarray*}
in other words: #R^2 = \frac{SS_{tot}-SS_{res}}{SS_{tot}}=\frac{SS_{reg}}{SS_{tot}}#.
Multiple correlation coefficient
\begin{equation*}
R= \sqrt{R^2}
\end{equation*}
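Continuing the made-up example, #R^2# and #R# can be read from the lm() summary; #R# also equals the correlation between the observed and the predicted values:

d <- data.frame(y  = c(3.1, 4.0, 5.2, 5.9, 7.1, 8.3),
                x1 = c(1, 2, 3, 4, 5, 6),
                x2 = c(2, 1, 4, 3, 6, 5))
fit <- lm(y ~ x1 + x2, data = d)
R2 <- summary(fit)$r.squared       # proportion explained variation
sqrt(R2)                           # multiple correlation coefficient R
cor(d$y, fitted(fit))              # same value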
#F#-test statistic regression analysis
The null hypothesis that all regression coefficients equal zero is tested using
\begin{equation*}
F = \frac{\displaystyle\frac{\sum_{i=1}^n\limits (\widehat{y}_i- \overline{y})^2}{k-1}}{\displaystyle\frac{\sum_{i=1}^n\limits (y_i-\widehat{y}_i)^2}{n-k}} = \frac{\displaystyle\frac{SS_{reg}}{df_{reg}}}{\displaystyle\frac{SS_{res}}{df_{res}}} = \frac{MS_{reg}}{MS_{res}},
\end{equation*}where #k# equals the number of parameters in the regression equation and #n# the number of observations. The degrees of freedom are #df_{reg}=k-1# and #df_{res}=n-k#. #df_{reg}# and #df_{res}# are often referred to as #df_1# and #df_2#.
#MS# denotes mean squares. #MS_{res}# denotes the residual variance.
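An illustrative R sketch (same kind of made-up data): the overall #F# statistic with #df_1# and #df_2# is part of the lm() summary, and anova() shows sums of squares and mean squares:

d <- data.frame(y  = c(3.1, 4.0, 5.2, 5.9, 7.1, 8.3),
                x1 = c(1, 2, 3, 4, 5, 6),
                x2 = c(2, 1, 4, 3, 6, 5))
fit <- lm(y ~ x1 + x2, data = d)
summary(fit)$fstatistic            # F value, df_reg = k - 1, df_res = n - k
anova(fit)                         # sequential SS per predictor and the residual mean square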
Test statistic for #b_j#
The test statistic for regression coefficient #b_j# assuming #H_0#: #\beta_j=\beta_0=0# is
\begin{equation*}
t_{b_j}=\frac{(b_j-\beta_0)}{se_{b_j}},
\end{equation*} where #se_{b_j}# is calculated by software. The statistic follows a #t# distribution with #n-k# degrees of freedom, where #k# equals the number of parameters in the regression equation (#k=2# for simple regression), when the assumptions hold.
#100(1-\alpha)\%# - confidence interval for #\beta_j#
\begin{equation*}
b_j-t_{\alpha/2}\cdot se_{b_j}\le \beta_j
\le b_j+t_{\alpha/2}\cdot se_{b_j}.
\end{equation*}
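Both the #t# statistics and the confidence intervals for the #\beta_j# are available from a fitted lm() object; an illustrative sketch with made-up data:

d <- data.frame(y  = c(3.1, 4.0, 5.2, 5.9, 7.1, 8.3),
                x1 = c(1, 2, 3, 4, 5, 6),
                x2 = c(2, 1, 4, 3, 6, 5))
fit <- lm(y ~ x1 + x2, data = d)
summary(fit)$coefficients          # b_j, se_bj, t = b_j / se_bj, p-value (df = n - k)
confint(fit, level = 0.95)         # b_j +/- t_{alpha/2} * se_bj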
3. Exponential regression
Regression equation simple exponential regression
\begin{equation*} \mu_y = \alpha\beta^x,
\end{equation*}where #\beta >0# must hold. This population-level equation provides the predicted value for the population mean of #y# for a given value of #x#.
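One common way to fit this model in R (a sketch with made-up data, not the only approach) is to take logarithms, since #\ln(\mu_y)=\ln(\alpha)+\ln(\beta)\,x# is linear in #x#; note that this fits the model on the log scale:

x <- 0:5
y <- c(2.1, 3.0, 4.4, 6.1, 9.2, 13.5)   # made up, roughly 2 * 1.5^x
fit <- lm(log(y) ~ x)                   # log(mu_y) = log(alpha) + log(beta) * x
exp(coef(fit))                          # back-transformed estimates of alpha and beta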
4. Logistic regression
Log-odds (logit)
If #Y# is a dichotomous (binary) variable taking on values 0 or 1 and we denote #P(Y=1)# as #p#, then:
\begin{equation*}
\begin{split}
\mbox{odds}&=\frac{p}{1-p} \\
\mbox{log-odds} =\mbox{logit}(p)&=\ln\Bigg(\frac{p}{1-p}\Bigg)\\
p&=\frac{\mbox{odds}}{1+\mbox{odds}}
\end{split}
\end{equation*}To calculate the probability #p# from a log-odds value, use the rule #e^{\ln(x)}=x#.
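A short R illustration of these conversions for an arbitrary probability (here #p=0.8#):

p <- 0.8
odds <- p / (1 - p)               # 4
logit <- log(odds)                # log-odds; same as qlogis(p)
exp(logit) / (1 + exp(logit))     # back to p = 0.8, using e^(ln(x)) = x; same as plogis(logit)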
Regression equation simple logistic regression
\begin{equation*}
P(Y=1)=\frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}},
\end{equation*}for which the following holds:
\begin{equation*}
\mbox{logit}\big(P(Y=1)\big) = \alpha+\beta x.
\end{equation*}
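In R this model is fitted with glm() and a binomial family; a minimal sketch with made-up 0/1 data:

x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 0, 1, 0, 1, 1, 1)
fit <- glm(y ~ x, family = binomial)                  # logit(P(Y = 1)) = alpha + beta * x
coef(fit)                                             # estimates of alpha and beta
predict(fit, data.frame(x = 5), type = "response")    # estimated P(Y = 1) at x = 5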