Good evening everyone,

Following my last post on multiple regression, I would like to talk about ANOVA tables, as they are a very important part of the Level II curriculum on quantitative methods. ANOVA stands for ANalysis Of VAriance; it helps to understand how well a model does at explaining the dependent variable.

First of all, recall that $Y=\{Y_i\},~ i=1,\dots,n$, denotes the observed values of the dependent variable and that $\hat{Y}=\{\hat{Y}_i\},~ i=1,\dots,n$, are the values estimated by the model. We define the following values:

**Total Sum of Squares** (SST) :

$$\text{SST}=\sum_{i=1}^n (Y_i - \bar{Y})^2$$

This is the total variation of the process $Y$, i.e. the sum of the squared deviations of the $Y_i$ from the mean of the process, denoted $\bar{Y}$. With the regression, this total variation is what we are trying to reproduce.

**Regression Sum of Squares** (RSS):

$$\text{RSS}=\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$$

This is the variation *explained* by the regression model. If the model fitted the dependent variable perfectly, we would have $\text{RSS}=\text{SST}$.

**Sum of Squared Errors** (SSE):

$$\text{SSE}=\sum_{i=1}^n (Y_i - \hat{Y}_i )^2=\sum_{i=1}^n \epsilon_i ^2$$

Finally, this is the *unexplained* variation: the sum of the squared differences between the observed values of the process $Y_i$ and the values estimated by the model $\hat{Y}_i$.

As expected, the total variation is equal to the sum of the explained variation and the unexplained variation:

$$\text{SST}=\text{RSS} + \text{SSE}$$

Note that the CFA does not require candidates to be able to compute these values (it would take too long), but I thought that having the definitions helps in understanding the concepts.
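To make the definitions concrete, here is a minimal sketch in Python. The data set and the hand-rolled one-variable OLS fit are entirely made up for illustration; the point is only to see SST, RSS, and SSE come out of their definitions:

```python
# Toy data set (made up for illustration) and a one-regressor OLS fit.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(y)

mean_x = sum(x) / n
mean_y = sum(y) / n

# OLS slope and intercept for a single independent variable.
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares from the definitions above.
sst = sum((yi - mean_y) ** 2 for yi in y)                 # total variation
rss = sum((yhi - mean_y) ** 2 for yhi in y_hat)           # explained variation
sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))   # unexplained variation

print(round(sst, 6), round(rss, 6), round(sse, 6))  # 10.0 6.4 3.6
```

On this toy sample the identity holds as expected: $10.0 = 6.4 + 3.6$.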

From these values, we can get the first important statistic we want to look at when discussing the quality of a regression model:

$$\text{R}^2=\frac{\text{RSS}}{\text{SST}}$$

The $\text{R}^2$ measures the share of the total variation that is explained by the regression model. Its value is bounded between 0 and 1, and the closer it gets to 1, the better the model fits the real data.
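As a quick sanity check, note that since $\text{SST}=\text{RSS}+\text{SSE}$, the $\text{R}^2$ can equivalently be written as $1 - \text{SSE}/\text{SST}$. A tiny sketch with hypothetical sums of squares:

```python
# Hypothetical sums of squares (illustrative values only).
rss, sse = 6.4, 3.6
sst = rss + sse

r2 = rss / sst           # explained share of the total variation
r2_alt = 1 - sse / sst   # equivalent form via the unexplained share

print(round(r2, 6), round(r2_alt, 6))  # 0.64 0.64
```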

Now we also want to compute the averages of $\text{RSS}$ and $\text{SSE}$ (the mean sums of squares):

$$\text{MSR} = \frac{\text{RSS}}{k}$$

$$\text{MSE} = \frac{\text{SSE}}{n-k-1}$$

where $n$ is the size of the sample and $k$ is the number of independent variables used in the model. These values are “intermediary” computations and are used in several other statistics. First, we can compute the standard error of the error terms $\epsilon$ (SEE):

$$ \text{SEE}=\sqrt{\text{MSE}} $$

Note that this is just the classic standard-deviation computation applied to the residuals, with $k+1$ degrees of freedom lost (hence the $n-k-1$ in the denominator). If the model fits the data well, then $\text{SEE}$ will be close to 0 (its lower bound).
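The two mean sums of squares and the SEE can be sketched as follows; the sample size, number of regressors, and sums of squares below are hypothetical:

```python
import math

# Hypothetical inputs (illustrative values only): n observations,
# k independent variables, and the sums of squares of a fitted model.
n, k = 5, 1
rss, sse = 6.4, 3.6

msr = rss / k            # mean regression sum of squares
mse = sse / (n - k - 1)  # mean squared error
see = math.sqrt(mse)     # standard error of the error terms

print(round(msr, 4), round(mse, 4), round(see, 4))  # 6.4 1.2 1.0954
```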

Now, there is an important test in regression analysis based on the F-statistic. Basically, this test has the null hypothesis that all the slope coefficients of the regression are jointly insignificant: $H_0 : b_1=b_2=\dots=b_k=0$. It is computed as follows:

$$\text{F}=\frac{\text{MSR}}{\text{MSE}}$$

$\text{F}$ is a random variable that follows an F-distribution with $k$ degrees of freedom in the numerator and $n-k-1$ degrees of freedom in the denominator. The critical value can be found in the F-distribution table provided with the CFA exam. It is very important to understand that if you reject $H_0$, you are saying that **at least one of the coefficients is statistically significant**. This by no means implies that all of them are!
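Here is a small sketch of the decision rule with made-up numbers; the 5% critical value for $F(1, 3)$, roughly 10.13, is read off a standard F table:

```python
# Hypothetical regression summary (illustrative values only).
n, k = 5, 1
msr, mse = 6.4, 1.2

f_stat = msr / mse  # F-statistic with k and n-k-1 degrees of freedom

# 5% critical value for F(1, 3), taken from a standard F table.
f_crit = 10.13

if f_stat > f_crit:
    print("Reject H0: at least one coefficient is significant")
else:
    print("Fail to reject H0")  # here F is about 5.33 < 10.13
```

With such a tiny sample, the F-statistic fails to clear the critical value even though the $\text{R}^2$ looked decent, which is exactly why the formal test matters.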

To sum up, you can look at the following table known as the ANOVA table:

Source of variation | Degrees of Freedom | Sum of Squares | Mean Sum of Squares
---|---|---|---
Regression (explained) | k | RSS | MSR
Error (unexplained) | n-k-1 | SSE | MSE
Total | n-1 | SST | |
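Putting everything together, here is a hypothetical helper (the function name and layout are my own, not CFA notation) that builds the ANOVA table rows from observed and fitted values:

```python
def anova_table(y, y_hat, k):
    """Return ANOVA rows as {source: (df, sum of squares, mean sum of squares)}."""
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    rss = sum((yhi - mean_y) ** 2 for yhi in y_hat)
    sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    return {
        "Regression": (k, rss, rss / k),
        "Error": (n - k - 1, sse, sse / (n - k - 1)),
        "Total": (n - 1, sst, None),  # no mean sum of squares for the total row
    }

# Illustrative observed vs. fitted values for a one-regressor model.
table = anova_table([2, 1, 4, 3, 5], [1.4, 2.2, 3.0, 3.8, 4.6], k=1)
for source, (df, ss, mss) in table.items():
    print(source, df, round(ss, 4), mss if mss is None else round(mss, 4))
```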

That’s it for tonight, I hope you enjoyed the ride!

Jérémie, two copy-paste errors: the second SST (2nd formula) is rather “RSS” and the third SST (3rd formula) should be “SSE” , I guess.

Oh yeah right. Thanks for pointing that out!