Following my first post about the Level II curriculum, which focused on correlation, I am now moving on to the main topic of this year’s curriculum: multiple regression. Before I get started, I want to mention that the program covers regression in two steps: it first discusses the method with one independent variable and then with multiple variables. I will only talk about the multiple-variable version, because it is the general case. If you understand the general framework, you can answer any question about the specific one-variable case.

Multiple regression is all about describing a *dependent* variable $Y$ with a linear combination of $K$ *independent* variables $X_k$, $k \in \{1, \dots, K\}$. This is expressed mathematically as follows:

$$ Y_i = b_0 + b_1 X_{1,i} + \dots + b_K X_{K,i} + \epsilon_i$$

Basically, you are trying to estimate the variable $Y_i$ from the values of the different $X_{k,i}$. The regression process consists of estimating the parameters $b_0, b_1, \dots, b_K$ with an optimization method over a sample of size $n$ (there are $n$ known values for each independent variable $X_k$ and for the dependent variable $Y$, indexed by $i$). When all $X_k=0$, $Y$ takes the default value $b_0$ (called *the intercept*). The error term $\epsilon_i$ is there because the model will not be able to determine $Y_i$ exactly; it is the residual part of $Y_i$ that the model leaves unexplained, and it is assumed to be normally distributed with mean 0 and a variance that is the same for all $i$.

So to sum up, the inputs are:

- $n$ known values of $Y$
- $n$ known values of each $X_k$

and the outputs are:

- $b_0, b_1, \dots, b_K$

So, given a set of new values for each $X_k$, you can estimate the value of $Y$, denoted $\hat{Y}$, by computing:

$$ \hat{Y} = b_0 + b_1 X_1 + \dots + b_K X_K$$
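To make the inputs and outputs concrete, here is a minimal sketch in Python. It assumes that the optimization method is ordinary least squares (the usual choice, although I have not named it above) and it relies on NumPy. The function names `fit_ols` and `predict`, as well as the data, are made up for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Estimate b_0, ..., b_K by least squares.

    X has shape (n, K): n observations of the K independent variables.
    y has shape (n,): n observations of the dependent variable.
    """
    n = X.shape[0]
    # Prepend a column of ones so that the first coefficient is the intercept b_0.
    design = np.column_stack([np.ones(n), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

def predict(coeffs, x_new):
    """Compute Y_hat = b_0 + b_1 X_1 + ... + b_K X_K for one new observation."""
    return coeffs[0] + np.dot(coeffs[1:], x_new)

# Synthetic, made-up sample: n = 50 observations, K = 2 independent variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

b = fit_ols(X, y)                             # estimates of b_0, b_1, b_2
y_hat = predict(b, np.array([0.3, -1.2]))     # estimate of Y for new X values
```

The estimates in `b` should land close to the values used to generate the data (1, 2 and -0.5), but not exactly on them, because of the error term.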

The most important part of this section is the enumeration of its underlying assumptions:

- There is a **linear** relationship between the independent variables $X_k$ and the dependent variable $Y$.
- The independent variables $X_k$ are **not correlated with each other**.
- The expected value of $\epsilon_i$ is **0** for all $i$.
- The variance of $\epsilon_i$ is **constant** for all $i$.
- The error term $\epsilon$ is **normally distributed**.
- The error terms $\epsilon_1, \dots, \epsilon_n$ are **not correlated with each other**.

If one of these assumptions does not hold for the sample being analyzed, then the model is misspecified; we will see in a subsequent post how to detect this problem and how to handle it. Note that the second assumption only concerns collinearity among the independent variables; there is no problem if $Y$ is correlated with one of the $X_k$, which is exactly what we’re looking for.

Now, remember that in my first post on Quant Methods I said that we would have to compute the statistical significance of estimated parameters. This is exactly what we are going to do now. The thing is, the output parameters of a regression (the coefficients $b_0, b_1, \dots, b_K$) are only statistical estimates, so there is uncertainty around them. Therefore, the regression algorithm usually also outputs the standard error $s_{b_k}$ of each estimated parameter. This allows us to build a statistical test to determine whether the estimate $\hat{b}_k$ is statistically different from some hypothesized value $b_k$ at a confidence level of $1-\alpha$. The null hypothesis $H_0$, which we want to reject, is that the true coefficient equals the hypothesized value $b_k$ (written loosely as $\hat{b}_k=b_k$). The test statistic is:

$$ t=\frac{\hat{b}_k - b_k}{s_{b_k}}$$

If the null hypothesis $H_0$ is true, the statistic $t$ follows a t-distribution with $n-K-1$ degrees of freedom. So, you can simply look up the critical value $c_\alpha$ in a t-distribution table for the desired level of confidence (the test is two-tailed, so each tail holds $\alpha/2$). If $t>c_\alpha$ or if $t<-c_\alpha$, then $H_0$ can be *rejected* and we can conclude that $\hat{b}_k$ is **statistically different** from $b_k$.
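The decision rule is simple enough to sketch in a few lines of plain Python. The function names and all the numbers below are hypothetical; the critical value is the one I would read from a t-table for 60 degrees of freedom (for instance $n = 63$ and $K = 2$) with a two-tailed 5% test.

```python
def t_statistic(b_hat, b_hypothesized, std_error):
    """t = (estimate - hypothesized value) / standard error of the estimate."""
    return (b_hat - b_hypothesized) / std_error

def reject_null(t, critical_value):
    """Two-tailed rule: reject H_0 if t > c_alpha or t < -c_alpha."""
    return t > critical_value or t < -critical_value

# Hypothetical regression output: estimate 1.6 with a standard error of 0.25,
# tested against a hypothesized value of 1.0.
t = t_statistic(b_hat=1.6, b_hypothesized=1.0, std_error=0.25)  # roughly 2.4

# Two-tailed 5% critical value for n - K - 1 = 60 degrees of freedom (t-table).
c_alpha = 2.000

rejected = reject_null(t, c_alpha)
```

With these made-up numbers, $t$ is above the critical value, so $H_0$ would be rejected at the 5% level.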

Usually, we are asked to determine whether some estimate $\hat{b}_k$ is **statistically significant**. As explained in the previous post, this means that we want to test the null hypothesis that the coefficient is zero, i.e. that the hypothesized value is $b_k = 0$. So, you can just run the same test as before with $b_k=0$:

$$ t=\frac{\hat{b}_k}{s_{b_k}}$$
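In practice the software gives you the standard errors, but it can help to see where they come from. The sketch below assumes ordinary least squares and uses the textbook formula in which the standard errors are the square roots of the diagonal of $s^2 (X^\top X)^{-1}$, with $s^2$ the sum of squared residuals divided by $n-K-1$. The function name `significance_table` and the data are made up, and the critical value is again read from a t-table rather than computed.

```python
import numpy as np

def significance_table(X, y, critical_value):
    """Fit by least squares and test H_0: b_k = 0 for every coefficient.

    Returns (coeffs, std_errors, t_stats, significant), one entry per
    coefficient, starting with the intercept b_0.
    """
    n, K = X.shape
    design = np.column_stack([np.ones(n), X])        # column of ones for b_0
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)

    residuals = y - design @ coeffs
    dof = n - K - 1                                  # degrees of freedom
    s2 = residuals @ residuals / dof                 # estimated error variance
    cov = s2 * np.linalg.inv(design.T @ design)      # covariance of the estimates
    std_errors = np.sqrt(np.diag(cov))

    t_stats = coeffs / std_errors                    # t = b_hat / s_b
    significant = np.abs(t_stats) > critical_value   # two-tailed decision
    return coeffs, std_errors, t_stats, significant

# Synthetic, made-up sample: n = 63 and K = 2, hence 60 degrees of freedom.
rng = np.random.default_rng(1)
X = rng.normal(size=(63, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=63)        # X_2 has no real effect

# 2.000 is the two-tailed 5% critical value for 60 degrees of freedom.
coeffs, std_errors, t_stats, significant = significance_table(X, y, 2.000)
```

Here the second independent variable plays no role in generating $Y$, so its coefficient will usually (but not always, that is what the 5% means) come out as not significant.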

That’s it for today. The concepts presented here are essential to succeed in the Quantitative Methods part of the CFA Level II exam. They are nonetheless quite easy to grasp, and the formulas are very simple. Next, we will look at how to measure how well a regression model explains the dependent variable.