# CFA Level 2: Quantitative Methods – Autoregressive Processes

Hello again everybody,

We’re entering the home stretch before the exam, so I will post here the content of all the little flash cards that I created for myself.

Picking up where I left off in the Quantitative Methods, this post will be about autoregressive processes, sometimes denoted AR.

These processes are of the following form:

$$x_{t+1} = b_0 + b_1 x_t + \epsilon_{t+1}$$

where $x_t$ is the value of process $X$ at time $t$, $b_0$ and $b_1$ are the parameters we are trying to estimate, and $\epsilon_{t+1}$ is the error term.

To estimate the parameters $b_0$ and $b_1$, we proceed as follows:

1. Estimate the parameters using linear regression
2. Calculate the auto-correlations of the residuals
3. Test whether these auto-correlations are significant:

Note that we cannot use the Durbin-Watson test we used previously in this section of the CFA curriculum; we will be using a t-test that works this way:

$$t = \frac{\rho_{\epsilon_t, \epsilon_{t+k}}}{\frac{1}{\sqrt{T}}}=\rho_{\epsilon_t, \epsilon_{t+k}} \sqrt{T}$$

where $\epsilon_t$ is the residual term of the regression at time $t$, and $T$ is the number of observations. The t-statistic has $T-2$ degrees of freedom. If the auto-correlations are statistically significant, then we cannot continue our analysis, for reasons I’ll explain a bit later in the post.
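To make the three steps concrete, here is a minimal sketch in plain Python (standard library only). The simulated series, its parameters ($b_0 = 1$, $b_1 = 0.5$, unit-variance noise) and the function names `fit_ar1` and `residual_autocorr_t_stats` are my own assumptions for illustration; they are not part of the curriculum.

```python
import math
import random


def fit_ar1(x):
    """Step 1: OLS fit of x[t+1] = b0 + b1 * x[t]; returns (b0, b1, residuals)."""
    lagged, current = x[:-1], x[1:]
    n = len(lagged)
    mean_lag = sum(lagged) / n
    mean_cur = sum(current) / n
    sxy = sum((a - mean_lag) * (b - mean_cur) for a, b in zip(lagged, current))
    sxx = sum((a - mean_lag) ** 2 for a in lagged)
    b1 = sxy / sxx
    b0 = mean_cur - b1 * mean_lag
    residuals = [b - (b0 + b1 * a) for a, b in zip(lagged, current)]
    return b0, b1, residuals


def residual_autocorr_t_stats(residuals, max_lag=4):
    """Steps 2 and 3: t = rho_k * sqrt(T) for each lag k, T = number of residuals."""
    T = len(residuals)
    mean = sum(residuals) / T
    denom = sum((e - mean) ** 2 for e in residuals)
    stats = {}
    for k in range(1, max_lag + 1):
        num = sum(
            (residuals[t] - mean) * (residuals[t + k] - mean) for t in range(T - k)
        )
        stats[k] = (num / denom) * math.sqrt(T)
    return stats


# Simulated data with hypothetical parameters b0 = 1.0 and b1 = 0.5.
random.seed(42)
x = [2.0]
for _ in range(2000):
    x.append(1.0 + 0.5 * x[-1] + random.gauss(0.0, 1.0))

b0, b1, residuals = fit_ar1(x)
t_stats = residual_autocorr_t_stats(residuals)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
for lag, t in t_stats.items():
    print(f"lag {lag}: t = {t:.2f}")
```

Each t value is then compared with the critical value from the t-table; the number of lags to check (4 here) is an arbitrary choice for the sketch.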

With AR processes, you are actually trying to predict the next values of a given process using a linear relationship between successive values and by applying simple linear regression. The thing is, if you want to be able to trust your estimated $b_0$ and $b_1$ parameters, you need the process to be covariance-stationary.

Now, a bit of math. A covariance-stationary process has a finite mean-reverting level. What is the mean-reverting level? Well, it is simply the value $x_t$ at which $x_{t+1}=x_t$. So, let’s write this in an equation:

$$x_{t+1} = x_t = b_0 + b_1 x_t$$

$$(1-b_1) x_t = b_0 \iff x_t=\frac{b_0}{1-b_1}$$

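So, the mean-reverting level is finite as long as $b_1 \neq 1$. Strictly speaking, for the AR(1) process $X$ to be covariance-stationary we need the stronger condition $|b_1| < 1$; the case $b_1 = 1$ is the unit root (random walk) case, where there is no finite mean-reverting level.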
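A quick way to convince yourself of the formula is to iterate the recursion without noise and watch it settle at $\frac{b_0}{1-b_1}$. The sketch below uses hypothetical coefficients ($b_0 = 1$, $b_1 = 0.5$) and a helper name of my own; the convergence only happens because $|b_1| < 1$ here.

```python
def mean_reverting_level(b0, b1):
    """Return b0 / (1 - b1); there is no finite level when b1 == 1."""
    if b1 == 1:
        raise ValueError("b1 = 1: no finite mean-reverting level")
    return b0 / (1 - b1)


b0, b1 = 1.0, 0.5  # hypothetical coefficients
level = mean_reverting_level(b0, b1)

# Iterate the noise-free recursion x <- b0 + b1 * x from a distant start.
x = 10.0
for _ in range(50):
    x = b0 + b1 * x

print(level, x)
```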

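Note that the test for auto-correlations we did in point 3) does not by itself guarantee that the process is covariance-stationary: it checks that the model is correctly specified, i.e. that no significant serial correlation is left in the residuals. Whether the process has a unit root is tested separately in the curriculum, with the Dickey-Fuller test.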

What if the process X is not covariance-stationary? Well you create a new process $Y$ where:

$$y_t = x_t - x_{t-1}$$

So, you have a new model

$$y_t = b_0 + b_1 y_{t-1} + \epsilon_t$$

which models the next change in the process $X$. This differenced process is then typically covariance-stationary, and you can use it for the analysis.

This little “trick” is called first differencing.
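As a small illustration, here is first differencing applied to a simulated random walk with drift. The drift of 0.1, the seed and the helper name `first_difference` are assumptions made for this sketch only.

```python
import random


def first_difference(x):
    """Return y with y[t] = x[t] - x[t-1]; y is one element shorter than x."""
    return [curr - prev for prev, curr in zip(x, x[1:])]


# Hypothetical random walk with drift: x[t] = 0.1 + x[t-1] + noise.
random.seed(0)
x = [0.0]
for _ in range(500):
    x.append(0.1 + x[-1] + random.gauss(0.0, 1.0))

y = first_difference(x)
print(len(x), len(y))
print(sum(y) / len(y))  # sample mean of the changes, an estimate of the drift
```

The AR model is then fitted on `y` instead of `x`.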

That’s it, stay tuned for more soon!

# CFA Level II: Quantitative Methods, ANOVA Table

Good evening everyone,

Following my last post on multiple regression, I would like to talk about ANOVA tables, as they are a very important part of the Level II curriculum on quantitative methods. ANOVA stands for ANalysis Of VAriance; it helps to understand how well a model does at explaining the dependent variable.

First of all, recall that $Y=\{Y_i\}, ~ i=1,\ldots,n$ denotes the real values of the dependent variable and that $\hat{Y}=\{\hat{Y}_i\}, ~ i=1,\ldots,n$ are the values estimated by the model. We define the following values:

Total Sum of Squares (SST):

$$\text{SST}=\sum_{i=1}^n (Y_i - \bar{Y})^2$$

This is the total variation of the process $Y$, i.e. the sum of the squared deviations of $Y_i$ from the mean of the process, denoted $\bar{Y}$. With the regression, this total variation is what we are trying to reproduce.

Regression Sum of Squares (RSS):

$$\text{RSS}=\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$$

This is the variation explained by the regression model. If the model fitted the dependent variable perfectly, we would have $\text{RSS}=\text{SST}$.

Sum of Squared Errors (SSE):

$$\text{SSE}=\sum_{i=1}^n (Y_i - \hat{Y}_i )^2=\sum_{i=1}^n \epsilon_i ^2$$

Finally, this is the unexplained variation: the sum of the squared differences between the real values of the process $Y_i$ and the values estimated by the model $\hat{Y}_i$.

As expected, the total variation is equal to the sum of the explained variation and the unexplained variation:

$$\text{SST}=\text{RSS} + \text{SSE}$$

Note that the CFA curriculum does not require candidates to be able to compute these values (it would take too long), but I thought that having the definitions helps in understanding the concepts.

From these values, we can get the first important statistic we want to look at when discussing the quality of a regression model:

$$\text{R}^2=\frac{\text{RSS}}{\text{SST}}$$

The $\text{R}^2$ measures the part of the total variation that is being explained by the regression model. Its value is bounded from 0 to 1, and the closer it gets to 1 the better the model fits the real data.
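To see the decomposition at work, here is a minimal sketch with a single independent variable. The five data points and the helper `ols_simple` are made up for illustration; any least-squares fit with an intercept would do.

```python
def ols_simple(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1


# Hypothetical sample, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_simple(xs, ys)
y_hat = [b0 + b1 * x for x in xs]
y_bar = sum(ys) / len(ys)

sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
rss = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation
r_squared = rss / sst

print(f"SST = {sst:.4f}, RSS + SSE = {rss + sse:.4f}, R^2 = {r_squared:.4f}")
```

The printed SST and RSS + SSE should agree up to floating-point noise, which is the identity stated above.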

Now we also want to compute the averages of $\text{RSS}$ and $\text{SSE}$ (the mean sums of squares):

$$\text{MSR} = \frac{\text{RSS}}{k}$$

$$\text{MSE} = \frac{\text{SSE}}{n-k-1}$$

where $n$ is the size of the sample and $k$ is the number of independent variables used in the model. These values are “intermediary” computations and are used to compute other statistics. First, we can compute the standard error of the error terms $\epsilon$ (SEE):

$$\text{SEE}=\sqrt{\text{MSE}}$$

Note that this is just the application of the classic computation of the standard deviation with $n-k-1$ degrees of freedom (because $k+1$ parameters were estimated). If the model fits the data well, then $\text{SEE}$ will be close to 0 (its lower bound).

Now, there is an important test in regression analysis which is called the F-test. Basically, this test has the null hypothesis that all the slope coefficients of the regression are equal to zero: $H_0 : b_i=0 ~ \forall i \in \{1,\ldots,k\}$. The F-statistic is computed as follows:

$$\text{F}=\frac{\text{MSR}}{\text{MSE}}$$

$\text{F}$ is a random variable that follows an F-distribution with $k$ degrees of freedom in the numerator and $n-k-1$ degrees of freedom in the denominator. The critical value can be found in the F-distribution table provided with the CFA exam. It is very important to understand that if you reject $H_0$, you say that at least one of the coefficients is statistically significant. This by no means implies that all of them are!
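The chain from the sums of squares to the F-statistic is short enough to write down in a few lines. The inputs below (RSS = 90, SSE = 56, 60 observations, 3 independent variables) are hypothetical numbers picked so the arithmetic is easy to follow, and `anova_stats` is a name I made up.

```python
import math


def anova_stats(rss, sse, n, k):
    """Return MSR, MSE, SEE and F computed from the sums of squares."""
    msr = rss / k            # explained variation per independent variable
    mse = sse / (n - k - 1)  # unexplained variation per degree of freedom
    return {"MSR": msr, "MSE": mse, "SEE": math.sqrt(mse), "F": msr / mse}


# Hypothetical inputs: 60 observations, 3 independent variables.
stats = anova_stats(rss=90.0, sse=56.0, n=60, k=3)
for name, value in stats.items():
    print(f"{name} = {value:.2f}")
```

The resulting F is then compared with the critical value read from the F-table with $k$ and $n-k-1$ degrees of freedom; the sketch deliberately does not compute that critical value.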

To sum up, you can look at the following table known as the ANOVA table:

| Source of variation | Degrees of Freedom | Sum of Squares | Mean Sum of Squares |
|---|---|---|---|
| Regression (explained) | $k$ | RSS | MSR |
| Error (unexplained) | $n-k-1$ | SSE | MSE |
| Total | $n-1$ | SST | |

That’s it for tonight, I hope you enjoyed the ride!

# CFA Level II: Quantitative Methods, Multiple Regression

Following my first post about the Level II and specifically correlation, I am now moving on to the main topic of this year’s curriculum: multiple regression. Before I get started, I want to mention that the program talks about regression in 2 steps: it starts by discussing the method with 1 independent variable and then with multiple variables. I will only talk about the multiple variable version, because it is the general one. If you understand the general framework, you can answer any question about the specific 1 variable case.

Multiple regression is all about describing a dependent variable $Y$ with a linear combination of $K$ variables $X_k, k \in \{1,\ldots,K\}$. This is expressed mathematically as follows:

$$Y_i = b_0 + b_1 X_{1,i} + \ldots + b_K X_{K,i} + \epsilon_i$$

Basically, you are trying to estimate the variable $Y_i$ with the values of the different $X_{k,i}$. The regression process consists in estimating the parameters $b_0, b_1, \ldots, b_K$ with an optimization method over a sample of size $n$ (there are $n$ known values for each of the independent variables $X_k$ and for the dependent variable $Y$, indexed by $i$). When all $X_k=0$, $Y$ has a default value $b_0$ (called the intercept). The error term $\epsilon_i$ is there because the model will not be able to determine $Y_i$ exactly; there is hence a residual part of the value of $Y$ which is unexplained by the model. This residual is assumed to be normally distributed with mean 0 and constant variance for all $i$.

So to sum up, the inputs are:

• $n$ known values of $Y$
• $n$ known values of each $X_k$

and the outputs are:

• $b_0, b_1, \ldots, b_K$

So, say you have a set of new values for each $X_k$; you can estimate the value of $Y$, denoted $\hat{Y}$, by computing:

$$\hat{Y} = b_0 + b_1 X_1 + \ldots + b_K X_K$$

The most important part of this section is the enumeration of its underlying assumptions:

1. There is a linear relationship between the independent variables $X_k$ and the dependent variable $Y$.
2. The independent variables $X_k$ are not correlated with each other.
3. The expected value of $\epsilon_i$ is 0 for all $i$.
4. The variance of $\epsilon_i$ is constant for all $i$.
5. The variable $\epsilon$ is normally distributed.
6. The error terms $\epsilon_1, \ldots, \epsilon_n$ are not correlated with each other.

If one of these assumptions does not hold for the sample being analyzed, then the model is misspecified; we will see in a subsequent post how to detect this problem and how to handle it. Note that point 2) only concerns collinearity between the independent variables; there is no problem if $Y$ is correlated with one of the $X_k$, it’s what we’re looking for.

Now, remember that in my first post on the Quant Methods I said that we would have to compute the statistical significance of estimated parameters. This is exactly what we are going to do now. The thing is, the output parameters of a regression (the coefficients $b_0, b_1, \ldots, b_K$) are only statistical estimates, so there is uncertainty about them. Therefore, the regression algorithm usually outputs the standard error $s_{b_k}$ of each estimate $\hat{b}_k$. This allows us to create a statistical test to determine whether the estimate $\hat{b}_k$ is statistically different from some hypothesized value $b_k$ with a level of confidence of $1-\alpha$. The null hypothesis $H_0$, which we want to reject, is that the true coefficient is equal to $b_k$. The test goes as follows:

$$t=\frac{\hat{b}_k - b_k}{s_{b_k}}$$

If the null hypothesis $H_0$ is true, the variable $t$ follows a t-distribution with $n-K-1$ degrees of freedom. So, you can simply look up the critical value $c_\alpha$ in the t-distribution table for the desired level of confidence. If $t>c_\alpha$ or if $t<-c_\alpha$, then $H_0$ can be rejected and we can conclude that $\hat{b}_k$ is statistically different from $b_k$.

Usually, we are asked to determine whether some estimate $\hat{b}_k$ is statistically significant. As explained in the previous post, this means that we want to test the null hypothesis $H_0: b_k = 0$. So, you can just run the same test as before with $b_k=0$:

$$t=\frac{\hat{b}_k}{s_{b_k}}$$
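In code, the significance test boils down to one division and one comparison. The sketch below uses a made-up coefficient estimate (0.8) and standard error (0.32), and takes the critical value as an input read from a t-table: about 2.0 for a 5% two-tailed test with roughly 60 degrees of freedom. The function names are mine.

```python
def coefficient_t_stat(b_hat, std_error, b_hypothesized=0.0):
    """t = (b_hat - b_hypothesized) / std_error."""
    return (b_hat - b_hypothesized) / std_error


def reject_null(t_stat, critical_value):
    """Two-tailed decision: reject H0 when |t| exceeds the critical value."""
    return abs(t_stat) > critical_value


# Hypothetical regression output for one coefficient.
b_hat, std_error = 0.8, 0.32
# Assumed critical value, read from a t-table rather than computed here.
critical_value = 2.0

t_stat = coefficient_t_stat(b_hat, std_error)
print(f"t = {t_stat:.2f}, significant: {reject_null(t_stat, critical_value)}")
```

Passing a non-zero `b_hypothesized` gives the more general test against any hypothesized value.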

That’s it for today. The concepts presented here are essential to succeed in the Quantitative Methods part of the CFA Level II exam. They are nonetheless quite easy to grasp and the formulas are very simple. Next, we will look at the method to analyze how well a regression model does at explaining the dependent variable.

# CFA Level II, Quantitative Methods: Correlation

Good evening everyone,

So I’m finally getting started writing about the CFA Level II material, and as I proceed in the classic order of the curriculum, I will start with the Quantitative Methods part. If you have a bit of experience in quantitative finance, I believe this part is quite straightforward. Actually, it talks about many different things and therefore I will make a different post for each topic to keep each of them as short as possible. This will hopefully make it as readable and accessible as possible for all types of readers.

As a brief introductory note, I would say that this section is very much like the rest of the CFA Level II curriculum: it builds on the concepts learnt in Level I. Let me write that again: the concepts, not necessarily all the formulas. For the Quant Methods part, you have to be comfortable with the Hypothesis Testing part I recapped in this post.

So let’s get started with the first post, which will be about correlation. This part is actually quite easy because you’ve pretty much seen everything at Level I. Let me restate the two simple definitions of the sample covariance and the sample correlation of two samples $X$ and $Y$:

$$\text{cov}_{X,Y} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$

$$r_{X,Y} = \frac{\text{cov}_{X,Y}}{s_X s_Y}$$

where $s_X$ and $s_Y$ are the standard deviations of the respective samples.

The problem with the sample covariance is that it doesn’t really give you a good idea of the strength of the relationship between the two processes; it very much depends on each of the samples’ variances. This is where the sample correlation is actually useful, because it is bounded: $r_{X,Y} \in [-1,1]$. Taking the simple example of using the same sample twice, we have $\text{cov}_{X,X} = s_X^2$ and $r_{X,X}=1$.
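Here is a minimal sketch of the two definitions in plain Python, using a tiny made-up sample so the numbers can be checked by hand. The function names are my own; the last lines check the "same sample twice" example from the paragraph above.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def sample_corr(xs, ys):
    """Sample correlation: covariance divided by both standard deviations."""
    s_x = math.sqrt(sample_cov(xs, xs))
    s_y = math.sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (s_x * s_y)


# Hypothetical samples, for illustration only.
x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 1.0, 6.0, 8.0]

print(sample_cov(x, y), sample_corr(x, y))
print(sample_cov(x, x))   # equals the sample variance of x
print(sample_corr(x, x))  # equals 1 up to floating-point rounding
```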

In general, I would say that the Quant Methods part of the Level II mainly focuses on understanding underlying models and their assumptions, not on learning and applying formulas, which was more the case in Level I. For correlation, there is one key thing to understand: it detects a linear relationship between two samples. This means that the correlation is only useful to detect a relationship of the kind $Y = aX+b+\epsilon$.

A good way to visualize whether there is a correlation between two samples is to look at a scatter plot. To create a few examples, I decided to use MATLAB as we can generate random processes and scatter plots very easily. So I will create 4 processes:

• A basic $X \sim \mathcal{N}(0,1)$ of size $n=1000$
• A process which is a linear combination of $X$.
• A process which is not a linear combination of $X$.
• And another process $Y \sim \mathcal{N}(0,1)$ of size $n=1000$ independent from $X$.

```matlab
% Parameter definition
n = 1000;

% Process definitions
x = randn(n, 1);          % X ~ N(0,1)
linear_x = 5 + 2 * x;     % linear function of X
squared_x = x.^2;         % nonlinear function of X
y = randn(n, 1);          % Y ~ N(0,1), independent from X

% Plotting: one scatter plot per pair of processes
figure();
scatter(x, y);
figure();
scatter(x, squared_x);
figure();
scatter(x, linear_x);
```


The script presented above generates 3 different scatter plots, which I will now present and comment on. First, let’s take the example of the two independent processes X and Y. In this instance, there is no correlation by definition. This gives a scatter plot like this:

Now, let’s look at the scatter plot for the process which is a linear combination of X. Obviously, this is an example of two processes with positive correlation (actually, perfect correlation).

Graphically, we can say that there is correlation between two samples if the points form a line on the scatter plot. If the line has a positive slope (it goes from bottom-left to top-right), the correlation will be positive. If it has a negative slope (it goes from top-left to bottom-right), the correlation will be negative. The magnitude of the correlation is determined by how closely the points lie on a straight line. In the example above, all the points lie perfectly on a straight line. So a general framework to “estimate” correlation from a scatter plot would be the following:

• Do the points lie approximately on a straight line?
  • If no, then there is no correlation, stop here.
  • If yes, then there is correlation and continue.
• Is the slope of the line positive or negative?
  • If positive, then the correlation $r > 0$.
  • If negative, then the correlation $r < 0$.
  • If there is no slope (vertical or horizontal straight line), then one of the samples is constant and there is no correlation to speak of (strictly, $r$ is undefined because one standard deviation is 0).
• How closely do the points follow a straight line?
  • The more they look to be on a straight line, the closer $|r|$ is to 1.

In the example above, the points look like a straight line, the slope is positive, and the points are very much on a straight line, so $r \simeq 1$.

Finally, let’s look at the third process which is simply the squared values of X, so the relationship is not linear:

If we apply the decision framework presented above, we can clearly see that the points do not lie on a straight line, and hence we conclude that there is no correlation between the samples.

Now let’s look at the values given by MATLAB for the correlations:

| Samples | Correlation |
|---|---|
| Independent processes X and Y | 0.00 |
| X and Linear Combination of X | 1.00 |
| X and squared X | 0.02 |

As we can see, MATLAB confirms what we determined looking at the scatter plots.

I now want to explain a very important concept in the Level II curriculum: the statistical significance of an estimate. Quite simply, given the fact that we estimated some value $\hat{b}$, we say that this estimate is statistically significant with some confidence level $(1-\alpha)$ if we manage to reject the hypothesis $H_0 : b=0$ with that confidence level. To do so, we will perform a statistical test which will depend on the value we are trying to estimate.

For the given post, we would like to determine whether the sample correlation $r$ we estimate is statistically significant. There is a simple test given by the CFA Institute:

$$t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}$$

This variable $t$ follows a Student’s t-distribution with $n-2$ degrees of freedom. This means that, if you know $\alpha$, you can simply look up the critical value $t_c$ in the distribution table and reject the hypothesis $H_0 : r=0$ if $t < -t_c$ or $t > t_c$. Although you are supposed to know how to compute this test yourself, you can also be given the p-value of the estimated statistic. Recall from Level I that the p-value is the minimal $\alpha$ for which $H_0$ can be rejected. Quite simply, if the p-value is smaller than the $\alpha$ you consider for the test, you can reject the null hypothesis $H_0$ and conclude that the estimated value is statistically significant.
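As a sanity check, the test can be sketched in a few lines of Python for the "X and squared X" pair ($r = 0.02$, $n = 1000$). One loud assumption: the standard library has no Student’s t-distribution, so the p-value below uses the standard normal as a large-sample approximation, which is reasonable for $n = 1000$ but not for small samples. The function names are mine.

```python
import math
from statistics import NormalDist


def correlation_t_stat(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def approx_two_tailed_p_value(t):
    """Large-sample approximation: standard normal instead of Student's t."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


t = correlation_t_stat(0.02, 1000)
p = approx_two_tailed_p_value(t)
print(f"t = {t:.3f}, approximate p-value = {p:.2f}")
```

With these inputs the approximate p-value comes out around 0.53, far above any usual $\alpha$, so $H_0 : r=0$ cannot be rejected.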

Let’s look at the p-values MATLAB gives me for the correlation we discussed in the example:

| Samples | Correlation | P-Value |
|---|---|---|
| Independent processes X and Y | -0.01 | 0.77 |
| X and Linear Combination of X | 1.00 | 0.00 |
| X and squared X | 0.02 | 0.53 |

If we consider an $\alpha$ of 5%, we can see that we cannot reject the null hypothesis for the independent and squared processes, so these correlations are not statistically significant. On the other hand, the linear combination of X has a p-value close to 0, which means that its correlation is statistically significant for virtually any $\alpha$.

One last word on the important points to keep in mind about correlation:

• Correlation detects only a linear relationship; not a nonlinear one.
• Correlation is very sensitive to outliers (“weird” points, often erroneous, included in the datasets).
• Correlation can be spurious; you can detect a statistically significant correlation even though there is no economic rationale behind it.

I hope you enjoyed this first post on the CFA Level II, and I’ll be back shortly with more!

# CFA Level I: Hypothesis Testing

Good evening,

As I keep practicing towards the Level I exam, I want to finish my review of the Quantitative Finance material. This post will hence be the follow-up to my previous post. I will discuss here how you test a hypothesis on some statistical measure.

## General Concept

The main concept is as follows: you make an initial hypothesis which is called the null hypothesis, $H_0$, and which is the statement you want to reject. If $H_0$ is rejected, we can hence accept the alternative hypothesis $H_a$. Of course, in statistics you can never be sure of anything. Hence, you can only reject the null hypothesis with a certain significance level $\alpha$ (that is, with a confidence level of $1-\alpha$). It is important to understand that if you can’t reject the null hypothesis, it does not mean that you can accept it! Hence, rejecting the null hypothesis is more powerful than failing to reject it. So, if you want to prove some statement, you should test for its opposite as the null hypothesis; if you can reject it, then you can accept the alternative hypothesis, which is your original statement. A test can either be one-tailed or two-tailed, depending on what you want to test.

## First example

Let’s take a simple example: you measure a sample of the returns of the S&P that you assume to be normal. You measure a sample mean $\mu_s$ and a sample standard deviation $s$. By the central limit theorem, we know that the estimate of the mean follows a law $\mathcal{N}(\mu_0, \frac{\sigma^2}{n})$ where $\mu_0$ is the population mean and $\sigma$ is the population standard deviation. Now, you cannot use your measure $\mu_s$ to say that you found the true population mean, because it’s just a sample statistic. However, what you can say is that, given the sample statistic $\mu_s$ you found and a given significance level $\alpha$, it is highly improbable that the population mean $\mu_0$ is equal to some value $x$. Hence the null hypothesis is $\mu_0=x$.

Let’s assume that you found that $\mu_s=0.1\%$, $s=0.25\%$ and $n=250$. You want to show that $\mu_0 \neq 0$, so your null hypothesis is $\mu_0 = 0$, with a significance level $\alpha=5\%$. You can compute the z-statistic as follows:

$$z= \frac{\mu_s-\mu_0}{\frac{s}{\sqrt{n}}} \sim \mathcal{N}(0,1)$$

Now if $\mu_0$ was equal to 0, you know that there is only a 5% chance that the z-statistic would lie outside the range $0 \pm z_{2.5\%} = \pm 1.96$ (because the sample statistic can lie on both sides of the distribution). Here the z-statistic is $\frac{0.1\%}{0.25\%/\sqrt{250}} \approx 6.32$, which is more than 1.96, so you can reject the null hypothesis $\mu_0=0$ and accept the alternative hypothesis $\mu_0 \neq 0$.
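The same computation can be sketched in Python. I am assuming here that both the sample mean and the sample standard deviation are expressed in percent (0.1% and 0.25%), which is the reading that gives a z-statistic of roughly 6.3; the function name `z_statistic` is mine.

```python
import math
from statistics import NormalDist


def z_statistic(sample_mean, hypothesized_mean, sample_std, n):
    """z = (sample mean - hypothesized mean) / (s / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (sample_std / math.sqrt(n))


alpha = 0.05
z = z_statistic(sample_mean=0.001, hypothesized_mean=0.0, sample_std=0.0025, n=250)

# Two-tailed critical value: alpha / 2 in each tail of the standard normal.
critical = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(z) > critical

print(f"z = {z:.2f}, critical value = {critical:.2f}, reject H0: {reject}")
```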

This was a two-tailed test. A one-tailed test would look at only one side of the distribution: $H_0: \mu_0 \geq 0$ or $H_0: \mu_0 \leq 0$. As a rule of thumb, you can remember that the null hypothesis always contains the “equal” sign.

## P-values

The p-value is the probability (assuming the null hypothesis is true) of obtaining a sample statistic at least as extreme as the one being measured. You compute it by looking at the test statistic (about 6.3 in the previous example) and finding the probability (using the Z-table) that a test statistic can be above that value (multiplied by 2 if the test is two-tailed). In this case, the p-value is very close to 0. Now, if the p-value of the test statistic is below the significance level of the test, you can reject the null hypothesis. This is useful if you want to discuss statistics without having to impose a certain significance level $\alpha$ on the reader; you can just display the p-value and let them decide whether it’s good enough or not.
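Instead of the Z-table, the lookup can be done with Python’s standard library. This is only a sketch for a z-test; the function name and the `two_tailed` flag are my own choices.

```python
from statistics import NormalDist


def p_value(z, two_tailed=True):
    """P-value of a z-statistic under the standard normal distribution."""
    if two_tailed:
        return 2 * (1 - NormalDist().cdf(abs(z)))
    # One-tailed (upper tail): probability of a statistic above z.
    return 1 - NormalDist().cdf(z)


print(p_value(1.96))  # close to 0.05, the usual two-tailed threshold
print(p_value(6.32))  # practically 0 for the example above
```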

There are some other hypothesis tests presented in the curriculum, but this is the main framework to remember. It’s pretty easy, and it allows you to score a lot of points in the quantitative finance part of the exam.