# CFA Level II, Quantitative Methods: Correlation

Good evening everyone,

So I’m finally getting started on writing about the CFA Level II material, and as I proceed in the classic order of the curriculum, I will start with the Quantitative Methods part. If you have a bit of experience in quantitative finance, I believe this part is quite straightforward. It covers many different things, so I will write a separate post for each topic to keep each of them as short as possible. This will hopefully make them readable and accessible for all types of readers.

As a brief introductory note, I would say that this section is very much like the rest of the CFA Level II curriculum: it builds on the concepts learnt in Level I. Let me write that again: the concepts, not necessarily all the formulas. For the Quant Methods part, you have to be comfortable with the Hypothesis Testing part I recapped in this post.

So let’s get started with the first post, which is about correlation. This part is actually quite easy because you’ve pretty much seen everything at Level I. Let me restate the two simple definitions of the sample covariance and the sample correlation of two samples $X$ and $Y$:

$$\text{cov}_{X,Y} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$

$$r_{X,Y} = \frac{\text{cov}_{X,Y}}{s_X s_Y}$$

where $s_X$ and $s_Y$ are the standard deviations of the respective samples.

The problem with the sample covariance is that it doesn’t really give you a good idea of the strength of the relationship between the two processes; it very much depends on each of the samples’ variances. This is where the sample correlation is actually useful, because it is bounded: $r_{X,Y} \in [-1,1]$. Taking the simple example of using the same sample twice, we have $\text{cov}_{X,X} = s_X^2$ and $r_{X,X}=1$.
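To make the two definitions concrete, here is a minimal MATLAB sketch that computes them by hand. The sample `y`, the seed, and the scaling factor of 100 are my own illustrative choices, not part of the curriculum.

```matlab
% Sketch: sample covariance and correlation computed from their definitions.
rng(1);                      % fixed seed (arbitrary) so the run is repeatable
n = 1000;
x = randn(n,1);
y = 5 + 2*x + randn(n,1);    % hypothetical sample: linear in x plus noise

% Sample covariance: sum of cross-deviations divided by n-1
cov_xy = sum((x - mean(x)) .* (y - mean(y))) / (n - 1);

% Sample correlation: covariance scaled by both sample standard deviations
r_xy = cov_xy / (std(x) * std(y));

% Rescaling y by 100 multiplies the covariance by 100 but leaves r unchanged
y_scaled = 100*y;
cov_scaled = sum((x - mean(x)) .* (y_scaled - mean(y_scaled))) / (n - 1);
r_scaled = cov_scaled / (std(x) * std(y_scaled));

fprintf('cov = %.4f, r = %.4f\n', cov_xy, r_xy);
fprintf('cov (scaled) = %.4f, r (scaled) = %.4f\n', cov_scaled, r_scaled);
```

The hand-computed values should agree with the off-diagonal entries returned by the built-in `cov(x,y)` and `corrcoef(x,y)`, which also divide by $n-1$.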

In general, I would say that the Quant Methods part of Level II mainly focuses on understanding the underlying models and their assumptions, not on learning and applying formulas, which was more the case in Level I. For correlation, there is one key thing to understand: it detects a linear relationship between two samples. This means that correlation is only useful to detect a relationship of the kind $Y = aX+b+\epsilon$.
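To see what the noise term does to the correlation, here is a short derivation under assumptions that are mine and not taken from the curriculum: $\epsilon$ is uncorrelated with $X$ and has variance $\sigma_\epsilon^2$, $X$ has variance $\sigma_X^2$, and $a \neq 0$. Writing $\rho_{X,Y}$ for the population correlation:

$$\text{Cov}(X,Y) = a\,\sigma_X^2, \qquad \text{Var}(Y) = a^2\sigma_X^2 + \sigma_\epsilon^2$$

$$\rho_{X,Y} = \frac{a\,\sigma_X^2}{\sigma_X\sqrt{a^2\sigma_X^2 + \sigma_\epsilon^2}} = \frac{a\,\sigma_X}{\sqrt{a^2\sigma_X^2 + \sigma_\epsilon^2}}$$

So the sign of the correlation is the sign of $a$, and $|\rho_{X,Y}| = 1$ only when there is no noise ($\sigma_\epsilon = 0$). The larger the noise relative to $|a|\,\sigma_X$, the closer the correlation gets to 0.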

A good way to visualize whether there is a correlation between two samples is to look at a scatter plot. To create a few examples, I decided to use MATLAB as we can generate random processes and scatter plots very easily. So I will create 4 processes:

• A basic $X \sim \mathcal{N}(0,1)$ of size $n=1000$
• A process which is a linear combination of $X$.
• A process which is not a linear combination of $X$.
• And another process $Y \sim \mathcal{N}(0,1)$ of size $n=1000$ independent from $X$.

    % Parameter definition
    n = 1000;
    % Process definitions
    x = randn(n,1);          % X ~ N(0,1)
    linear_x = 5 + 2*x;      % linear function of X
    squared_x = x.^2;        % nonlinear function of X
    y = randn(n,1);          % Y ~ N(0,1), independent of X
    % Plotting: one figure per pair of samples
    figure();
    scatter(x, y);
    figure();
    scatter(x, squared_x);
    figure();
    scatter(x, linear_x);


The script presented above generates 3 different scatter plots, which I will now present and comment on. First, let’s take the example of the two independent processes X and Y. In this instance, the true correlation is zero, since independent processes are uncorrelated. This gives a scatter plot like this:

Now, let’s look at the scatter plot for the process which is a linear combination of X. Obviously, this is an example of two processes with positive correlation (actually, perfect correlation).

Graphically, we can say that there is correlation between two samples if the points form a line on the scatter plot. If the line has a positive slope (it goes from bottom-left to top-right), the correlation will be positive. If it has a negative slope (it goes from top-left to bottom-right), the correlation will be negative. The magnitude of the correlation is determined by how closely the points lie on a straight line. In the example above, all the points lie perfectly on a straight line. So a general framework to “estimate” correlation from a scatter plot would be the following:

• Do the points lie approximately on a straight line?
    • If no, then there is no linear correlation, stop here.
    • If yes, then there is correlation and continue:
• Is the slope of the line positive or negative?
    • If positive, then the correlation $r > 0$.
    • If negative, then the correlation $r < 0$.
    • If there is no slope (vertical or horizontal straight line), then one of the samples is constant: the covariance is zero and $r$ is not defined, because one of the standard deviations is zero. Treat this as no correlation.
• How closely do the points follow a straight line?
    • The closer they are to a straight line, the closer $|r|$ is to 1.

In the example above, the points look like a straight line, the slope is positive, and the points are very much on a straight line, so $r \simeq 1$.

Finally, let’s look at the third process which is simply the squared values of X, so the relationship is not linear:

If we apply the decision framework presented above, we can clearly see that the points do not lie on a straight line, and hence we conclude that there is no linear correlation between the samples, even though the second sample is entirely determined by the first.
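The same conclusion can be checked on paper. This is a short sketch using the standard moments of $X \sim \mathcal{N}(0,1)$, namely $E[X] = 0$, $E[X^2] = 1$ and $E[X^3] = 0$:

$$\text{Cov}(X, X^2) = E[X^3] - E[X]\,E[X^2] = 0 - 0 \times 1 = 0$$

So the population correlation between $X$ and $X^2$ is exactly zero, and any nonzero sample value comes from sampling noise only.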

Now let’s look at the values given by MATLAB for the correlations:

| Samples | Correlation |
|---|---|
| Independent processes X and Y | 0.00 |
| X and Linear Combination of X | 1.00 |
| X and squared X | 0.02 |

As we can see, MATLAB confirms what we determined looking at the scatter plots.
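The post does not show the commands behind these numbers. One way to obtain them, as a sketch, is the built-in `corrcoef`; the seed is my own arbitrary choice, so the exact values will differ from the ones reported above.

```matlab
% Sketch: sample correlations for the three pairs of processes.
rng(1);                      % arbitrary seed, only for repeatability
n = 1000;
x = randn(n,1);
linear_x = 5 + 2*x;
squared_x = x.^2;
y = randn(n,1);

% corrcoef returns a 2-by-2 matrix; the off-diagonal entry is r
R_xy  = corrcoef(x, y);
R_lin = corrcoef(x, linear_x);
R_sq  = corrcoef(x, squared_x);

fprintf('X and Y:         %.2f\n', R_xy(1,2));
fprintf('X and linear X:  %.2f\n', R_lin(1,2));
fprintf('X and squared X: %.2f\n', R_sq(1,2));
```

Only the linear case is deterministic: its correlation is 1 up to floating-point error, whatever the seed.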

I now want to explain a very important concept in the Level II curriculum: the statistical significance of an estimate. Quite simply, given that we estimated some value $\hat{b}$, we say that this estimate is statistically significant at confidence level $(1-\alpha)$ if we manage to reject the hypothesis $H_0 : b=0$ at that confidence level. To do so, we perform a statistical test, which depends on the value we are trying to estimate.

For this post, we would like to determine whether the sample correlation $r$ we estimate is statistically significant. There is a simple test given by the CFA Institute:

$$t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}$$

Under the null hypothesis, this variable $t$ follows a Student’s t-distribution with $n-2$ degrees of freedom. This means that, if you know $\alpha$, you can simply look up the critical value $t_c$ in the distribution table and reject the hypothesis $H_0 : \rho=0$, where $\rho$ is the true (population) correlation, if $t < -t_c$ or $t > t_c$. Although you are supposed to know how to compute this test yourself, you can also be given the p-value of the estimated statistic. Recall from Level I that the p-value is the minimal $\alpha$ for which $H_0$ can be rejected. Quite simply, if the p-value is smaller than the $\alpha$ you consider for the test, you can reject the null hypothesis $H_0$ and conclude that the estimated value is statistically significant.
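Here is a minimal MATLAB sketch of the test on two independent samples. The seed and the hard-coded critical value `t_c` (about 1.962 for a two-sided 5% test with 998 degrees of freedom, as read from a t-table) are my assumptions. The second output of `corrcoef` gives the p-value for the hypothesis of no correlation.

```matlab
% Sketch: significance test of a sample correlation.
rng(1);                      % arbitrary seed, only for repeatability
n = 1000;
x = randn(n,1);
y = randn(n,1);              % independent of x by construction

[R, P] = corrcoef(x, y);     % R: correlations, P: p-values for H0 of no correlation
r = R(1,2);
p = P(1,2);

% Test statistic from the formula above, with n-2 degrees of freedom
t = r * sqrt(n - 2) / sqrt(1 - r^2);

% Assumed critical value, taken from a t-table rather than computed here
t_c = 1.962;
reject_by_t = abs(t) > t_c;  % decision using the critical value
reject_by_p = p < 0.05;      % decision using the p-value

fprintf('r = %.4f, t = %.4f, p = %.4f\n', r, t, p);
fprintf('reject H0 (critical value): %d, reject H0 (p-value): %d\n', ...
        reject_by_t, reject_by_p);
```

Both decisions should agree, since comparing $|t|$ to $t_c$ and comparing the p-value to $\alpha$ are two views of the same test.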

Let’s look at the p-values MATLAB gives me for the correlations we discussed in the example:

| Samples | Correlation | P-Value |
|---|---|---|
| Independent processes X and Y | -0.01 | 0.77 |
| X and Linear Combination of X | 1.00 | 0.00 |
| X and squared X | 0.02 | 0.53 |

If we consider an $\alpha$ of 5%, we can see that we cannot reject the null hypothesis for the independent and squared processes, so their correlations are not statistically significant. On the other hand, the correlation with the linear combination of X has a p-value close to 0, which means that it is statistically significant for virtually any $\alpha$.
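As a quick check by hand, take the rounded value $r = 0.02$ reported for X and squared X, with $n = 1000$. Because $r$ is rounded to two decimals, the result is only approximate:

$$t = \frac{0.02 \times \sqrt{998}}{\sqrt{1 - 0.02^2}} \approx \frac{0.02 \times 31.59}{0.9998} \approx 0.63$$

With 998 degrees of freedom, the two-sided critical value at 5% is about $t_c \approx 1.96$. Since $|t| \approx 0.63 < 1.96$, we cannot reject $H_0$, which agrees with the reported p-value of 0.53 being well above 0.05.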

One last word on the important points to keep in mind about correlation:

• Correlation detects only a linear relationship; not a nonlinear one.
• Correlation is very sensitive to outliers (“weird” points, often erroneous, included in the datasets).
• Correlation can be spurious; you can detect a statistically significant correlation even though there is no economic rationale behind it.
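The sensitivity to outliers is easy to illustrate with a small MATLAB sketch. The outlier at (50, 50) and the seed are hypothetical choices of mine, standing in for something like a data-entry error.

```matlab
% Sketch: effect of a single outlier on the sample correlation.
rng(1);                      % arbitrary seed, only for repeatability
n = 1000;
x = randn(n,1);
y = randn(n,1);              % independent of x, so r should be near zero

R_clean = corrcoef(x, y);

% Append one extreme, hypothetical point to both samples
x_out = [x; 50];
y_out = [y; 50];
R_out = corrcoef(x_out, y_out);

fprintf('r without outlier:  %.2f\n', R_clean(1,2));
fprintf('r with one outlier: %.2f\n', R_out(1,2));
```

With these sizes, the single point adds about 2500 to the sum of cross-products and to each sum of squares, against roughly 1000 from the other 1000 points, so I would expect $r$ to jump from near 0 to roughly 0.7.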

I hope you enjoyed this first post on the CFA Level II, and I’ll be back shortly with more!
