Correlation: Basics, a Feel for It, and Usability
When we hear or talk about correlation, it is usually the Pearson correlation, which is valid when the data can reasonably be described by a normal distribution.
Correlation is a statistical relationship between two or more random variables or collected data values. It is closely related to the degree of dependence and gives us a mathematical measure of it.
Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.
corr(X, Y) = cov(X, Y) / (σX * σY), where cov(X, Y) = E[(X - E[X]) * (Y - E[Y])]
In the above equation E is the expected value, cov is the covariance, σX and σY are the standard deviations of X and Y, and corr is the Pearson correlation.
If we have Microsoft Excel, then we can calculate correlation directly in a spreadsheet.
For example, having two datasets of 100 elements each, series A in column A and series B in column B, we can calculate the correlation between the A and B series with the following formula:
=CORREL(A1:A100, B1:B100)
Many other statistical applications support the calculation of correlation by default.
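Outside of Excel, the same Pearson coefficient can be computed with a few lines of Python, for example with NumPy (a small illustrative sketch using made-up series rather than data from this post):

import numpy as np

# Two illustrative series of 100 values each (stand-ins for columns A and B).
rng = np.random.default_rng(seed=0)
a = rng.normal(size=100)
b = 0.5 * a + rng.normal(scale=0.5, size=100)  # b is partially related to a

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# element is the Pearson correlation between the two series.
print(np.corrcoef(a, b)[0, 1])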
The correlation between two sets of data can range from -1 to +1.
Imagine that we know the correlation between two sets of data is very high, say 0.9.
Dataset "A" is the daily change in the stock market index, dataset "B" is a known dataset derived from our market knowledge, and we know the next element of dataset "B" in advance.
Then we might have a good shot at predicting the next daily change in the stock market.
What does a given value of the correlation coefficient mean? Which numbers count as a big, good correlation and which are not meaningful? To get a better feel for it, we present the following samples:
1. Suppose we have two sets of data, X and Y, X being the independent variable and Y the dependent variable, and the relationship between the X data and the Y data can be described by the following linear equation:
Y = A*X + B, where A and B are constants.
In this case Y is 100% determined by the X data (no X value maps to two different Y values), and the correlation is 1, assuming that A > 0.
This is simply the equation of a straight line with positive slope.
2. Now suppose the two sets of data, X and Y, are related by the same kind of linear equation:
Y = A*X + B, where A and B are constants.
Y is again 100% determined by the X data, and in this case the correlation is -1, assuming that A < 0.
This is the equation of a straight line with negative slope.
3. If we have two datasets X and Y and both are random data, then the correlation between X and Y will be zero, or close to zero when we only have a limited amount of data. A short sketch covering these three cases follows below.
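As a quick sketch of cases 1 to 3 (with arbitrary made-up numbers, only for illustration), the Pearson coefficient comes out as +1 for a positive slope, -1 for a negative slope and close to zero for independent random series:

import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.normal(size=200)

y_pos = 2.0 * x + 3.0          # case 1: A > 0, exact linear relationship
y_neg = -1.5 * x + 3.0         # case 2: A < 0, exact linear relationship
y_rand = rng.normal(size=200)  # case 3: independent random data

print(np.corrcoef(x, y_pos)[0, 1])   # +1.0
print(np.corrcoef(x, y_neg)[0, 1])   # -1.0
print(np.corrcoef(x, y_rand)[0, 1])  # close to 0 for a limited sample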
4. Probably a bit more interesting for us as a sample is the periodic function Y = SIN(X).
Certain economic and market movements show a periodic nature; in fact there are many different cycles at work in the market at any specific time.
The Y = SIN(X) periodic curve can be constructed from its four quadrants:
4/1 When X goes from 0 to T/4
4/2 When X goes from T/4 to T/2
4/3 When X goes from T/2 to 3/4 * T
4/4 When X goes from 3/4 * T to T
T being the cycle period time.
When we only have data for X going from 0 to T/4, the correlation we get between X and Y will be very close to +1, because Y rises almost linearly with X over that quadrant.
When we only have data for X going from T/4 to T/2, the correlation we get between X and Y will be very close to -1.
When we have data for X going from some value A to some value B, and the difference between A and B is not a multiple of the period T, the correlation can come out either negative or positive, depending on how much X data we use in the calculation.
We can see that the correlation we get depends on how much data we use in the calculation and on how that data relates to the complete cycle from 0 to T, T being the cycle period time.
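This window dependence is easy to reproduce. The following sketch assumes a period of T = 2*pi for Y = SIN(X) (an assumption for illustration) and computes the correlation over the first quadrant, the second quadrant and one arbitrary window:

import numpy as np

T = 2 * np.pi  # assumed period for Y = sin(X)

def corr_on_window(start, stop, n=1000):
    x = np.linspace(start, stop, n)
    y = np.sin(x)
    return np.corrcoef(x, y)[0, 1]

print(corr_on_window(0, T / 4))      # close to +1 (rising quadrant)
print(corr_on_window(T / 4, T / 2))  # close to -1 (falling quadrant)
print(corr_on_window(0, 0.8 * T))    # somewhere in between, window dependent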
For this reason we very often use a rolling correlation, where we use the last N data points to calculate the correlation; it then changes over time with the same cycle frequency as the cycle frequency of the underlying data.
Very often the change of the correlation over time is actually more important than the correlation level itself, as it points to changes in the perceived dependency or relationship between the datasets in question.
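A rolling correlation over the last N data points can be computed, for example, with pandas (an illustrative sketch; the window length of 50 and the noisy sine series are arbitrary choices, not from this post):

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
x = pd.Series(np.linspace(0, 4 * np.pi, 500))
y = np.sin(x) + 0.1 * rng.normal(size=500)

# Correlation of the last 50 observations, recomputed at every step;
# it oscillates with the cycle of the underlying data.
rolling_corr = x.rolling(window=50).corr(y)
print(rolling_corr.tail())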
As one example:
Today the economy is very dependent on energy, especially on oil, so there is a relatively high correlation between oil consumption and economic activity.
If we look at a long time period and notice that this high correlation is decreasing, we might conclude that something has changed. How would that be possible? For example, if researchers discovered a completely new, practically unlimited energy source, or if the dependency on oil dropped because of shifts in energy sources, it would show up on our correlation charts. About 100 years ago coal had a similar importance to what oil has today, but that dependency has changed over time.
5. As another correlation sample we created a table which shows the correlation between two datasets, where the two datasets were constructed in the following way:
First we created two datasets, dataset A and dataset B, both of them using a random number generator.
So the starting point is a correlation of zero (or close to that) between these two datasets.
Then we changed the data in both dataset A and dataset B so that 5% of the data within A and B had a correlation of 1, being completely correlated, while 95% of the data within datasets A and B remained the original random series.
With these modified datasets we calculated the overall correlation.
Then we modified the A and B datasets so that 10% of the data between A and B was completely correlated and the remaining 90% stayed random, and recalculated the overall correlation coefficient for all the data.
We then kept going, increasing the portion of completely correlated data by 5% in every step until all the data was 100% correlated, each time calculating the overall correlation for the whole mixed data, containing both random series and completely correlated pairs.
The final results are presented in the following table:
Looking at the data we see that, for example, when 80% of the data within the A and B series was random and 20% was perfectly correlated, the overall correlation came out as 0.568.
When 50% of the data in the A and B series was random, the overall correlation was 0.8448, a very high number.
The correlation does not increase linearly as we increase the percentage of perfectly correlated pairs within the A and B data series. A sketch of this kind of construction is shown below.
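One possible way to set up such an experiment is sketched here. The exact construction and scaling behind the table above is not shown in this sketch, and the overall correlation depends heavily on how the perfectly correlated portion is scaled relative to the random portion, so the numbers produced by this sketch will not necessarily match the table:

import numpy as np

rng = np.random.default_rng(seed=3)
n = 1000

def mixed_correlation(correlated_fraction):
    # Build A and B where a given fraction of the pairs is identical
    # (perfectly correlated) and the rest is independent random data.
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    k = int(correlated_fraction * n)
    b[:k] = a[:k]  # make the first k pairs perfectly correlated
    return np.corrcoef(a, b)[0, 1]

for frac in np.arange(0.05, 1.0001, 0.05):
    print(f"{frac:.0%} correlated -> overall correlation {mixed_correlation(frac):.4f}")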
Some conclusions on using correlation calculations as a way to discover relationships and dependencies between data series:
- Correlation might depend on the amount of data we are investigating. To evaluate that effect, we can also calculate the correlation using only half of the data, or some other percentage of the data (30%, 40%...), and compare the result with the correlation calculated on the whole dataset (see the short sketch after this list).
- Correlation might depend on time, and we might actually get important clues about market shifts when the correlation is changing over time.
- Stating that a correlation is high or low has meaning only in relative terms. It depends on our purpose or objective, on how we want to use the data, and on what the historical correlation is (if time is involved).
- Correlation is only one way of looking at our data. It might not reveal other important patterns that a different view of the data would, so it should not be the only way we examine our data, as it compresses a lot of information into one number.
- Correlation calculation has a lot of applications in finance, not only in economic research but also in pairs trading, sector allocation, portfolio design and management...
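As an illustration of the first point (a sketch with hypothetical series a and b; the subset fractions 30%, 40% and 50% are arbitrary choices):

import numpy as np

rng = np.random.default_rng(seed=4)
a = rng.normal(size=500)
b = 0.6 * a + rng.normal(size=500)

# Compare the correlation on partial data with the correlation on all data.
full = np.corrcoef(a, b)[0, 1]
for frac in (0.3, 0.4, 0.5):
    k = int(frac * len(a))
    subset = np.corrcoef(a[:k], b[:k])[0, 1]
    print(f"first {frac:.0%} of the data: {subset:.3f} (whole dataset: {full:.3f})")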