Prediction Wizard


Correlation: Basics, feel for it and usability.

2010/05/09. - 14:01

 

When we hear or talk about correlation, it is usually the Pearson correlation, which measures linear dependence and is most meaningful when our data can be described reasonably well by a normal distribution.

Correlation is a statistical relationship between two or more random variables or sets of collected data values. It expresses the degree of dependence between them as a single mathematical measure.

Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.

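corr(X, Y) = cov(X, Y) / (σX · σY) = E[(X − μX)(Y − μY)] / (σX · σY)

In the above equation, E is the expected value, cov is the covariance, and corr is Pearson's correlation; μX and μY are the means and σX and σY the standard deviations of X and Y.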
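The definition of Pearson's correlation as the covariance divided by the product of the two standard deviations can be sketched directly in code. This is a minimal illustration, not a reference implementation; the function name `pearson` and the sample values are assumptions made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: average product of the deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample data: b is roughly twice a, with small deviations.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson(a, b)
print(round(r, 4))
```

Because the same divisor n is used for the covariance and for both variances, it cancels out, so the result does not depend on whether population or sample formulas are used.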

If we have Microsoft Excel, then we can calculate the correlation with it.

For example, with two datasets of 100 elements each, A in column A and B in column B, we can calculate the correlation between the A and B series using the following formula:

CORREL(A1:A100, B1:B100)

Many other statistical applications  support  the calculation of correlation by default.

The correlation between two sets of data always lies between -1 and +1.

Imagine that we know the correlation between two sets of data is very high, say 0.9.

Dataset "A" is the daily change in the stock market index, dataset "B" is a known dataset derived from our market knowledge, and we know the next element of dataset "B" in advance.

Then we might have a good shot at predicting the next daily change in the stock market.

What does a given value of the correlation coefficient mean? Which numbers indicate a strong, useful correlation, and which are not meaningful? To give a better feel, we present the following examples:

 

1. Suppose we have two sets of data, X and Y, X being the independent variable and Y the dependent variable, and the relationship between them can be described by the following linear equation:

Y = A*X + B, where A and B are constants.

In this case Y is 100% dependent on X: every X value has exactly one Y value, and the correlation is 1, assuming that A > 0.

This is the equation of a straight line with positive slope.

2. Suppose again that we have two sets of data, X and Y, X being the independent variable and Y the dependent variable, related by the same linear equation:

Y = A*X + B, where A and B are constants.

Y is again 100% dependent on X and every X value has exactly one Y value, but this time the correlation is -1, assuming that A < 0.

This is the equation of a straight line with negative slope.

3. If we have two datasets X and Y and both are independent random data, then the correlation between X and Y will be zero in theory, and close to zero in practice, because we only ever have a limited set of data.
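The first three cases can be checked numerically with a short sketch. The constants, the sample size, the seed and the helper name `pearson` are arbitrary assumptions chosen for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(0)  # fixed seed so the run is repeatable
x = [random.uniform(0, 10) for _ in range(1000)]

y_up = [2.0 * v + 1.0 for v in x]      # case 1: Y = A*X + B with A > 0
y_down = [-2.0 * v + 1.0 for v in x]   # case 2: Y = A*X + B with A < 0
y_rand = [random.uniform(0, 10) for _ in range(1000)]  # case 3: unrelated data

r_up = pearson(x, y_up)
r_down = pearson(x, y_down)
r_rand = pearson(x, y_rand)
print(r_up, r_down, r_rand)
```

The two linear cases give +1 and -1 up to floating-point rounding, whatever the values of A and B, while the random case gives a small value that moves closer to zero as the sample grows.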

 

4. A more interesting example for us is probably the periodic function Y = SIN(X).

Certain economic and market movements are periodic in nature. In fact, many different cycles are in effect in the market at any given time.

 

The periodic curve Y = SIN(X) can be split into four quarters of its cycle:

4/1 When X goes from 0 to T/4

4/2 When X goes from T/4 to T/2

4/3 When X goes from T/2 to 3/4*T

4/4 When X goes from 3/4*T to T

 

T being the cycle period.

When we only have data with X going from 0 to T/4, the correlation we get between X and Y is strongly positive, close to 1 (about 0.98 for evenly spaced X; it is not exactly 1 because the curve is not a straight line).

When we only have data with X going from T/4 to T/2, the correlation we get between X and Y is strongly negative, close to -1.

When we have data with X going from a value A to a value B, and the difference between A and B is not a multiple of the period T, then the correlation can be either negative or positive, depending on how much X data we use in the calculation.
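How the correlation between X and SIN(X) depends on the part of the cycle we look at can be tried out with a small sketch. It assumes evenly spaced X values and T = 2*pi; the helper names `pearson` and `corr_on` are made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

T = 2 * math.pi  # period of sin(x)

def corr_on(start, end, n=1000):
    """Correlation between X and sin(X) for n evenly spaced X in [start, end]."""
    xs = [start + (end - start) * i / (n - 1) for i in range(n)]
    ys = [math.sin(v) for v in xs]
    return pearson(xs, ys)

q1 = corr_on(0, T / 4)        # first quarter: sine rising
q2 = corr_on(T / 4, T / 2)    # second quarter: sine falling
half = corr_on(0, T / 2)      # first half cycle: rise and fall cancel out
full = corr_on(0, T)          # one complete cycle

print(q1, q2, half, full)
```

With these assumptions the first quarter gives roughly +0.98, the second roughly -0.98, the first half cycle a value near zero, and a complete cycle roughly -0.78, so the same underlying relationship produces very different coefficients depending on the window.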

 

We can see that the correlation we get depends on how much data we use in the calculation and on how that data relates to the complete cycle from 0 to T, T being the cycle period.

For this reason we very often use a rolling correlation: the correlation is calculated from the last N data points and recalculated as new data arrive. For cyclical data it changes over time with the same cycle frequency as the underlying data.

Very often the change in the correlation over time is more important than the correlation level itself, because it points to changes in the perceived dependency or relationship between the datasets in question.
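A rolling correlation over the last N points can be sketched as follows. The function name `rolling_correlation`, the window length and the synthetic sine series are assumptions for illustration only, not the method behind any particular charting tool.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rolling_correlation(xs, ys, window):
    """Correlation of the most recent `window` points at each step.

    The first value uses points 0..window-1, the next 1..window, and so on.
    """
    if window < 2 or window > len(xs):
        raise ValueError("window must be between 2 and the series length")
    return [
        pearson(xs[end - window:end], ys[end - window:end])
        for end in range(window, len(xs) + 1)
    ]

# Synthetic example: time steps against a sine wave with a 100-step period.
period = 100
t = list(range(300))
y = [math.sin(2 * math.pi * step / period) for step in t]

rolling = rolling_correlation(t, y, window=25)  # window = a quarter period
print(len(rolling), round(min(rolling), 3), round(max(rolling), 3))
```

In this synthetic case the rolling value swings between strongly negative and strongly positive as the window moves through the cycle, which is the time variation described above.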

As one example:

Today the economy is very dependent on energy, especially on oil, so there is a relatively high correlation between oil consumption and economic activity.

If we look at a long time period and notice that this high correlation is decreasing, we might conclude that something has changed. How would that be possible? For example, if researchers discovered a completely new, unlimited energy source, or if the dependency on oil consumption fell because of shifts between energy sources, then it would show up on our correlation charts. About 100 years ago coal had an importance similar to that of oil today, but that dependency has changed over time.

 

5. As another illustration of correlation we created a table showing the correlation between two datasets that were constructed in the following way:

First we created two datasets, dataset A and dataset B, both of them with a random number generator.

So the starting point is a correlation of zero (or close to it) between these two datasets.

Then we changed the data in both dataset A and dataset B so that 5% of the data within A and B had a correlation of 1, being completely correlated, while 95% of the data within A and B remained the original random series.

With these modified datasets we calculated the overall correlation.

Then we modified the A and B datasets again, so that 10% of the data between A and B was completely correlated and the remaining 90% stayed random, and recalculated the overall correlation coefficient for all the data.

We repeated this procedure, increasing the completely correlated portion by 5% in every step until all the data were 100% correlated, and at each step calculated the overall correlation for the whole mixed dataset, which contained both random series and completely correlated pairs.

The final results are presented in the following table:

 

[Table image "Correlation_Example5_Table": overall correlation for each percentage of perfectly correlated pairs]

 

Looking at the data, we see for example that when 80% of the data within the A and B series were random and 20% were perfectly correlated, the overall correlation was 0.568.

When 50% of the data in the A and B series were random, the overall correlation was 0.8448, a very high number.

The overall correlation does not increase linearly with the percentage of perfectly correlated pairs within the A and B data series.
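The exact construction behind the table is not given, so the sketch below is only one possible reading of it, with made-up names (`mixed_correlation`) and parameters: both series are uniform random numbers, and a chosen fraction of B is overwritten with the matching values of A. Under that particular assumption the overall correlation comes out close to the fraction itself, so it will not reproduce the figures quoted above; a faster rise like the one in the table needs a different construction, for example correlated values with a wider spread than the random ones.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mixed_correlation(fraction, n=20000, seed=0):
    """Overall correlation when `fraction` of the pairs are identical.

    Assumed construction (not taken from the article): A and B start as
    independent uniform random series, then the first `fraction` of B is
    replaced by the corresponding values of A.
    """
    rng = random.Random(seed)
    a = [rng.random() for _ in range(n)]
    b = [rng.random() for _ in range(n)]
    k = int(round(fraction * n))
    for i in range(k):
        b[i] = a[i]  # this pair is now perfectly correlated
    return pearson(a, b)

# Raise the correlated share in 5% steps, as in the procedure described above.
for pct in range(0, 101, 5):
    print(pct, round(mixed_correlation(pct / 100), 3))
```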

 

 

Some conclusions about using correlation calculations to discover relationships and dependencies between data series:

 

- Correlation might depend on the amount of data we are investigating. To evaluate that effect, we can also calculate the correlation using only half of the data, or some other percentage of it (30%, 40%...), and compare the result with the correlation calculated on the whole dataset.

- Correlation might depend on time, and we might get important clues about market shifts when the correlation changes over time.

- Stating that a correlation is high or low is meaningful only in relative terms. It depends on our purpose or objective, on how we want to use the data, and on the historical correlation (if time is involved).

- Correlation is only one way of looking at our data. It might not reveal important patterns that other ways of looking at the data would reveal, so it should not be the only way we examine our data, as it compresses a lot of information into one number.

- Correlation calculations have many applications in finance, not only in economic research but also in pairs trading, sector allocation, portfolio design and management...
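The first conclusion, comparing the correlation on part of the data with the correlation on all of it, can be sketched as below. The function name `correlation_by_fraction`, the chosen fractions and the sine test series are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_by_fraction(xs, ys, fractions=(0.3, 0.5, 1.0)):
    """Correlation computed on the first 30%, 50% and 100% of the data."""
    results = {}
    for f in fractions:
        k = max(2, int(len(xs) * f))  # at least two points are needed
        results[f] = pearson(xs[:k], ys[:k])
    return results

# Synthetic series: one full sine cycle sampled at 100 time steps.
t = list(range(100))
y = [math.sin(2 * math.pi * step / 100) for step in t]

by_fraction = correlation_by_fraction(t, y)
for f, r in by_fraction.items():
    print(f, round(r, 3))
```

On this cyclical series the three values differ sharply in size and even in sign, which is a hint that a single coefficient computed on the whole dataset should not be trusted on its own.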

 

© 2010 - PredictionWizard.com
