Four Methods to Statistically Measure Your Data Correlation

Image by https://elements.envato.com/

Correlation is a measure of the association between variables: it describes to what extent one variable changes when another variable changes. In this article, we explain the importance of data correlation in machine learning and introduce four common methods for calculating it.

Why Is Data Correlation Important?

Understanding which variables affect the target we want to predict allows us to choose the right factors to investigate and include in our model, which can significantly improve model performance. We would also like to compare these specific variables to all other variables within the dataset, and even across different machine learning problems for reference. Therefore, it is very useful to have a measurable representation of the association between variables. In this short article, we will discuss four of the most common methods of measuring data correlation.

1. Pearson Correlation

It is a measure of the linear correlation between two sets of numeric data. It can take values between -1 and 1. A value of 1 indicates a perfect positive relationship between the two variables, a value of -1 indicates a perfect negative relationship, and a value of 0 indicates no linear relationship between the two variables.

How to Calculate:
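
The coefficient is the covariance of the two variables divided by the product of their standard deviations. As a minimal sketch, assuming two plain Python lists of equal length (the function name and the sample data are illustrative and do not come from the original article):

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Illustrative data: y is exactly 2 * x, a perfect positive linear relationship
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]
print(pearson_correlation(hours, scores))  # 1.0
```

In practice a library routine would normally be used; the hand-written version is only meant to make the calculation visible.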

2. Spearman Correlation

It is a measure of the linear correlation between the ranks of two sets of numeric data. It is very similar to the Pearson Correlation, except that instead of measuring the correlation of the values directly, it measures the Pearson Correlation of the rankings of the variables.

The ranking of a variable is the assignment of an ordering (1, 2, 3, …) to the observations of the variable in question, from smallest to largest. As an example, the ranking of the data [5, 15, 6, 20] would be [1, 3, 2, 4].

Unlike the Pearson Correlation, the Spearman Correlation can measure non-linear relationships between variables, but still requires the relationship to be monotonic. It can take values between -1 and 1. A value of 1 indicates a perfect increasing monotonic relationship between the two variables, a value of -1 indicates a perfect decreasing monotonic relationship, and a value of 0 indicates no monotonic relationship between the two variables.

How to Calculate:
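
One way to compute it is to rank both variables and then take the Pearson Correlation of the ranks. The sketch below does exactly that in plain Python. The helper names are illustrative, and giving tied values the average of their ranks is a common convention assumed here, not something the article specifies.

```python
import math

def rank(values):
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_correlation(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))

print(rank([5, 15, 6, 20]))  # [1.0, 3.0, 2.0, 4.0]

# Illustrative data: y = x ** 3 is monotonic but not linear
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(spearman_correlation(x, y))  # 1.0
print(pearson(x, y))               # less than 1, since the relationship is not linear
```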

3. Correlation Ratio

It is a measure of the correlation between a categorical column and a numeric column. It measures how much of the variance in the numeric column is explained by differences between the means of the categories in the categorical column. Unlike the Pearson and Spearman Correlations, the Correlation Ratio is unable to measure the “direction” of the correlation, and can only measure its magnitude. It can take values between 0 and 1. A value of 1 indicates that the variance in the numeric data is entirely due to differences between the categories. A value of 0 indicates that all categories have the same mean, so the categories explain none of the variance in the numeric data.

How to Calculate:
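
A minimal sketch, assuming the usual definition in which the Correlation Ratio is the square root of the between-category sum of squares divided by the total sum of squares. The function name, the sample data, and the choice to return 0 for a constant numeric column are illustrative assumptions.

```python
import math
from collections import defaultdict

def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric sequence."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    overall_mean = sum(values) / len(values)
    # Between-category variation: squared distance of each category mean
    # from the overall mean, weighted by the category size
    between = sum(
        len(group) * (sum(group) / len(group) - overall_mean) ** 2
        for group in groups.values()
    )
    # Total variation of the numeric column
    total = sum((value - overall_mean) ** 2 for value in values)
    if total == 0:
        return 0.0  # assumed convention: a constant column has nothing to explain
    return math.sqrt(between / total)

# Illustrative data: the category fully determines the value
print(correlation_ratio(["a", "a", "b", "b"], [1, 1, 5, 5]))  # 1.0
# Illustrative data: both categories have the same mean
print(correlation_ratio(["a", "a", "b", "b"], [1, 5, 1, 5]))  # 0.0
```

The squared value of the result is the share of the total variance that is explained by the categories.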

4. Cramer’s V

It is a measure of the correlation between two categorical columns. Based on the chi-squared statistic, Cramer’s V scales the chi-squared value to a fraction of its maximum possible value. It can take values between 0 and 1, with 1 indicating a complete association between the two variables, and 0 indicating no association.

How to Calculate:
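
A minimal sketch, assuming the standard definition V = sqrt(chi-squared / (n * min(r - 1, k - 1))), where n is the number of observations and r and k are the numbers of distinct categories in the two columns. The function names and sample data are illustrative, no continuity correction is applied, and returning 0 when a column has a single category is an assumed convention.

```python
import math
from collections import Counter

def chi_squared(x, y):
    """Pearson chi-squared statistic of the contingency table of x and y."""
    n = len(x)
    row_totals = Counter(x)
    col_totals = Counter(y)
    observed = Counter(zip(x, y))
    chi2 = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            # Count expected in this cell if the two columns were independent
            expected = row_total * col_total / n
            chi2 += (observed[(row, col)] - expected) ** 2 / expected
    return chi2

def cramers_v(x, y):
    """Uncorrected Cramer's V between two categorical sequences."""
    n = len(x)
    r, k = len(set(x)), len(set(y))
    if min(r, k) < 2:
        return 0.0  # assumed convention: undefined with a single category
    return math.sqrt(chi_squared(x, y) / (n * min(r - 1, k - 1)))

# Illustrative data: each category of x maps to exactly one category of y
print(cramers_v(["a", "a", "b", "b"], ["u", "u", "v", "v"]))  # 1.0
# Illustrative data: the two columns are independent
print(cramers_v(["a", "a", "b", "b"], ["u", "v", "u", "v"]))  # 0.0
```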

Cramer’s V is a heavily biased estimator and tends to overestimate the strength of the association. Therefore, a bias correction is normally applied to the statistic.
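
The original formula is not reproduced in this text, so the sketch below assumes the widely used correction proposed by Bergsma (2013), which shrinks both the chi-squared term and the category counts by amounts that depend on the sample size. The function names and sample data are illustrative, and returning 0 when the corrected denominator is not positive is an assumed convention.

```python
import math
from collections import Counter

def chi_squared(x, y):
    """Pearson chi-squared statistic of the contingency table of x and y."""
    n = len(x)
    row_totals, col_totals = Counter(x), Counter(y)
    observed = Counter(zip(x, y))
    return sum(
        (observed[(row, col)] - row_total * col_total / n) ** 2
        / (row_total * col_total / n)
        for row, row_total in row_totals.items()
        for col, col_total in col_totals.items()
    )

def cramers_v_corrected(x, y):
    """Bias-corrected Cramer's V (assumed: the Bergsma 2013 correction)."""
    n = len(x)
    if n < 2:
        return 0.0
    r, k = len(set(x)), len(set(y))
    phi2 = chi_squared(x, y) / n
    # Subtract the expected value of phi2 under independence, floored at 0
    phi2_corrected = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    # Shrink the numbers of categories in the same spirit
    r_corrected = r - (r - 1) ** 2 / (n - 1)
    k_corrected = k - (k - 1) ** 2 / (n - 1)
    denominator = min(r_corrected - 1, k_corrected - 1)
    if denominator <= 0:
        return 0.0  # assumed convention for very small samples
    return math.sqrt(phi2_corrected / denominator)

# Illustrative data: a weak association in a sample of only six observations.
# The uncorrected statistic would be about 0.33; the correction shrinks it to 0.
x = ["a", "a", "a", "b", "b", "b"]
y = ["u", "u", "v", "u", "v", "v"]
print(cramers_v_corrected(x, y))  # 0.0
```

A perfect association is unaffected by the correction and still gives a value of 1, while weak associations in small samples are pulled towards 0.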

Conclusion

As we can see, there are many different ways of calculating the correlation between two variables to gain a better understanding of your data. However, it is time-consuming to manually choose the right approach and calculate the correlation for hundreds of variables. EvoML automatically calculates these correlation statistics and generates interactive visuals, so you can easily understand the relationships in your data while saving time for more important tasks, such as selecting the right features for better model performance.

Data Correlation Analysis on EvoML Platform

About the Author

Harvey Eaton Uy | TurinTech Research Team

Data Scientist with a Masters in Financial Economics and a Bachelor’s Degree in Mathematics

About TurinTech

TurinTech is the leader in Artificial Intelligence Optimisation. TurinTech empowers businesses to build efficient and scalable AI by automating the whole data science lifecycle with multi-objective optimisation. TurinTech enables organisations to drive AI transformation with minimum human effort, at scale and at speed.

TurinTech — AI. Optimised.

Originally published at https://turintech.ai.
