Computer Science, asked by lisahaydon8032, 1 year ago

Why is there no correlation between categorical variables?

Answers

Answered by sam7544

Explanation:

continuous variables

Outside Two Standard Deviations

Sep 13, 2018

The last few days I have been thinking a lot about different ways of measuring correlations between variables and their pros and cons. Here’s the problem: there are two kinds of variables, continuous and categorical (sometimes called discrete or factor variables), and hence we need one or more metrics that can quantify correlation or association between continuous-continuous, categorical-categorical and categorical-continuous variable pairs. Computing correlation can be broken down into two sub-problems: (i) testing whether there is a statistically significant correlation between two variables, and (ii) quantifying the association or ‘goodness of fit’ between the two variables. Ideally, we also need to be able to compare such goodness of fit metrics across variable pair classes on some universal scale. This becomes important if the matrix you are analyzing has a combination of categorical and continuous variables. In these cases, if you want a universal criterion for dropping columns above a certain correlation from further analyses, all the correlations you compute must be comparable. No single technique handles all three kinds of variable pairs, so building a universal scale for comparing correlations obtained from different methods is tricky and needs some thinking.

You might be wondering why anyone would ever need to compare correlation metrics between different variable types. In general, knowing whether two variables are correlated, and hence substitutable, is useful for understanding variance structures in data and for feature selection in machine learning. To expand: for data exploration and hypothesis testing, you want to understand the associations between variables. Additionally, for building efficient predictive models, you would ideally include only variables that uniquely explain some amount of variance in the outcome. In all these applications, you will likely be comparing correlations across continuous-continuous, categorical-categorical and continuous-categorical pairs, so a shared estimate of association between variable pairs is essential. Note that for all these applications, while a statistical significance test of correlation between the two variables is helpful, it is far more important to quantify the association in a comparable manner, i.e. to have a comparable ‘goodness of fit’ metric.

I was surprised that I did not find a comprehensive overview detailing correlation measurement between different kinds of variables, especially goodness of fit metrics, so I decided to write this up.

There has been a lot of focus on calculating correlations between two continuous variables, so I plan to only list some of the popular techniques for this pair. Of the three variable combinations, computing correlation between a categorical and a continuous variable is the most non-standard and tricky. Surprisingly (or maybe not so much), there is very little formal literature on correlating such variables. Hence, I plan to spend most of this post expanding on standard and non-standard ways to calculate such correlations. Finally, with the rise of categorical variables in datasets, it is important to calculate correlations between two categorical variables as well. Let us start with a discussion of computing correlation between two categorical variables.

Correlation between two discrete or categorical variables

Broadly speaking, there are two different ways to find association between categorical variables. One set of approaches relies on distance metrics such as Euclidean distance or Manhattan distance, while another set spans various statistical metrics such as the chi-square test or Goodman and Kruskal’s lambda, which were initially developed to analyze contingency tables. Now the mathematical purist out there could correctly argue that a distance metric cannot be a correlation metric, since correlation needs to be unit independent, which distance by definition can’t be. I do agree with that argument and I will point it out later, but for now I include distance metrics since many people use distance as a proxy for correlation between categorical variables. Additionally, in certain special situations there is an easy conversion between Pearson correlation and Euclidean distance.
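One such special situation, as far as I can tell, is when both variables have been standardized (z-scored): the squared Euclidean distance between them then equals 2n(1 - r), where n is the number of observations and r is the Pearson correlation. The sketch below checks this numerically. The function names and the sample data are made up for illustration, and the population standard deviation (dividing by n) is assumed throughout.

```python
import math

def zscore(values):
    """Standardize to mean 0 and population standard deviation 1 (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def pearson(x, y):
    """Pearson correlation as the mean product of z-scores."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def squared_euclidean(x, y):
    """Sum of squared element-wise differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Illustrative data only; not taken from the post.
x = [2.0, 4.0, 6.0, 8.0, 11.0]
y = [1.0, 3.0, 2.0, 7.0, 9.0]

r = pearson(x, y)
d2 = squared_euclidean(zscore(x), zscore(y))
identity_rhs = 2 * len(x) * (1 - r)

# The two quantities agree up to floating-point error.
print(d2, identity_rhs)
```

The identity only holds after standardization; on raw values the distance still depends on the units of measurement, which is exactly the purist's objection above.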

Below, I list some commonly used metrics within both approaches and end with a brief discussion of the relative strengths and weaknesses of the two.

Two Categorical Variables

Whether two categorical variables are independent can be checked with the chi-squared test of independence.
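As a minimal sketch of that test, the code below computes the chi-squared statistic and its degrees of freedom from a contingency table of counts, using only the standard library. It also returns Cramér's V, a common way to rescale the statistic to the range 0 to 1; that measure is not named in the text above and is included here as an assumption about what a comparable 'goodness of fit' number could look like. The function name and the example table are hypothetical.

```python
import math

def chi_square_independence(table):
    """Return (chi2, dof, cramers_v) for a contingency table.

    `table` is a list of rows, each a list of observed counts.
    No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    n_rows, n_cols = len(table), len(col_totals)
    dof = (n_rows - 1) * (n_cols - 1)
    # Cramér's V: an assumed effect-size measure, not taken from the post.
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return chi2, dof, cramers_v

# Hypothetical 2x2 table: rows are levels of one variable, columns of the other.
observed = [[30, 10],
            [10, 30]]
chi2, dof, v = chi_square_independence(observed)
print(chi2, dof, v)  # 20.0 1 0.5
```

To turn the statistic into a p-value you would compare it against a chi-squared distribution with `dof` degrees of freedom. In practice a library routine such as `scipy.stats.chi2_contingency` does this for you; note that it applies Yates' continuity correction to 2x2 tables by default, so its statistic can differ from the uncorrected value computed here.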
