Analysis of Two Categorical Variables
A crosstab (cross-tabulation, also called a contingency table) is the most common way to analyze two categorical variables. It counts how often each combination of categories occurs, which helps you identify the association between the two variables. Graphically, a crosstab can be displayed as a stacked bar chart.
The application of a crosstab is best understood by working on sample data, so let's jump into an example.
Analysis of the MBA Data continued…
A crosstab requires two categorical variables. Let us analyze the association between the MBA specialization chosen by students and their stream in 12th Standard. The corresponding fields in the data file are “mba_specialization” and “ten_plus_2_stream”.
Data Import in Python
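As a minimal sketch of the import step, assuming the MBA data is stored as a CSV file, `pd.read_csv` can load it into a DataFrame. The rows below are made up for illustration and are read from an in-memory string so that the snippet runs on its own; with the real data you would pass the path of your file instead. The name `mba_df` is an assumption, not a name from the original data set.

```python
import io

import pandas as pd

# Hypothetical stand-in for the MBA data file. These rows are made up for
# illustration; with the real file, pass its path to pd.read_csv instead.
sample_csv = io.StringIO(
    "mba_specialization,ten_plus_2_stream\n"
    "Finance,Commerce\n"
    "Marketing,Science\n"
    "HR,Commerce\n"
    "Marketing,Science\n"
)

mba_df = pd.read_csv(sample_csv)

# Quick sanity checks: dimensions and the first few records
print(mba_df.shape)
print(mba_df.head())
```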
Crosstab of MBA Specialization vs 12th Standard Stream
Before creating the crosstab, let us quickly do a univariate analysis of both fields.
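One way to do this univariate check is with `value_counts`, which gives the frequency of each category in a column. The sketch below uses a small made-up DataFrame (not the actual MBA data), and `mba_df` is an assumed name.

```python
import pandas as pd

# Made-up rows for illustration; not the actual MBA data
mba_df = pd.DataFrame({
    "mba_specialization": ["Finance", "Marketing", "HR", "Marketing", "Finance"],
    "ten_plus_2_stream": ["Commerce", "Science", "Commerce", "Science", "Commerce"],
})

# Frequency of each category; dropna=False also counts missing values, if any
spec_counts = mba_df["mba_specialization"].value_counts(dropna=False)
stream_counts = mba_df["ten_plus_2_stream"].value_counts(dropna=False)

print(spec_counts)
print(stream_counts)
```

Looking at these counts first is what reveals stray or rarely used labels that need cleaning before the crosstab is built.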
Data cleaning of stream
The stream field needs some data cleaning: a few individual cases carry labels other than the two main streams. We recategorize those cases into either the Commerce or the Science category.
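A minimal sketch of this recategorization, followed by the crosstab of absolute counts, might look like the following. The stray labels "PCM" and "Commerce with Maths", the mapping `stream_map`, and the rows themselves are all hypothetical; the actual stray values in the MBA data may differ.

```python
import pandas as pd

# Made-up rows; "PCM" and "Commerce with Maths" are hypothetical stray labels
mba_df = pd.DataFrame({
    "mba_specialization": ["Finance", "Marketing", "HR", "Marketing", "Finance", "Finance"],
    "ten_plus_2_stream": ["Commerce", "Science", "Commerce with Maths", "PCM", "Commerce", "Science"],
})

# Hypothetical mapping of stray labels to the two main categories
stream_map = {"PCM": "Science", "Commerce with Maths": "Commerce"}
mba_df["ten_plus_2_stream"] = mba_df["ten_plus_2_stream"].replace(stream_map)

# Crosstab of absolute counts; margins=True adds row and column totals ("All")
counts = pd.crosstab(
    mba_df["ten_plus_2_stream"],
    mba_df["mba_specialization"],
    margins=True,
)
print(counts)
```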
From the above crosstab, we observe that 57 Commerce students have taken Finance and, likewise, 23 Science students have taken Finance. We could make more such statements, but absolute counts alone do not give much insight, because the groups being compared may differ in size. To get insight, we must convert these counts into proportions such as row proportions or column proportions.
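In pandas, the `normalize` argument of `pd.crosstab` performs this conversion: `normalize="index"` gives row proportions and `normalize="columns"` gives column proportions. The sketch below uses made-up rows, so its numbers are not the ones from the MBA data.

```python
import pandas as pd

# Made-up rows for illustration; not the actual MBA data
mba_df = pd.DataFrame({
    "mba_specialization": [
        "Finance", "Finance", "Marketing", "HR",
        "Marketing", "Marketing", "Finance", "HR",
    ],
    "ten_plus_2_stream": [
        "Commerce", "Commerce", "Commerce", "Commerce",
        "Science", "Science", "Science", "Science",
    ],
})

stream = mba_df["ten_plus_2_stream"]
spec = mba_df["mba_specialization"]

# Row proportions: each row (stream) sums to 1
row_pct = pd.crosstab(stream, spec, normalize="index")

# Column proportions: each column (specialization) sums to 1
col_pct = pd.crosstab(stream, spec, normalize="columns")

print(row_pct.round(2))
print(col_pct.round(2))
```

With stream on the rows, row proportions answer "of the students from a given stream, what share chose each specialization?", while column proportions answer "of the students in a given specialization, what share came from each stream?".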
Inferences / Take away
Converting the absolute values of the crosstab into row proportions or column proportions provides more insight into the data.
- 47% of Science-stream students prefer to pursue an MBA in Marketing.
- 45% of Commerce-stream students are inclined to pursue an MBA in Finance.
- Only 5% of Science-stream students chose the HR specialization.
From the crosstab, it is clear that Science students tend to prefer the Marketing specialization, while Commerce students tend to prefer Finance.
A statistical test called the chi-square test of independence, also called the chi-square test of association, is applied to the crosstab data (contingency table) to determine whether there is a statistically significant relationship between the two categorical variables.
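As a hedged sketch of how such a test could be run in Python, SciPy provides `scipy.stats.chi2_contingency`, which takes a table of observed counts and returns the test statistic, the p-value, the degrees of freedom, and the expected counts under independence. The counts below are made up for illustration and are not the figures from the MBA data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts for illustration only; not the figures from the MBA data
observed = pd.DataFrame(
    {"Finance": [30, 15], "HR": [10, 5], "Marketing": [20, 40]},
    index=["Commerce", "Science"],
)

# Returns the statistic, p-value, degrees of freedom, and expected counts
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")

# A small p-value (commonly below 0.05) suggests the two variables are associated
if p_value < 0.05:
    print("Reject independence: stream and specialization appear associated.")
else:
    print("No significant evidence of association.")
```

Pass the table of absolute counts to the test, not the row or column proportions, since the statistic depends on the sample size.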
Stacked Bar Plot
A crosstab provides a great deal of insight when analyzing two categorical variables. Visually, we can represent the crosstab as a stacked bar chart, with one bar per stream and one segment per specialization.
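A stacked bar chart of this kind can be sketched by plotting the row-proportion crosstab with `DataFrame.plot(kind="bar", stacked=True)`. The rows below are made up, the axis labels and title are illustrative choices, and the "Agg" backend is selected only so that the snippet also runs without a display; in an interactive session you could drop that line and call `plt.show()`.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs

import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration; not the actual MBA data
mba_df = pd.DataFrame({
    "mba_specialization": [
        "Finance", "Finance", "Marketing", "HR",
        "Marketing", "Marketing", "Finance", "HR",
    ],
    "ten_plus_2_stream": [
        "Commerce", "Commerce", "Commerce", "Commerce",
        "Science", "Science", "Science", "Science",
    ],
})

# Row proportions: one bar per stream, each bar stacking up to 1
row_pct = pd.crosstab(
    mba_df["ten_plus_2_stream"],
    mba_df["mba_specialization"],
    normalize="index",
)

ax = row_pct.plot(kind="bar", stacked=True)
ax.set_xlabel("12th Standard stream")
ax.set_ylabel("Proportion of students")
ax.set_title("MBA specialization by 12th Standard stream")
ax.legend(title="MBA specialization")
plt.tight_layout()
```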
Practice exercise: analyze Gender vs. the stream of the student in 12th Standard (the “gender” variable vs. “ten_plus_2_stream”).
In the upcoming blog, we will learn about “Analysis of Two Variables – One Categorical and Other Continuous”