Cross-Tabulation
Unlock the Hidden Relationships in Your Categorical Data
Cross-tabulation reveals how two categorical variables interact — turning raw counts into actionable insights about association and dependence.
- Discover patterns — See how categories of one variable distribute across another
- Compare groups — Use row, column, and total percentages to uncover disparities
- Test significance — Apply chi-square testing to determine if associations are real or due to chance
- Visualize findings — Transform tables into heatmaps that tell a compelling story
Every contingency table is a window into the structure of your data.
What is Cross-Tabulation?
Definition
Cross-tabulation (cross-tab or crosstab) displays the joint distribution of two or more categorical variables in a matrix format.
Cross-Tabulation
A cross-tabulation is a tabular display of the frequency distribution of two or more categorical variables simultaneously. Each cell contains the count (and optionally percentage) of observations that fall into the intersection of specific categories.
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed(42)
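# Sample data: 200 respondents, each column drawn independently at random,
# so no real association between the variables is built in
n = 200
data = pd.DataFrame({
    'Gender': np.random.choice(['Male', 'Female'], n),
    'Preference': np.random.choice(['Tea', 'Coffee', 'Juice'], n),
    'Age_Group': np.random.choice(['18-30', '31-50', '51+'], n)
})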
# Basic cross-tabulation
ct = pd.crosstab(data['Gender'], data['Preference'])
print("Cross-Tabulation: Gender vs Preference")
print(ct)
Row and Column Percentages
# Row percentages (normalize by row)
ct_row = pd.crosstab(data['Gender'], data['Preference'], normalize='index') * 100
print("Row Percentages (%):\n", ct_row.round(1))
# Column percentages (normalize by column)
ct_col = pd.crosstab(data['Gender'], data['Preference'], normalize='columns') * 100
print("\nColumn Percentages (%):\n", ct_col.round(1))
# Overall percentages
ct_all = pd.crosstab(data['Gender'], data['Preference'], normalize='all') * 100
print("\nOverall Percentages (%):\n", ct_all.round(1))
Which Percentage to Use
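- Row percentages: each row sums to 100%; use them to compare the distribution of the column variable across row groups (e.g., preference within each gender)
- Column percentages: each column sums to 100%; use them to compare the composition of each column category (e.g., gender mix within each preference)
- Overall percentages: all cells together sum to 100%; each cell shows its share of the total sample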
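The three normalizations can be checked by hand on a small table. The sketch below uses made-up counts (the numbers and the variable names are illustrative, not taken from the sample data above) and computes each percentage directly from the totals, which is what the normalize argument of crosstab does for you.

```python
import pandas as pd

# Hypothetical counts, chosen only to make the arithmetic easy to follow
counts = pd.DataFrame(
    {"Coffee": [30, 20], "Tea": [10, 40]},
    index=pd.Index(["Female", "Male"], name="Gender"),
)
counts.columns.name = "Preference"

# Row percentages: divide each row by its own total, so every row sums to 100
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100

# Column percentages: divide each column by its own total, so every column sums to 100
col_pct = counts.div(counts.sum(axis=0), axis=1) * 100

# Overall percentages: divide every cell by the grand total
all_pct = counts / counts.to_numpy().sum() * 100

print("Row %:\n", row_pct.round(1))
print("\nColumn %:\n", col_pct.round(1))
print("\nOverall %:\n", all_pct.round(1))
```

Reading the same cell three ways shows the difference: the Female/Coffee cell is 75% of the Female row, 60% of the Coffee column, and 30% of the whole sample.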
Three-Way Cross-Tabulation
# Three-way crosstab: rows are (Gender, Age_Group) pairs, columns are Preference
ct_3way = pd.crosstab(
    [data['Gender'], data['Age_Group']],
    data['Preference'],
    margins=True
)
print("Three-way Cross-Tabulation:")
print(ct_3way)
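A three-way table is a stack of two-way tables, one per level of the outer row variable. The sketch below, built on a tiny hypothetical dataset rather than the random sample above, shows one way to pull out a single layer and to spread the layers across the columns for side-by-side comparison. It omits margins so that the index contains only real categories.

```python
import pandas as pd

# Hypothetical records, kept tiny so the result can be checked by eye
records = pd.DataFrame({
    "Gender":     ["Female", "Female", "Female", "Male", "Male", "Male"],
    "Age_Group":  ["18-30", "18-30", "31-50", "18-30", "31-50", "31-50"],
    "Preference": ["Tea", "Coffee", "Tea", "Coffee", "Coffee", "Tea"],
})

# Rows are indexed by (Gender, Age_Group); columns by Preference
ct3 = pd.crosstab([records["Gender"], records["Age_Group"]], records["Preference"])

# Select the Gender == "Female" layer: an Age_Group x Preference table
female_layer = ct3.loc["Female"]
print(female_layer)

# Move Age_Group from the rows into the columns to compare layers side by side
wide = ct3.unstack("Age_Group", fill_value=0)
print(wide)
```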
Chi-Square Test of Independence
# Test whether Gender and Preference are independent
ct_table = pd.crosstab(data['Gender'], data['Preference'])
chi2, p_value, dof, expected = stats.chi2_contingency(ct_table)
print(f"Chi-square = {chi2:.4f}")
print(f"p-value = {p_value:.4f}")
print(f"Degrees of freedom = {dof}")
# Expected counts under independence, labeled like the observed table
expected_df = pd.DataFrame(expected, index=ct_table.index, columns=ct_table.columns)
print(f"\nExpected frequencies:\n{expected_df.round(1)}")
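To see what chi2_contingency computes, the statistic can be rebuilt by hand. The sketch below uses a hypothetical 2x3 table of counts (not the sample data above): each expected count is row total times column total divided by the grand total, the statistic sums (observed - expected)^2 / expected over all cells, and the degrees of freedom are (rows - 1) x (columns - 1). The result is then compared with SciPy's output.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x3 table of observed counts (rows: groups, columns: categories)
observed = np.array([[30, 20, 10],
                     [20, 30, 40]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals @ col_totals / n

# Pearson chi-square statistic and its degrees of freedom
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# The p-value is the upper-tail probability of the chi-square distribution
p_value = chi2.sf(chi2_stat, dof)

# Cross-check against SciPy (its continuity correction only applies when dof == 1)
ref_stat, ref_p, ref_dof, ref_expected = chi2_contingency(observed)
print(f"manual: chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.4f}")
print(f"scipy:  chi2 = {ref_stat:.4f}, dof = {ref_dof}, p = {ref_p:.4f}")
```

A common rule of thumb is that the chi-square approximation is most trustworthy when expected counts are not too small, so it is worth inspecting the expected table before relying on the p-value.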
Heatmap Visualization
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(ct, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Counts')
sns.heatmap(ct_row, annot=True, fmt='.1f', cmap='Blues', ax=axes[1])
axes[1].set_title('Row Percentages (%)')
plt.tight_layout()
plt.savefig('cross-tabulation.png', dpi=150)
plt.show()
Cross-Tabulation in Machine Learning
| ML Application | Cross-Tab Usage | Why |
|---|---|---|
| EDA | Explore feature relationships | Understand data structure |
| NLP | Word vs category counts | Feature engineering |
| A/B testing | Treatment vs outcome | Quick significance check |
| Data validation | Expected vs actual counts | Detect data issues |
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
np.random.seed(42)
n = 500
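# Labels are drawn independently at random, so no real association is expected
data = pd.DataFrame({
    'predicted': np.random.choice(['cat_a', 'cat_b', 'cat_c'], n),
    'actual': np.random.choice(['cat_a', 'cat_b', 'cat_c'], n)
})
# Rows are predicted labels, columns are actual labels; margins=True adds totals
ct = pd.crosstab(data['predicted'], data['actual'], margins=True)
print("Cross-tabulation (confusion matrix style):")
print(ct)
# Run the test on the table without margins so totals are not counted as cells
chi2, p, dof, _ = chi2_contingency(pd.crosstab(data['predicted'], data['actual']))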
print(f"\nChi-square test: χ² = {chi2:.3f}, p = {p:.4f}")
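The A/B testing row of the table can be sketched the same way. The example below assumes a hypothetical experiment log with one row per user; the column names variant and converted and all counts are invented for illustration. Row-normalizing the crosstab gives the conversion rate per variant, and the chi-square test gives a quick check of whether the difference is larger than chance would suggest.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical experiment log: one row per user (all values are made up)
ab = pd.DataFrame({
    "variant":   ["A"] * 200 + ["B"] * 200,
    "converted": ["yes"] * 30 + ["no"] * 170 + ["yes"] * 50 + ["no"] * 150,
})

# Counts of outcome by variant
ab_table = pd.crosstab(ab["variant"], ab["converted"])
print(ab_table)

# Row percentages give the conversion rate within each variant
conv_rate = pd.crosstab(ab["variant"], ab["converted"], normalize="index")["yes"]
print(conv_rate)

# For a 2x2 table SciPy applies Yates' continuity correction by default
chi2_stat, p_value, dof, _ = chi2_contingency(ab_table)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.4f}")
```

This is only a quick screen. A real experiment would normally also fix the sample size and significance threshold before looking at the data.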
Key Takeaways
- Cross-tabs display the joint distribution of two or more categorical variables in a matrix format.
- Use row percentages to compare distributions across groups, and column percentages to compare the composition of each category.
- The pandas function pd.crosstab() is the primary tool: it supports margins, normalization, and multi-level indexing for flexible analysis.
- The chi-square test of independence assesses whether an observed association is stronger than would be expected by chance if the variables were independent.
Cross-tabulation is the Swiss Army knife of categorical data analysis — simple to construct, yet powerful enough to reveal the structure hidden in your data.