
Cross-Tabulation — Analyzing Relationships Between Categorical Variables


Unlock the Hidden Relationships in Your Categorical Data

Cross-tabulation reveals how two categorical variables interact — turning raw counts into actionable insights about association and dependence.

  • Discover patterns — See how categories of one variable distribute across another
  • Compare groups — Use row, column, and total percentages to uncover disparities
  • Test significance — Apply chi-square testing to determine if associations are real or due to chance
  • Visualize findings — Transform tables into heatmaps that tell a compelling story

Every contingency table is a window into the structure of your data.


What is Cross-Tabulation?

Definition

Cross-tabulation (cross-tab or crosstab) displays the joint distribution of two or more categorical variables in a matrix format.

Definition: Cross-Tabulation

A cross-tabulation is a tabular display of the frequency distribution of two or more categorical variables simultaneously. Each cell contains the count (and optionally percentage) of observations that fall into the intersection of specific categories.
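To make the definition concrete, here is a minimal sketch on a hand-made dataset. The column names mirror the lesson's, but the six rows are invented for illustration. It shows that each cell is simply the number of rows falling into that combination of categories, which matches grouping by both columns and counting.

```python
import pandas as pd

# Hypothetical toy data: six observations, two categorical variables
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
    'Preference': ['Tea', 'Coffee', 'Tea', 'Tea', 'Coffee', 'Coffee'],
})

# Cross-tabulation: rows are Gender, columns are Preference
toy_ct = pd.crosstab(toy['Gender'], toy['Preference'])
print(toy_ct)

# Same counts via groupby: count rows per (Gender, Preference) pair,
# then move the Preference level into the columns
toy_grouped = (
    toy.groupby(['Gender', 'Preference'])
       .size()
       .unstack(fill_value=0)
)
print(toy_grouped)
```

The cell counts always add up to the number of observations, here 6.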

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)

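# Simulated sample data: each column is drawn independently at random,
# so no real association between the variables is expected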
n = 200
data = pd.DataFrame({
    'Gender': np.random.choice(['Male', 'Female'], n),
    'Preference': np.random.choice(['Tea', 'Coffee', 'Juice'], n),
    'Age_Group': np.random.choice(['18-30', '31-50', '51+'], n)
})

# Basic cross-tabulation
ct = pd.crosstab(data['Gender'], data['Preference'])
print("Cross-Tabulation: Gender vs Preference")
print(ct)

Row and Column Percentages

# Row percentages (normalize by row)
ct_row = pd.crosstab(data['Gender'], data['Preference'], normalize='index') * 100
print("Row Percentages (%):\n", ct_row.round(1))

# Column percentages (normalize by column)
ct_col = pd.crosstab(data['Gender'], data['Preference'], normalize='columns') * 100
print("\nColumn Percentages (%):\n", ct_col.round(1))

# Overall percentages
ct_all = pd.crosstab(data['Gender'], data['Preference'], normalize='all') * 100
print("\nOverall Percentages (%):\n", ct_all.round(1))

Which Percentage to Use

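  • Row percentages: each row sums to 100%; use them to compare how the column variable is distributed within each row group (e.g., preference by gender)
  • Column percentages: each column sums to 100%; use them to compare how the row variable is distributed within each column category (e.g., gender by preference)
  • Overall percentages: all cells together sum to 100%; use them to show each cell's share of the total sample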
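The difference between the three percentages is easiest to see on a small table computed by hand. The sketch below uses hypothetical counts (not the lesson's simulated data) and plain division rather than the `normalize` argument, to show what each normalization divides by.

```python
import pandas as pd

# Hypothetical counts, invented for illustration
counts = pd.DataFrame(
    {'Tea': [30, 10], 'Coffee': [20, 40]},
    index=pd.Index(['Female', 'Male'], name='Gender'),
)

# Row percentages: divide each row by its row total, so each row sums to 100
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100

# Column percentages: divide each column by its column total, so each column sums to 100
col_pct = counts.div(counts.sum(axis=0), axis=1) * 100

# Overall percentages: divide every cell by the grand total, so all cells sum to 100
all_pct = counts / counts.values.sum() * 100

print(row_pct.round(1))
print(col_pct.round(1))
print(all_pct.round(1))
```

In this toy table, 60% of women prefer tea (row percentage), while 75% of tea drinkers are women (column percentage). The two numbers answer different questions about the same cell.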

Three-Way Cross-Tabulation

# Three-way crosstab
ct_3way = pd.crosstab(
    [data['Gender'], data['Age_Group']],
    data['Preference'],
    margins=True
)
print("Three-way Cross-Tabulation:")
print(ct_3way)
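A three-way table has a two-level row index, so it helps to know how to pull slices out of it. The following is a sketch on freshly simulated data with the same column names as the lesson. The seed and the variable names `df`, `ct3`, `female_by_age`, `young_by_gender`, and `two_way` are assumptions made for this example, and margins are left out to keep the index simple.

```python
import numpy as np
import pandas as pd

# Simulated data in the same shape as the lesson's example (values are random)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'Gender': rng.choice(['Male', 'Female'], n),
    'Preference': rng.choice(['Tea', 'Coffee', 'Juice'], n),
    'Age_Group': rng.choice(['18-30', '31-50', '51+'], n),
})

# Rows carry a two-level index (Gender, Age_Group); columns are Preference
ct3 = pd.crosstab([df['Gender'], df['Age_Group']], df['Preference'])

# All age groups for one gender: select on the outer index level
female_by_age = ct3.loc['Female']

# One age group across genders: take a cross-section on the inner level
young_by_gender = ct3.xs('18-30', level='Age_Group')

# Summing over Age_Group recovers the two-way Gender x Preference table
two_way = ct3.groupby(level='Gender').sum()

print(female_by_age)
print(young_by_gender)
print(two_way)
```

With `margins=True`, as in the lesson's call, the table also gains an `All` row and column holding the totals.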

Chi-Square Test of Independence

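# Null hypothesis: Gender and Preference are independent
ct_table = pd.crosstab(data['Gender'], data['Preference'])
# Returns the test statistic, p-value, degrees of freedom, and expected counts under independence
chi2, p_value, dof, expected = stats.chi2_contingency(ct_table)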

print(f"Chi-square = {chi2:.4f}")
print(f"p-value    = {p_value:.4f}")
print(f"Degrees of freedom = {dof}")
print(f"\nExpected frequencies:\n{pd.DataFrame(expected, index=ct_table.index, columns=ct_table.columns).round(1)}")
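To show where the statistic and the expected frequencies come from, the sketch below recomputes them by hand and compares the result with `stats.chi2_contingency`. The 2x3 table of observed counts is hypothetical, and the `*_manual` variable names are assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table of observed counts (rows: groups, columns: categories)
observed = np.array([
    [30, 20, 10],
    [20, 30, 40],
])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

# Expected count under independence: E_ij = (row i total) * (column j total) / N
expected_manual = row_totals * col_totals / grand_total

# Pearson statistic: sum over cells of (O - E)^2 / E
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper tail of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)

print(f"manual: chi2 = {chi2_manual:.4f}, p = {p_manual:.4f}, dof = {dof_manual}")
print(f"scipy:  chi2 = {chi2_scipy:.4f}, p = {p_scipy:.4f}, dof = {dof_scipy}")
```

The two results agree here because the table has two degrees of freedom. For a 2x2 table (one degree of freedom), `chi2_contingency` applies Yates' continuity correction by default, so the hand calculation matches only when `correction=False` is passed.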

Heatmap Visualization

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(ct, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Counts')

sns.heatmap(ct_row, annot=True, fmt='.1f', cmap='Blues', ax=axes[1])
axes[1].set_title('Row Percentages (%)')

plt.tight_layout()
plt.savefig('cross-tabulation.png', dpi=150)
plt.show()

Cross-Tabulation in Machine Learning

ML Application  | Cross-Tab Usage               | Why
EDA             | Explore feature relationships | Understand data structure
NLP             | Word vs category counts       | Feature engineering
A/B testing     | Treatment vs outcome          | Quick significance check
Data validation | Expected vs actual counts     | Detect data issues

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

np.random.seed(42)
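n = 500
# Simulated labels: predicted and actual are drawn independently,
# so any agreement between them is due to chance
data = pd.DataFrame({
    'predicted': np.random.choice(['cat_a', 'cat_b', 'cat_c'], n),
    'actual': np.random.choice(['cat_a', 'cat_b', 'cat_c'], n)
})
# Rows are predicted labels, columns are actual labels; margins=True adds 'All' totals
ct = pd.crosstab(data['predicted'], data['actual'], margins=True)
print("Cross-tabulation (confusion matrix style):")
print(ct)

# Run the test on the table without margins so the totals are not treated as categories
chi2, p, dof, _ = chi2_contingency(pd.crosstab(data['predicted'], data['actual']))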
print(f"\nChi-square test: χ² = {chi2:.3f}, p = {p:.4f}")
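The A/B testing row of the table above can be sketched the same way: cross-tabulate treatment against outcome, read the conversion rates off the row percentages, and run a quick chi-square check. The counts below are invented for illustration, and the names `ab`, `variant`, and `converted` are assumptions made for this example.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical A/B test results: 1,000 users per variant (numbers invented)
ab = pd.DataFrame({
    'variant': ['A'] * 1000 + ['B'] * 1000,
    'converted': ['yes'] * 100 + ['no'] * 900 + ['yes'] * 140 + ['no'] * 860,
})

# Treatment vs outcome table
ab_ct = pd.crosstab(ab['variant'], ab['converted'])
print(ab_ct)

# Conversion rate per variant is the row proportion of the 'yes' column
conv_rate = pd.crosstab(ab['variant'], ab['converted'], normalize='index')['yes']
print(conv_rate)

# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default
chi2_ab, p_ab, dof_ab, _ = chi2_contingency(ab_ct)
print(f"chi2 = {chi2_ab:.3f}, p = {p_ab:.4f}, dof = {dof_ab}")
```

This is only a quick check. It says whether the difference in conversion rates is larger than chance would plausibly produce, not how large or how important the difference is.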

Key Takeaways

Cross-tabs display joint distributions of two or more categorical variables in a matrix format.

Use row percentages to compare distributions across groups, and column percentages to compare group compositions within categories.

pandas crosstab() is the primary tool — it supports margins, normalization, and multi-level indexing for flexible analysis.

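The chi-square test of independence assesses whether an observed association is larger than chance alone would plausibly produce if the variables were independent. A small p-value is evidence against independence; it does not measure how strong the association is.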

Cross-tabulation is the Swiss Army knife of categorical data analysis — simple to construct, yet powerful enough to reveal the structure hidden in your data.
