Cross-Tabulation


Unlock the Hidden Relationships in Your Categorical Data

Cross-tabulation shows how two categorical variables relate by counting how often each combination of categories occurs, which makes any association or dependence between them visible.

Discover patterns: see how the categories of one variable are distributed across the categories of another
Compare groups: use row, column, and overall percentages to compare distributions between groups
Test significance: apply a chi-square test of independence to check whether an observed association is larger than chance alone would plausibly produce
Visualize findings: turn tables into heatmaps that make the largest and smallest cells easy to spot

Every contingency table is a window into the structure of your data.

What is Cross-Tabulation?

Definition

Cross-tabulation (also written cross-tab or crosstab) is a tabular display of the joint frequency distribution of two or more categorical variables. Each cell holds the count, and optionally the percentage, of observations that fall into the intersection of one category from each variable. The resulting table is also called a contingency table.

import numpy as np
import pandas as pd
from scipy import stats

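np.random.seed(42)  # fixed seed so the sample is reproducible

# Simulated survey of 200 respondents. Each column is drawn independently,
# so no real association between the variables is built into the data.
n = 200
data = pd.DataFrame({
    'Gender': np.random.choice(['Male', 'Female'], n),
    'Preference': np.random.choice(['Tea', 'Coffee', 'Juice'], n),
    'Age_Group': np.random.choice(['18-30', '31-50', '51+'], n)
})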

# Basic cross-tabulation
ct = pd.crosstab(data['Gender'], data['Preference'])
print("Cross-Tabulation: Gender vs Preference")
print(ct)
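
To make the idea of cell counts concrete, here is a minimal sketch on a tiny hand-made dataset (the six rows below are illustrative values, not the tutorial's random sample). It assumes you also want row and column totals, which `pd.crosstab` can append through its `margins` and `margins_name` parameters.

```python
import pandas as pd

# Six illustrative respondents (hypothetical values chosen so counts are easy to check by eye)
df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
    'Preference': ['Tea', 'Coffee', 'Tea', 'Tea', 'Juice', 'Coffee'],
})

# margins=True appends a totals row and a totals column, labelled by margins_name
ct_small = pd.crosstab(
    df['Gender'], df['Preference'],
    margins=True, margins_name='Total'
)
print(ct_small)
```

Each interior cell is the number of rows in `df` with that Gender and Preference combination, and the `Total` margins are the row and column sums.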

Row and Column Percentages

# Row percentages: each row sums to 100
ct_row = pd.crosstab(data['Gender'], data['Preference'], normalize='index') * 100
print("Row Percentages (%):")
print(ct_row.round(1))

# Column percentages: each column sums to 100
ct_col = pd.crosstab(data['Gender'], data['Preference'], normalize='columns') * 100
print("\nColumn Percentages (%):")
print(ct_col.round(1))

# Overall percentages: all cells together sum to 100
ct_all = pd.crosstab(data['Gender'], data['Preference'], normalize='all') * 100
print("\nOverall Percentages (%):")
print(ct_all.round(1))

Which Percentage to Use

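Row percentages: each row sums to 100%, so they compare how the column variable is distributed within each row category (e.g., drink preference within each gender)
Column percentages: each column sums to 100%, so they compare how the row variable is distributed within each column category (e.g., gender mix within each drink preference)
Overall percentages: all cells together sum to 100%, showing each cell's share of the whole sample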
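
The difference between the three normalizations is easiest to see by checking which totals come out to 100. The following is a small sketch on a hand-made dataset (illustrative values only) that computes all three tables and sums them along the relevant axis.

```python
import numpy as np
import pandas as pd

# Illustrative data (hypothetical values, every drink is chosen at least once)
df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
    'Preference': ['Tea', 'Coffee', 'Tea', 'Tea', 'Juice', 'Coffee'],
})

row_pct = pd.crosstab(df['Gender'], df['Preference'], normalize='index') * 100
col_pct = pd.crosstab(df['Gender'], df['Preference'], normalize='columns') * 100
all_pct = pd.crosstab(df['Gender'], df['Preference'], normalize='all') * 100

# Row percentages sum to 100 across each row
print(row_pct.sum(axis=1))
# Column percentages sum to 100 down each column
print(col_pct.sum(axis=0))
# Overall percentages sum to 100 over the whole table
print(all_pct.to_numpy().sum())
```

A quick rule of thumb: percentage within the groups you want to compare. To compare genders, use row percentages when Gender is the row variable.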

Three-Way Cross-Tabulation

# Three-way crosstab
ct_3way = pd.crosstab(
    [data['Gender'], data['Age_Group']],
    data['Preference'],
    margins=True
)
print("Three-way Cross-Tabulation:")
print(ct_3way)
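
A three-way table built this way has a two-level row index, so it helps to know how to pull out one slice at a time. The sketch below uses a small deterministic dataset (a hypothetical construction, not the tutorial's random sample) and shows two standard pandas selections: `.loc` on the outer level and `.xs` on the inner level.

```python
from itertools import product

import pandas as pd

# Deterministic illustrative data: every combination appears 1, 2, or 3 times
combos = list(product(
    ['Female', 'Male'],
    ['18-30', '31-50', '51+'],
    ['Coffee', 'Juice', 'Tea'],
))
records = [c for i, c in enumerate(combos) for _ in range(1 + i % 3)]
df = pd.DataFrame(records, columns=['Gender', 'Age_Group', 'Preference'])

# Rows are indexed by (Gender, Age_Group); columns by Preference
ct3 = pd.crosstab([df['Gender'], df['Age_Group']], df['Preference'])

# Select one outer-level category: Age_Group x Preference for Female respondents
female = ct3.loc['Female']
print(female)

# Select one inner-level category: Gender x Preference for the 18-30 group
young = ct3.xs('18-30', level='Age_Group')
print(young)
```

Each slice is itself an ordinary two-way table, so it can be normalized or tested like any other crosstab.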

Chi-Square Test of Independence

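# Observed counts: Gender (rows) by Preference (columns)
ct_table = pd.crosstab(data['Gender'], data['Preference'])

# Null hypothesis: Gender and Preference are independent.
# Yates' continuity correction is applied only when dof == 1 (a 2x2 table), so not here.
chi2, p_value, dof, expected = stats.chi2_contingency(ct_table)
print(f"Chi-square = {chi2:.4f}")
print(f"p-value    = {p_value:.4f}")
print(f"Degrees of freedom = {dof}")

# Expected counts under independence, labelled like the observed table
expected_df = pd.DataFrame(expected, index=ct_table.index, columns=ct_table.columns)
print("\nExpected frequencies:")
print(expected_df.round(1))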
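
To see what the test computes, the statistic can be reproduced by hand. Under independence the expected count in each cell is (row total × column total) / n, the statistic is the sum of (observed − expected)² / expected over all cells, and the degrees of freedom are (rows − 1) × (columns − 1). The sketch below uses a made-up 2×3 table of observed counts (illustrative numbers, not output from the code above) and compares the manual result with `stats.chi2_contingency`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical observed counts, chosen so the arithmetic is easy to follow
observed = pd.DataFrame(
    [[30, 20, 10],
     [20, 30, 40]],
    index=['Female', 'Male'],
    columns=['Coffee', 'Juice', 'Tea'],
)
obs = observed.to_numpy()

# Expected counts under independence: row total * column total / n
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)
n = obs.sum()
expected = np.outer(row_totals, col_totals) / n

# Pearson chi-square statistic and its degrees of freedom
chi2_manual = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

# Upper-tail probability of the chi-square distribution gives the p-value
p_manual = stats.chi2.sf(chi2_manual, dof)

# Cross-check against SciPy (correction=False so the plain Pearson statistic is used)
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)

print(f"manual: chi2 = {chi2_manual:.4f}, dof = {dof}, p = {p_manual:.4g}")
print(f"scipy : chi2 = {chi2_scipy:.4f}, dof = {dof_scipy}, p = {p_scipy:.4g}")

# Common rule of thumb: the approximation is usually trusted when all expected counts are >= 5
print("All expected counts >= 5:", bool((expected >= 5).all()))
```

A small p-value (for example below a pre-chosen 0.05 threshold) is evidence against independence; it does not by itself say how strong the association is.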

Heatmap Visualization

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(ct, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Counts')

sns.heatmap(ct_row, annot=True, fmt='.1f', cmap='Blues', ax=axes[1])
axes[1].set_title('Row Percentages (%)')

plt.tight_layout()
plt.savefig('cross-tabulation.png', dpi=150)
plt.show()