Contingency Tables
Test Whether Your Categorical Variables Are Truly Independent
Contingency tables let you move beyond descriptive counts to formally test whether two categorical variables are associated — and how strongly.
- Construct tables — Build frequency matrices that cross-tabulate categorical variables
- Calculate expected frequencies — Determine what counts would look like under independence
- Apply chi-square testing — Quantify whether observed deviations from independence are significant
- Measure association strength — Use Cramér's V and Phi to assess how related your variables are
The chi-square test transforms a table of numbers into a verdict about independence.
What are Contingency Tables?
Definition
A contingency table displays the frequency distribution of two or more categorical variables to analyze their relationship.
Contingency Table
A contingency table (also called a cross-tabulation or crosstab) is a matrix-format table that displays the multivariate frequency distribution of variables. It helps analyze the relationship between two categorical variables.
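Expected Frequency
Under independence, the expected count for each cell is

$$E_{ij} = \frac{R_i \times C_j}{N}$$

Here,
- $R_i$ = Row i total
- $C_j$ = Column j total
- $N$ = Grand total (all observations)
- $E_{ij}$ = Expected frequency for cell (i,j)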
import numpy as np
import pandas as pd
from scipy import stats
# Build a contingency table
data = pd.DataFrame({
    'Treatment': ['Drug']*50 + ['Placebo']*50,
    'Outcome': ['Improved']*35 + ['Not Improved']*15 + ['Improved']*20 + ['Not Improved']*30
})
ct = pd.crosstab(data['Treatment'], data['Outcome'])
print("Contingency Table:")
print(ct)
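To make the rule "row total × column total / grand total" concrete, the following sketch recomputes the expected counts for the Drug/Placebo example directly from the margins. The array `observed` simply hard-codes the counts used above (35/15 and 20/30), and the variable names are illustrative.

```python
import numpy as np

# Observed counts: rows = Drug, Placebo; columns = Improved, Not Improved
observed = np.array([[35, 15],
                     [20, 30]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()        # all observations

# Expected count per cell = row total * column total / grand total,
# computed for every cell at once via an outer product
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

Both rows get the same expected counts here because the two treatment groups are the same size (50 each).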
Chi-Square Test
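# Note: for 2x2 tables SciPy applies Yates' continuity correction by default
# (pass correction=False to get the plain Pearson chi-square)
chi2, p_value, dof, expected = stats.chi2_contingency(ct)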
print(f"\nChi-square statistic = {chi2:.4f}")
print(f"p-value = {p_value:.4f}")
print(f"Degrees of freedom = {dof}")
print(f"\nExpected frequencies:")
print(pd.DataFrame(expected, index=ct.index, columns=ct.columns).round(2))
| Component | Description |
|---|---|
| χ² statistic | Measures discrepancy between observed and expected frequencies |
| df | (rows - 1) × (columns - 1) |
| p-value | Probability of observing a χ² at least this large if the variables are independent |
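As a rough illustration of how the three components in the table fit together, the sketch below computes them by hand for the Drug/Placebo counts, without any continuity correction. It hard-codes the observed counts and uses `scipy.stats.chi2.sf` for the upper-tail probability; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts: rows = Drug, Placebo; columns = Improved, Not Improved
observed = np.array([[35, 15],
                     [20, 30]])

# Expected counts under independence
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper-tail probability of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, dof)

print(f"chi2 = {chi2_manual:.4f}, dof = {dof}, p = {p_manual:.4f}")
```

Note that `stats.chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so its statistic for this table will be somewhat smaller unless `correction=False` is passed.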
Fisher's Exact Test
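When the sample is small and some expected frequencies fall below 5, the chi-square approximation becomes unreliable. Fisher's exact test computes the p-value exactly instead: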
# Small sample example
small_table = np.array([[5, 2], [1, 4]])
odds_ratio, p_fisher = stats.fisher_exact(small_table)
print(f"Small table:\n{small_table}")
print(f"Odds ratio = {odds_ratio:.4f}")
print(f"Fisher p = {p_fisher:.4f}")
When to Use Fisher's Exact Test
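Use Fisher's exact test instead of chi-square when (1) any expected frequency is less than 5 or (2) the total sample size is small (a common rule of thumb is n < 20). Fisher's exact test is most commonly applied to 2×2 tables, such as the one passed to `stats.fisher_exact` above.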
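One way to turn this guidance into code is a small helper that inspects the expected frequencies before choosing a test. The function name `choose_test` and its `min_expected` parameter are hypothetical, the threshold of 5 is the conventional rule of thumb rather than a hard cutoff, and the sketch deliberately handles 2×2 tables only.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq


def choose_test(table, min_expected=5):
    """Hypothetical helper: pick Fisher's exact test or chi-square for a 2x2 table."""
    table = np.asarray(table)
    if table.shape != (2, 2):
        raise ValueError("this sketch only handles 2x2 tables")
    expected = expected_freq(table)
    if (expected < min_expected).any():
        # Some expected counts are too small for the chi-square approximation
        _, p = stats.fisher_exact(table)
        return "fisher", p
    _, p, _, _ = stats.chi2_contingency(table)
    return "chi-square", p


small_table = np.array([[5, 2], [1, 4]])      # small sample from the example above
large_table = np.array([[35, 15], [20, 30]])  # Drug/Placebo counts

print(choose_test(small_table))
print(choose_test(large_table))
```

For the small table every expected count is below 5, so the helper falls back to Fisher's exact test; the Drug/Placebo table has ample counts and goes through chi-square.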
Measures of Association
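# Association measures are conventionally based on the uncorrected chi-square,
# so recompute it without Yates' continuity correction
chi2_raw, _, _, _ = stats.chi2_contingency(ct, correction=False)
# Cramér's V for any table size
n = ct.sum().sum()
r, c = ct.shape
v = np.sqrt(chi2_raw / (n * min(r-1, c-1)))
print(f"Cramér's V = {v:.4f}")
# Phi for 2x2 tables (equals Cramér's V in the 2x2 case)
phi = np.sqrt(chi2_raw / n)
print(f"Phi (φ) = {phi:.4f}")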
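SciPy also ships a ready-made function for this measure. The sketch below assumes SciPy 1.7 or newer, where `scipy.stats.contingency.association` is available, and checks it against the manual formula on the Drug/Placebo counts (hard-coded here for self-containment).

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import association

# Observed counts: rows = Drug, Placebo; columns = Improved, Not Improved
observed = np.array([[35, 15],
                     [20, 30]])

# Cramér's V from SciPy (no continuity correction by default)
v_scipy = association(observed, method="cramer")

# Same quantity by hand, using the uncorrected chi-square statistic
chi2_raw = stats.chi2_contingency(observed, correction=False)[0]
n = observed.sum()
v_manual = np.sqrt(chi2_raw / (n * (min(observed.shape) - 1)))

print(f"association(): {v_scipy:.4f}, manual: {v_manual:.4f}")
```

The two values should agree, which is a quick sanity check that the hand-written formula uses the same (uncorrected) statistic as the library.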
Contingency Tables in Machine Learning
| ML Application | Usage | Why |
|---|---|---|
| Confusion matrix | Prediction vs actual | Core classification metric |
| Lift table | Predicted probability vs actual outcome | Model calibration |
| Feature analysis | Two categorical features | Relationship discovery |
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
np.random.seed(42)
# Simulated labels: predictions are drawn independently of the truth,
# so the accuracy below reflects chance agreement only
y_true = np.random.choice([0, 1, 2], 300, p=[0.5, 0.3, 0.2])
y_pred = np.random.choice([0, 1, 2], 300, p=[0.5, 0.3, 0.2])
ct = pd.DataFrame(confusion_matrix(y_true, y_pred),
                  columns=['Pred 0', 'Pred 1', 'Pred 2'],
                  index=['True 0', 'True 1', 'True 2'])
print("Confusion matrix (contingency table):")
print(ct)
print(f"\nAccuracy: {np.trace(ct.values) / ct.values.sum():.3f}")
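Because a confusion matrix is just a contingency table of true versus predicted labels, the same table can be built with `pd.crosstab`, and per-class metrics fall out of its margins. The sketch below uses a tiny hand-written set of labels (made up for illustration, not taken from the example above) so the numbers are easy to verify.

```python
import numpy as np
import pandas as pd

# Toy labels, invented for illustration
y_true = pd.Series([0, 0, 0, 0, 1, 1, 1, 2, 2, 2], name="True")
y_pred = pd.Series([0, 0, 1, 0, 1, 1, 2, 2, 2, 0], name="Pred")

# Rows = true class, columns = predicted class.
# Caveat: crosstab only lists labels that actually occur, so the table is
# square only if every class appears in both series (as it does here).
cm = pd.crosstab(y_true, y_pred)

diag = np.diag(cm.values)                   # correctly classified counts
recall = diag / cm.sum(axis=1).values       # diagonal / row totals
precision = diag / cm.sum(axis=0).values    # diagonal / column totals
accuracy = diag.sum() / cm.values.sum()

print(cm)
print("Recall per class:   ", recall.round(3))
print("Precision per class:", precision.round(3))
print(f"Accuracy: {accuracy:.3f}")
```

Recall divides each diagonal entry by its row total and precision by its column total, which is why both can be read straight off the margins of the table.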
Key Takeaways
- Contingency tables show joint frequency distributions of categorical variables in a matrix format.
- Expected frequency = (row total × column total) / grand total, the count you'd expect if the variables were independent.
- The chi-square test assesses whether variables are independent, while Fisher's exact test is preferred for small samples.
- Cramér's V quantifies the strength of association (0 = none, 1 = perfect), giving you a sense of practical significance beyond p-values.
- Always check expected frequencies before interpreting chi-square results: the test's validity depends on having enough observations in each cell.