Introduction
Advanced Statistical Methods
Revealing Structure in Contingency Tables
Correspondence analysis decomposes contingency tables into principal coordinates, visualizing associations between row and column categories in a low-dimensional map. Chi-square distance drives the geometry.
- Market research — Map relationships between product attributes and consumer preferences
- Linguistics — Visualize word associations across different text corpora
- Sociology — Explore associations between demographic categories and survey responses
CA turns cross-tabulated counts into revealing geometric maps of association.
Correspondence Analysis (CA) is a dimension reduction technique for categorical data contained in a contingency table. Developed primarily by Jean-Paul Benzécri (1973) and popularized by Greenacre (1984), CA decomposes the chi-square statistic into orthogonal components, yielding a low-dimensional geometric representation of rows and columns.
Unlike PCA, which operates on continuous data with Euclidean distance, CA uses the chi-square distance — a metric naturally suited to frequency data where the magnitude of counts varies across categories.
Simple Correspondence Analysis
The Contingency Table
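Let $N$ be an $I \times J$ contingency table with entries $n_{ij}$, row marginals $n_{i\cdot} = \sum_j n_{ij}$, column marginals $n_{\cdot j} = \sum_i n_{ij}$, and grand total $n = \sum_i \sum_j n_{ij}$.
Definition: Correspondence Matrix
The correspondence matrix (relative frequency matrix) is:
$$P = \frac{1}{n} N, \qquad p_{ij} = \frac{n_{ij}}{n}$$
where $p_{ij}$ is the cell proportion. Row and column profiles are defined as:
$$\mathbf{a}_i = \frac{1}{r_i}\,\mathbf{p}_{i\cdot}, \qquad \mathbf{b}_j = \frac{1}{c_j}\,\mathbf{p}_{\cdot j}$$
where $\mathbf{p}_{i\cdot}$ is the $i$-th row of $P$, $\mathbf{p}_{\cdot j}$ is its $j$-th column, $r_i = \sum_j p_{ij} = n_{i\cdot}/n$ is the row mass, and $c_j = \sum_i p_{ij} = n_{\cdot j}/n$ is the column mass.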
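As a minimal sketch of these definitions, assuming NumPy and a small made-up table of counts (the numbers are hypothetical and not taken from any study), the correspondence matrix, masses, and profiles can be computed as follows:

```python
import numpy as np

# Hypothetical 3 x 4 table of counts, used only for illustration
N = np.array([
    [20, 10,  5,  5],
    [10, 20, 10, 10],
    [ 5, 10, 20, 25],
], dtype=float)

n = N.sum()                      # grand total
P = N / n                        # correspondence matrix, entries p_ij
r = P.sum(axis=1)                # row masses r_i
c = P.sum(axis=0)                # column masses c_j

row_profiles = P / r[:, None]    # each row sums to 1
col_profiles = P / c[None, :]    # each column sums to 1

print("row masses:", np.round(r, 3))
print("column masses:", np.round(c, 3))
print("row profiles:\n", np.round(row_profiles, 3))
```

The mass-weighted average of the row profiles reproduces the column masses, which is why the vector of column masses is often called the average row profile.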
Chi-Square Distance
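Definition: Chi-Square Distance
The squared chi-square distance between rows $i$ and $i'$ is:
$$d_{\chi^2}^2(i, i') = \sum_{j=1}^{J} \frac{1}{c_j}\left(\frac{p_{ij}}{r_i} - \frac{p_{i'j}}{r_{i'}}\right)^2$$
This is the weighted squared Euclidean distance between row profiles, weighted by the inverse of the column marginals.
The chi-square distance is symmetric and non-negative, and it equals zero if and only if the two row profiles are identical. The weighting by $1/c_j$ gives differences in columns with small marginals more weight, so rare categories are not swamped by frequent ones.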
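The properties just listed can be checked numerically. The sketch below assumes NumPy; the helper name `chi2_distance_sq` and the counts are hypothetical, chosen so that one row is an exact multiple of another.

```python
import numpy as np

def chi2_distance_sq(N, i, k):
    """Squared chi-square distance between rows i and k of a count table N."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    diff = P[i] / r[i] - P[k] / r[k]     # difference between the two row profiles
    return float(np.sum(diff ** 2 / c))  # each column weighted by 1 / c_j

# Hypothetical counts: row 1 is exactly twice row 0, so their profiles coincide
N = np.array([
    [10, 20, 30],
    [20, 40, 60],
    [30, 20, 10],
])

d01 = chi2_distance_sq(N, 0, 1)
d02 = chi2_distance_sq(N, 0, 2)
d20 = chi2_distance_sq(N, 2, 0)
print("d(0,1) =", d01)
print("d(0,2) =", d02, " d(2,0) =", d20)
```

Rows 0 and 1 differ in size but not in profile, so their distance vanishes; this is the sense in which the chi-square distance compares shapes of rows rather than raw counts.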
The Inertia Decomposition
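Definition: Total Inertia
The total inertia (chi-square statistic divided by $n$) is:
$$\phi^2 = \frac{\chi^2}{n} = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(p_{ij} - r_i c_j)^2}{r_i c_j}$$
This measures the total departure from independence (no association) between rows and columns.
Theorem: Inertia Decomposition
The total inertia decomposes as:
$$\phi^2 = \sum_{k=1}^{K} \lambda_k, \qquad K = \min(I-1,\, J-1)$$
where $\lambda_1 \ge \dots \ge \lambda_K$ are the (at most $K$) nonzero eigenvalues of the matrix $S^\top S$ (or equivalently $S S^\top$), where $S = D_r^{-1/2}(P - \mathbf{r}\mathbf{c}^\top) D_c^{-1/2}$ is the matrix of standardized residuals, $D_r = \mathrm{diag}(\mathbf{r})$ and $D_c = \mathrm{diag}(\mathbf{c})$. Each $\lambda_k$ represents the inertia explained by the $k$-th dimension.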
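To illustrate the decomposition, the following sketch (assuming NumPy and a hypothetical table of counts) computes the total inertia from the Pearson chi-square statistic and compares it with the eigenvalues of $S^\top S$, where $S$ holds the standardized residuals $(p_{ij} - r_i c_j)/\sqrt{r_i c_j}$.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts, used only for illustration
N = np.array([
    [20, 10,  5,  5],
    [10, 20, 10, 10],
    [ 5, 10, 20, 25],
], dtype=float)

n = N.sum()
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)
E = np.outer(r, c)                             # proportions expected under independence

# Total inertia as the Pearson chi-square statistic divided by n
chi2 = n * np.sum((P - E) ** 2 / E)
total_inertia = chi2 / n

# Eigenvalues of S^T S, where S holds the standardized residuals
S = (P - E) / np.sqrt(E)
eigvals = np.linalg.eigvalsh(S.T @ S)[::-1]    # sorted, largest first
K = min(N.shape) - 1                           # min(I - 1, J - 1)
lam = eigvals[:K]

print("total inertia:", round(total_inertia, 6))
print("principal inertias:", np.round(lam, 6))
print("share per dimension:", np.round(lam / lam.sum(), 4))
```

Only the first $K$ eigenvalues are kept; the remaining ones are zero up to rounding because centering removes one dimension.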
Principal Coordinates
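Definition: Principal Coordinates
The principal coordinates of row $i$ on dimension $k$ are:
$$f_{ik} = \frac{\sqrt{\lambda_k}\, u_{ik}}{\sqrt{r_i}}$$
where $u_{ik}$ is the $i$-th element of the $k$-th eigenvector of $S S^\top$, with $S$ the matrix of standardized residuals $(p_{ij} - r_i c_j)/\sqrt{r_i c_j}$. Equivalently, in matrix form:
$$F = D_r^{-1/2}\, U\, \Lambda^{1/2}$$
where $U$ contains the eigenvectors and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_K)$ the eigenvalues.
The standard coordinates are obtained by dividing principal coordinates by the square root of the eigenvalue:
$$x_{ik} = \frac{f_{ik}}{\sqrt{\lambda_k}} = \frac{u_{ik}}{\sqrt{r_i}}$$
Standard coordinates are used in asymmetric maps and in the transition formulas that project supplementary rows or columns onto the existing axes.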
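Two consequences of these definitions are easy to verify: the mass-weighted variance of the principal coordinates on each axis equals that axis's eigenvalue, and Euclidean distances between row points in the full principal-coordinate space equal chi-square distances between the row profiles. The sketch below assumes NumPy and a hypothetical table of counts.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts, used only for illustration
N = np.array([
    [20, 10,  5,  5],
    [10, 20, 10, 10],
    [ 5, 10, 20, 25],
], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
E = np.outer(r, c)
S = (P - E) / np.sqrt(E)                   # standardized residuals

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
K = min(N.shape) - 1                       # number of non-trivial dimensions
U, sv = U[:, :K], sv[:K]
lam = sv ** 2                              # principal inertias

X = U / np.sqrt(r)[:, None]                # standard coordinates of the rows
F = X * sv                                 # principal coordinates of the rows

# Mass-weighted variance of each principal axis
weighted_var = (r[:, None] * F ** 2).sum(axis=0)

# Squared map distance between rows 0 and 1 vs. their chi-square distance
profiles = P / r[:, None]
d_chi2 = np.sum((profiles[0] - profiles[1]) ** 2 / c)
d_map = np.sum((F[0] - F[1]) ** 2)

print("weighted variances:", np.round(weighted_var, 6))
print("eigenvalues:       ", np.round(lam, 6))
print("chi-square distance:", round(d_chi2, 6), " map distance:", round(d_map, 6))
```

The distance identity holds only when all $K$ dimensions are kept; in a two-dimensional plot the map distance is an approximation whose accuracy depends on the inertia retained.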
The Symmetric Map
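Definition: Symmetric Map
In the symmetric map, rows and columns are plotted together, both in principal coordinates:
$$f_{ik} = \frac{\sqrt{\lambda_k}\, u_{ik}}{\sqrt{r_i}}, \qquad g_{jk} = \frac{\sqrt{\lambda_k}\, v_{jk}}{\sqrt{c_j}}$$
where $\mathbf{u}_k$ and $\mathbf{v}_k$ are the $k$-th left and right eigenvectors (the singular vectors of the standardized residual matrix). Distances between rows and columns cannot be interpreted directly; instead, the origin-to-point distances and the angles between points from the same set are meaningful.
In the symmetric map:
- The squared distance from a point to the origin is the chi-square distance between its profile and the average profile (exactly so when all dimensions are retained, since the centroid sits at the origin); multiplied by the point's mass, it gives the point's contribution to total inertia
- Points close to the origin are near the average profile
- A row point and a column point lying in the same direction from the origin suggest that the row category is over-represented in that column category; confirm this against the table, because the map does not define a row-to-column distance
- Never compute distances between row and column points in the symmetric map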
Contributions and Cosines
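Definition: Contribution of Point to Axis
The contribution of row $i$ to the $k$-th axis is:
$$\mathrm{CTR}_{ik} = \frac{r_i f_{ik}^2}{\lambda_k}$$
This measures how much row $i$ participates in defining dimension $k$; the contributions of all rows to a given axis sum to 1.
Definition: Quality (Cos²)
The quality (squared cosine) of row $i$ on the first $m$ dimensions is:
$$\mathrm{QLT}_i^{(m)} = \sum_{k=1}^{m} \cos^2_{ik} = \frac{\sum_{k=1}^{m} f_{ik}^2}{\sum_{k=1}^{K} f_{ik}^2}$$
This measures the proportion of row $i$'s total inertia that is captured by the $m$-dimensional subspace.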
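A short numerical sketch of both diagnostics follows, assuming NumPy. The helper `row_principal_coords` and the table are hypothetical; the contribution is taken as mass times squared coordinate divided by the eigenvalue, and the quality as the share of a row's squared distance to the origin that the first `m` axes reproduce.

```python
import numpy as np

def row_principal_coords(N):
    """Return row masses, principal inertias and row principal coordinates."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    E = np.outer(r, c)
    U, sv, _ = np.linalg.svd((P - E) / np.sqrt(E), full_matrices=False)
    K = min(N.shape) - 1                       # drop the trivial zero dimension
    F = (U[:, :K] / np.sqrt(r)[:, None]) * sv[:K]
    return r, sv[:K] ** 2, F

# Hypothetical 3 x 4 table of counts, used only for illustration
N = [
    [20, 10,  5,  5],
    [10, 20, 10, 10],
    [ 5, 10, 20, 25],
]
r, lam, F = row_principal_coords(N)

# Contribution of row i to axis k: mass times squared coordinate over eigenvalue
ctr = r[:, None] * F ** 2 / lam

# Quality of each row on the first m axes
m = 1
quality = (F[:, :m] ** 2).sum(axis=1) / (F ** 2).sum(axis=1)

print("contributions (each column sums to 1):\n", np.round(ctr, 4))
print("quality on the first axis:", np.round(quality, 4))
```

A row can contribute little to an axis and still be well represented on it, or the reverse, which is why the two measures are usually reported side by side.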
Multiple Correspondence Analysis
The Indicator Matrix
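For $Q$ categorical variables with $J_q$ levels each ($q = 1, \dots, Q$), define the indicator matrix $Z$ of dimension $n \times J$, where $J = \sum_{q=1}^{Q} J_q$:
Definition: Indicator Matrix
$$z_{ij} = \begin{cases} 1 & \text{if individual } i \text{ falls in category } j \\ 0 & \text{otherwise} \end{cases}$$
Each row of $Z$ has exactly $Q$ ones (one per variable).
MCA is equivalent to CA of the Burt matrix $B$:
$$B = Z^\top Z, \qquad Z = [Z_1, Z_2, \dots, Z_Q]$$
where $Z_q$ is the $n \times J_q$ indicator matrix for variable $q$. The two analyses give the same standard coordinates for the categories, and each Burt eigenvalue is the square of the corresponding indicator eigenvalue.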
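The construction can be sketched in a few lines, assuming NumPy. The responses, variable names, and the helper `indicator_block` are hypothetical; the Burt matrix is formed as the cross-product of the stacked indicator blocks.

```python
import numpy as np

# Hypothetical responses: 6 individuals, Q = 2 categorical variables
data = {
    "colour": ["red", "blue", "red", "green", "blue", "red"],
    "size":   ["S", "L", "L", "S", "S", "L"],
}

def indicator_block(values):
    """One-hot encode a single variable; columns follow the sorted level names."""
    levels = sorted(set(values))
    return np.array([[1 if v == lev else 0 for lev in levels] for v in values])

blocks = [indicator_block(v) for v in data.values()]
Z = np.hstack(blocks)      # n x J indicator matrix
B = Z.T @ Z                # J x J Burt matrix
Q = len(data)

print("Z shape:", Z.shape)
print("ones per row:", Z.sum(axis=1))
print("Burt matrix:\n", B)
```

The diagonal blocks of `B` are diagonal matrices of category counts, and each off-diagonal block is the two-way contingency table of a pair of variables.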
Adjusted Inertia (Greenacre Correction)
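The eigenvalues from MCA of the Burt matrix are inflated, because the diagonal blocks of the Burt matrix cross each variable with itself. Greenacre (1993) proposed the correction:
Definition: Adjusted Eigenvalues
$$\lambda_k^{\text{adj}} = \left(\frac{Q}{Q-1}\right)^2 \left(\sqrt{\lambda_k} - \frac{1}{Q}\right)^2 \quad \text{for } \sqrt{\lambda_k} > \frac{1}{Q}$$
where $\lambda_k$ is the raw eigenvalue from the Burt matrix and $Q$ is the number of variables; dimensions with $\sqrt{\lambda_k} \le 1/Q$ are discarded. The adjusted eigenvalues provide a more accurate decomposition of the total inertia.
The total inertia in MCA equals $(J - Q)/Q$ for the indicator matrix (or $\sum_k \lambda_k$, the sum of the squared indicator eigenvalues, for the Burt matrix before adjustment).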
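A minimal sketch of the adjustment follows, assuming its usual form $\left(\frac{Q}{Q-1}\right)^2\left(\sqrt{\lambda_k} - \frac{1}{Q}\right)^2$ applied only where $\sqrt{\lambda_k} > 1/Q$, with $\lambda_k$ a raw Burt eigenvalue. The function name and the input values are hypothetical.

```python
import numpy as np

def adjust_burt_eigenvalues(burt_eigvals, Q):
    """Rescale raw Burt-matrix eigenvalues; drop those with sqrt(lambda) <= 1/Q."""
    lam = np.asarray(burt_eigvals, dtype=float)
    root = np.sqrt(lam)                  # these are the indicator-matrix eigenvalues
    keep = root > 1.0 / Q                # only dimensions above the 1/Q threshold
    return (Q / (Q - 1.0)) ** 2 * (root[keep] - 1.0 / Q) ** 2

# Hypothetical raw Burt eigenvalues for Q = 3 variables
raw = [0.25, 0.16, 0.09]
adj = adjust_burt_eigenvalues(raw, Q=3)
print("adjusted eigenvalues:", np.round(adj, 4))
```

With these inputs the third dimension falls below the threshold (its square root is 0.3, under 1/3) and is dropped, while the first two shrink markedly relative to the raw values.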
Factor Scores
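$$F = \frac{1}{Q}\, Z\, D_c^{-1/2}\, V$$
where $D_c = \mathrm{diag}(\mathbf{c})$ holds the category masses $c_j = n_j/(nQ)$ (with $n_j$ the number of individuals in category $j$) and the columns of $V$ are normalized eigenvectors. Each individual's score is thus the average of the standard coordinates of the $Q$ categories it belongs to.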
Connection to Chi-Square Test
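Theorem: CA and Chi-Square Independence
The chi-square test of independence for the contingency table has:
$$\chi^2 = n\,\phi^2 = n \sum_{k=1}^{K} \lambda_k$$
Under $H_0$ (independence), $\chi^2$ is asymptotically distributed as $\chi^2_{(I-1)(J-1)}$. CA provides a geometric decomposition of this chi-square into orthogonal axes, with each axis contributing $n\lambda_k$ to the total.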
Python Implementation
import numpy as np
from prince import CA, MCA
import pandas as pd
# --- Simple Correspondence Analysis ---
# Create a contingency table (e.g., smoking by profession)
data = np.array([
    [4, 2, 3, 2, 3],     # Doctors
    [4, 3, 5, 5, 5],     # Lawyers
    [25, 10, 4, 6, 5],   # Engineers
])
row_labels = ["Doctors", "Lawyers", "Engineers"]
col_labels = ["None", "Light", "Medium", "Heavy", "Very Heavy"]
df = pd.DataFrame(data, index=row_labels, columns=col_labels)
# --- Manual CA computation ---
def correspondence_analysis(N):
    """Simple CA of a contingency table via SVD of the standardized residuals."""
    N = np.asarray(N, dtype=float)
    I, J = N.shape
    n = N.sum()
    P = N / n                           # correspondence matrix
    r = P.sum(axis=1)                   # row marginals (masses)
    c = P.sum(axis=0)                   # column marginals (masses)
    # Standardized residuals
    E = np.outer(r, c)                  # expected under independence
    S = (P - E) / np.sqrt(E)
    # SVD of S; the squared singular values are the principal inertias
    U, D, Vt = np.linalg.svd(S, full_matrices=False)
    # Keep only the min(I-1, J-1) non-trivial dimensions: the last singular
    # value is zero up to rounding and would corrupt the divisions below
    K = min(I, J) - 1
    U, D, Vt = U[:, :K], D[:K], Vt[:K, :]
    lam = D**2
    # Principal coordinates
    F_row = np.diag(1.0 / np.sqrt(r)) @ U @ np.diag(np.sqrt(lam))
    F_col = np.diag(1.0 / np.sqrt(c)) @ Vt.T @ np.diag(np.sqrt(lam))
    # Contributions of each point to each axis (columns sum to 1)
    ctr_row = (r[:, None] * F_row**2) / lam[None, :]
    ctr_col = (c[:, None] * F_col**2) / lam[None, :]
    # Squared cosines (quality) per axis (rows sum to 1)
    cos2_row = F_row**2 / (F_row**2).sum(axis=1, keepdims=True)
    cos2_col = F_col**2 / (F_col**2).sum(axis=1, keepdims=True)
    # Total inertia and the corresponding chi-square statistic
    total_inertia = lam.sum()
    chi2 = n * total_inertia
    return {
        'eigenvalues': lam,
        'total_inertia': total_inertia,
        'chi2': chi2,
        'row_coords': F_row,
        'col_coords': F_col,
        'row_contrib': ctr_row,
        'col_contrib': ctr_col,
        'row_cos2': cos2_row,
        'col_cos2': cos2_col,
    }
result = correspondence_analysis(data)
print("Eigenvalues:", result['eigenvalues'])
print("Total inertia:", result['total_inertia'])
print("Chi-square:", result['chi2'])
print("Row principal coords:\n", np.round(result['row_coords'], 4))
print("Column principal coords:\n", np.round(result['col_coords'], 4))
print("Row contributions:\n", np.round(result['row_contrib'], 4))
# --- Using prince library ---
ca = CA(n_components=2, n_iter=10, random_state=42)
ca.fit(df)
print("\n--- prince CA ---")
print("Eigenvalues:", ca.eigenvalues_)
print("Row coordinates:\n", ca.row_coordinates(df))
print("Column coordinates:\n", ca.column_coordinates(df))
# --- Multiple Correspondence Analysis ---
np.random.seed(42)
n = 200
mca_data = pd.DataFrame({
'Education': np.random.choice(['HighSchool', 'Bachelor', 'Master', 'PhD'], n),
'Region': np.random.choice(['North', 'South', 'East', 'West'], n),
'Occupation': np.random.choice(['Engineer', 'Teacher', 'Doctor', 'Artist'], n),
})
mca = MCA(n_components=2, n_iter=10, random_state=42)
mca.fit(mca_data)
print("\n--- MCA ---")
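# Note: these are the eigenvalues as reported by the fitted model; unless a
# correction is applied explicitly, do not read them as adjusted eigenvalues.
print("Eigenvalues:", mca.eigenvalues_)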
print("Row coordinates shape:", mca.row_coordinates(mca_data).shape)
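# --- Symmetric map coordinates ---
def symmetric_map(N):
    """Row and column principal coordinates, as plotted in the symmetric map."""
    N = np.asarray(N, dtype=float)
    I, J = N.shape
    n = N.sum()
    P = N / n
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    # Standardized residuals S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = np.diag(1.0 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1.0 / np.sqrt(c))
    # SVD, keeping the min(I-1, J-1) non-trivial dimensions
    U, D, Vt = np.linalg.svd(S, full_matrices=False)
    K = min(I, J) - 1
    U, D, Vt = U[:, :K], D[:K], Vt[:K, :]
    lam = D**2
    # Symmetric map: rows and columns are both in principal coordinates,
    # i.e. divided by the square root of the mass and scaled by the singular value
    F_sym = np.diag(1.0 / np.sqrt(r)) @ U @ np.diag(D)
    G_sym = np.diag(1.0 / np.sqrt(c)) @ Vt.T @ np.diag(D)
    return F_sym, G_sym, lam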
F_sym, G_sym, lam = symmetric_map(data)
print("\nSymmetric row coords:\n", np.round(F_sym, 4))
print("Symmetric col coords:\n", np.round(G_sym, 4))
Interpretation Rules
Interpreting Correspondence Analysis:
- Number of dimensions: Examine the eigenvalue scree plot. The total number of dimensions is $\min(I-1, J-1)$. The first two often capture most of the inertia, but check the cumulative percentage rather than assuming it.
- Inertia explained: Each eigenvalue $\lambda_k$ is the inertia on dimension $k$. Report the cumulative percentage of total inertia.
- Proximity: In the symmetric map, a row point and a column point lying in the same direction from the origin suggest association. Specifically, row $i$ is associated with column $j$ if $p_{ij} > r_i c_j$ (above independence), so confirm the visual impression against the table.
- Origin: Points near the origin have profiles close to the average, so they are not distinctive.
- Contribution: High contribution (above the average, that is $\mathrm{CTR} > 1/I$ for rows or $1/J$ for columns) indicates the point defines the axis. Check both row and column contributions.
- Quality ($\cos^2$): Low quality means the point is poorly represented in the low-dimensional display, so its position may be misleading.
- Interpretation flow: First identify the axes (via contributions and profiles), then interpret the spatial configuration (proximity, origin distance), then assess reliability (quality, inertia).
- Avoid: Interpreting distances between row and column points in the symmetric map as chi-square distances. Use the symmetric map only for relative positioning.
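The association criterion in the proximity rule can be checked directly on the table rather than read off the map. The sketch below assumes NumPy; the labels and counts are hypothetical. It computes the ratio $p_{ij} / (r_i c_j)$, which exceeds 1 exactly when a cell is over-represented relative to independence.

```python
import numpy as np

row_labels = ["A", "B", "C"]            # hypothetical row categories
col_labels = ["w", "x", "y", "z"]       # hypothetical column categories
N = np.array([
    [20, 10,  5,  5],
    [10, 20, 10, 10],
    [ 5, 10, 20, 25],
], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
ratio = P / np.outer(r, c)              # p_ij / (r_i c_j); 1 means independence

# List the cells that are over-represented relative to independence
for i, j in zip(*np.nonzero(ratio > 1)):
    print(f"{row_labels[i]} x {col_labels[j]}: ratio = {ratio[i, j]:.2f}")
```

Comparing this list with the row and column points that share a direction in the map is a quick way to confirm or reject a visual impression before reporting it.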
Extensions
Canonical Correspondence Analysis (CCA) combines CA with constrained ordination, relating community composition to environmental variables. Non-symmetric correspondence analysis decomposes the numerator of the Goodman-Kruskal $\tau$ index rather than the chi-square statistic, focusing on how column categories explain row categories (or vice versa). Joint correspondence analysis fits only the off-diagonal blocks of the Burt matrix, reducing the inflation effect without the algebraic correction.