Introduction

Advanced Statistical Methods

Revealing Structure in Contingency Tables

Correspondence analysis decomposes contingency tables into principal coordinates, visualizing associations between row and column categories in a low-dimensional map. Chi-square distance drives the geometry.

Market research — Map relationships between product attributes and consumer preferences
Linguistics — Visualize word associations across different text corpora
Sociology — Explore associations between demographic categories and survey responses

CA turns cross-tabulated counts into revealing geometric maps of association.

Correspondence Analysis (CA) is a dimension reduction technique for categorical data contained in a contingency table. Developed primarily by Jean-Paul Benzécri (1973) and popularized by Greenacre (1984), CA decomposes the chi-square statistic into orthogonal components, yielding a low-dimensional geometric representation of rows and columns.

Unlike PCA, which operates on continuous data with Euclidean distance, CA uses the chi-square distance — a metric naturally suited to frequency data where the magnitude of counts varies across categories.

Simple Correspondence Analysis

The Contingency Table

Let $\mathbf{N}$ be an $I \times J$ contingency table with entries $n_{ij} \geq 0$ , row marginals $n_{i+} = \sum_j n_{ij}$ , column marginals $n_{+j} = \sum_i n_{ij}$ , and grand total $n = \sum_{ij} n_{ij}$ .

Definition: Correspondence Matrix

The correspondence matrix (relative frequency matrix) is:

\mathbf{P} = \frac{1}{n}\mathbf{N} = [p_{ij}]

where $p_{ij} = n_{ij}/n$ is the cell proportion. Row and column profiles are defined as:

\mathbf{r}_i = \frac{1}{p_{i+}}\mathbf{p}_i^{\top} \quad (\text{row profile}), \qquad \mathbf{c}_j = \frac{1}{p_{+j}}\mathbf{p}_j \quad (\text{column profile})

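where $\mathbf{p}_i$ is the $i$-th row of $\mathbf{P}$, $\mathbf{p}_j$ is its $j$-th column, $p_{i+} = \sum_j p_{ij}$ is the row mass, and $p_{+j} = \sum_i p_{ij}$ is the column mass. Each profile sums to 1.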
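The definitions above can be sketched in a few lines of NumPy. This is only an illustration; it reuses the small smoking-by-profession table from the Python section of this article, and the variable names are chosen here for readability rather than taken from any library.

```python
import numpy as np

# Contingency table (rows: Doctors, Lawyers, Engineers)
N = np.array([[4, 2, 3, 2, 3],
              [4, 3, 5, 5, 5],
              [25, 10, 4, 6, 5]], dtype=float)

P = N / N.sum()                 # correspondence matrix, entries p_ij
row_mass = P.sum(axis=1)        # p_{i+}
col_mass = P.sum(axis=0)        # p_{+j}

# Profiles: each row (or column) of P rescaled to sum to 1
row_profiles = P / row_mass[:, None]
col_profiles = P / col_mass[None, :]

print(np.round(row_profiles, 3))
```

Dividing by the masses removes the effect of group size, so a profession with few respondents can be compared directly with a large one.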

Chi-Square Distance

Definition: Chi-Square Distance

The chi-square distance between rows $i$ and $i'$ is:

d_{\chi^2}^2(i, i') = \sum_{j=1}^{J} \frac{1}{p_{+j}} \left(\frac{p_{ij}}{p_{i+}} - \frac{p_{i'j}}{p_{i'+}}\right)^2 = \sum_{j=1}^{J} \frac{(r_{ij} - r_{i'j})^2}{p_{+j}}

This is the weighted squared Euclidean distance between row profiles, weighted by the inverse of the column marginals.

The chi-square distance is symmetric and non-negative, and it equals zero if and only if the two row profiles are identical. The weighting by $1/p_{+j}$ upweights differences in rare column categories, so frequent columns do not dominate the distance.
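As a minimal sketch of the formula above, the helper below computes the squared chi-square distance between two rows of a table. The function name `chi2_distance_sq` and the toy table are illustrative choices made here, not part of any library.

```python
import numpy as np

def chi2_distance_sq(N, i, k):
    """Squared chi-square distance between the profiles of rows i and k."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    col_mass = P.sum(axis=0)                        # p_{+j}
    profiles = P / P.sum(axis=1, keepdims=True)     # row profiles r_ij
    diff = profiles[i] - profiles[k]
    return float(np.sum(diff ** 2 / col_mass))

# Toy table: rows 0 and 1 are proportional, so their profiles coincide
toy = [[1, 2, 3],
       [2, 4, 6],
       [3, 1, 1]]
print(chi2_distance_sq(toy, 0, 1))   # proportional rows
print(chi2_distance_sq(toy, 0, 2))   # different profiles
```

Rows with proportional counts are at distance zero, because the distance compares profiles rather than raw counts.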

The Inertia Decomposition

Definition: Total Inertia

The total inertia (chi-square statistic divided by $n$ ) is:

\text{Inertia}_{\text{total}} = \frac{\chi^2}{n} = \frac{1}{n}\sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij} - n_{i+}n_{+j}/n)^2}{n_{i+}n_{+j}/n}

This measures the total departure from independence (no association) between rows and columns.
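The total inertia can be computed directly from the counts. The sketch below follows the formula above; `total_inertia` is a name chosen for this illustration.

```python
import numpy as np

def total_inertia(N):
    """Return (inertia, chi2) for a contingency table of counts."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    # Expected counts under independence: n_{i+} n_{+j} / n
    expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
    chi2 = ((N - expected) ** 2 / expected).sum()
    return chi2 / n, chi2

# A perfectly independent table (outer product of margins) has zero inertia
independent = np.outer([1, 2], [3, 4, 5])
# A 2 x 2 table with all mass on the diagonal has inertia 1
diagonal = [[10, 0], [0, 10]]

print(total_inertia(independent)[0])
print(total_inertia(diagonal)[0])
```

Because the inertia is $\chi^2 / n$, it does not grow with sample size, which makes it comparable across tables of different totals.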

Theorem: Inertia Decomposition

The total inertia decomposes as:

\text{Inertia}_{\text{total}} = \sum_{k=1}^{K} \lambda_k

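where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_K \geq 0$, with $K = \min(I, J) - 1$, are the nontrivial eigenvalues of the matrix $\mathbf{S}_r = \mathbf{D}_r^{-1/2}\mathbf{P}\mathbf{D}_c^{-1}\mathbf{P}^{\top}\mathbf{D}_r^{-1/2}$ (or equivalently $\mathbf{S}_c = \mathbf{D}_c^{-1/2}\mathbf{P}^{\top}\mathbf{D}_r^{-1}\mathbf{P}\mathbf{D}_c^{-1/2}$), where $\mathbf{D}_r = \text{diag}(p_{i+})$ and $\mathbf{D}_c = \text{diag}(p_{+j})$. Because $\mathbf{P}$ is not centered here, both matrices also have a trivial eigenvalue equal to 1, which is excluded from the sum; centering $\mathbf{P}$ by subtracting the independence model $p_{i+}p_{+j}$ removes it. Each $\lambda_k$ represents the inertia explained by the $k$-th dimension.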
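A quick numerical check of the decomposition, as a sketch: it builds $\mathbf{S}_r$ from the uncentered $\mathbf{P}$, separates the trivial eigenvalue 1 that the uncentered matrix carries, and compares the sum of the remaining eigenvalues with the total inertia. The table is the one used in the Python section of this article.

```python
import numpy as np

N = np.array([[4, 2, 3, 2, 3],
              [4, 3, 5, 5, 5],
              [25, 10, 4, 6, 5]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# S_r = D_r^{-1/2} P D_c^{-1} P^T D_r^{-1/2}  (symmetric, I x I)
S_r = np.diag(r ** -0.5) @ P @ np.diag(1.0 / c) @ P.T @ np.diag(r ** -0.5)
eig = np.sort(np.linalg.eigvalsh(S_r))[::-1]   # descending order

K = min(N.shape) - 1
trivial, lam = eig[0], eig[1:K + 1]

# Total inertia computed from the definition
E = np.outer(r, c)
inertia = ((P - E) ** 2 / E).sum()

print("trivial eigenvalue:", trivial)
print("sum of nontrivial eigenvalues:", lam.sum(), "inertia:", inertia)
```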

Principal Coordinates

Definition: Principal Coordinates

The principal coordinates of row $i$ on dimension $k$ are:

F_{ik} = \frac{1}{\sqrt{\lambda_k}} \sum_{j=1}^{J} \frac{p_{ij}}{p_{i+}} g_{jk}

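where $g_{jk}$ is the principal coordinate of column $j$ on dimension $k$. This is the transition formula: a row sits at the weighted average of the column coordinates, stretched by $1/\sqrt{\lambda_k}$. Equivalently, in matrix form:

\mathbf{F}_r = \mathbf{D}_r^{-1/2}\mathbf{V}\boldsymbol{\Lambda}^{1/2}

where $\mathbf{V}$ contains the unit-length eigenvectors of $\mathbf{S}_r$ for the $K$ nontrivial eigenvalues and $\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \dots, \lambda_K)$.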

The standard coordinates are obtained by dividing principal coordinates by the square root of the eigenvalue:

G_{ik} = \frac{F_{ik}}{\sqrt{\lambda_k}}

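Standard coordinates are used in asymmetric maps and to position supplementary points: a supplementary row profile is placed at the weighted average of the column standard coordinates, with the profile entries as weights.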
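The link between principal coordinates and the chi-square distance can be checked numerically. The sketch below assumes the usual SVD route (standardized residuals, then scaling by the singular values) and confirms that Euclidean distances between full-dimensional row principal coordinates reproduce the chi-square distances between row profiles. Variable names are illustrative.

```python
import numpy as np

N = np.array([[4, 2, 3, 2, 3],
              [4, 3, 5, 5, 5],
              [25, 10, 4, 6, 5]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# Standardized residuals and their SVD
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, _ = np.linalg.svd(S, full_matrices=False)
K = min(N.shape) - 1                  # number of nontrivial dimensions
U, sv = U[:, :K], sv[:K]              # sv**2 are the eigenvalues lambda_k

row_standard = U / np.sqrt(r)[:, None]
row_principal = row_standard * sv     # scale axis k by sqrt(lambda_k)

# Chi-square distance between rows 0 and 1, from the profiles
profiles = P / r[:, None]
d_chi2 = np.sum((profiles[0] - profiles[1]) ** 2 / c)
# Squared Euclidean distance between the same rows in the map
d_map = np.sum((row_principal[0] - row_principal[1]) ** 2)

print(d_chi2, d_map)
```

With all $K$ dimensions retained the two numbers agree; a two-dimensional map approximates the same distances when later eigenvalues are small.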

The Symmetric Map

Definition: Symmetric Map

In the symmetric map, rows and columns are plotted together, with the coordinates on dimension $k$ given by:

\mathbf{x}_{k,\text{sym}} = \sqrt{\lambda_k}\,\mathbf{D}_r^{-1/2}\mathbf{v}_k, \qquad \mathbf{y}_{k,\text{sym}} = \sqrt{\lambda_k}\,\mathbf{D}_c^{-1/2}\mathbf{u}_k

where $\mathbf{v}_k$ and $\mathbf{u}_k$ are the $k$-th eigenvectors of $\mathbf{S}_r$ and $\mathbf{S}_c$, so both sets are shown in principal coordinates. Distances between a row point and a column point cannot be interpreted directly; instead, distances between points of the same set approximate chi-square distances, and origin-to-point distances and the angles between points from the same set are meaningful.

In the symmetric map:

The squared distance from a point to the origin, multiplied by the point's mass, is its contribution to the total inertia (the average profile sits at the origin)
Points close to the origin are near the average profile
A row point and a column point lying in the same direction from the origin suggest that the row category is over-represented in that column category
Never read the distance between a row point and a column point as a chi-square distance; only within-set distances have that interpretation

Contributions and Cosines

Definition: Contribution of Point to Axis

The contribution of row $i$ to the $k$ -th axis is:

\text{ctr}_{ik} = \frac{p_{i+} F_{ik}^2}{\lambda_k}

This measures how much row $i$ participates in defining dimension $k$ .

Definition: Quality (Cos²)

The quality (squared cosine) of row $i$ on the first $m$ dimensions is:

\cos^2_{i(m)} = \frac{\sum_{k=1}^{m} F_{ik}^2}{\sum_{k=1}^{K} F_{ik}^2}

This measures the proportion of row $i$ 's total inertia that is captured by the $m$ -dimensional subspace.
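Both diagnostics follow directly from the principal coordinates. The sketch below (the helper name `row_diagnostics` is chosen for this illustration) computes them for the rows and shows the two normalizations: contributions sum to 1 over the rows for each axis, and squared cosines sum to 1 over the axes for each row.

```python
import numpy as np

def row_diagnostics(N):
    """Return (contributions, cos2) for the rows of a contingency table."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    K = min(N.shape) - 1                      # drop the null dimension
    U, sv = U[:, :K], sv[:K]
    F = U * sv / np.sqrt(r)[:, None]          # row principal coordinates
    lam = sv ** 2
    ctr = r[:, None] * F ** 2 / lam           # each column sums to 1
    cos2 = F ** 2 / (F ** 2).sum(axis=1, keepdims=True)  # each row sums to 1
    return ctr, cos2

N = np.array([[4, 2, 3, 2, 3],
              [4, 3, 5, 5, 5],
              [25, 10, 4, 6, 5]])
ctr, cos2 = row_diagnostics(N)
print("contributions:\n", np.round(ctr, 3))
print("quality on dimension 1:", np.round(cos2[:, 0], 3))
```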

Multiple Correspondence Analysis

The Indicator Matrix

For categorical variables $A_1, \dots, A_J$ with $K_j$ levels each, define the indicator matrix $\mathbf{Z}$ of dimension $n \times K$ where $K = \sum_j K_j$ :

Definition: Indicator Matrix

Z_{ic} = \begin{cases} 1 & \text{if individual } i \text{ falls in category } c \\ 0 & \text{otherwise} \end{cases}

Here $c = 1, \dots, K$ runs over the levels of all variables stacked side by side. Each row of $\mathbf{Z}$ has exactly $J$ ones (one per variable).

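MCA can be computed as CA of the indicator matrix $\mathbf{Z}$ or, equivalently up to a rescaling of the axes, as CA of the Burt matrix $\mathbf{B} = \mathbf{Z}^{\top}\mathbf{Z}$:

\mathbf{B} = \begin{pmatrix} \mathbf{N}_1^{\top}\mathbf{N}_1 & \cdots & \mathbf{N}_1^{\top}\mathbf{N}_J \\ \vdots & \ddots & \vdots \\ \mathbf{N}_J^{\top}\mathbf{N}_1 & \cdots & \mathbf{N}_J^{\top}\mathbf{N}_J \end{pmatrix}

where $\mathbf{N}_j$ is the $n \times K_j$ indicator matrix for variable $j$, so that $\mathbf{Z} = [\mathbf{N}_1, \dots, \mathbf{N}_J]$. Each diagonal block is a diagonal matrix of category counts, and each off-diagonal block $\mathbf{N}_j^{\top}\mathbf{N}_{j'}$ is the two-way contingency table of variables $j$ and $j'$. The eigenvalues from CA of $\mathbf{B}$ are the squares of those from CA of $\mathbf{Z}$.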
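To make the indicator and Burt matrices concrete, here is a small sketch on a made-up six-person data set (the data and the helper `ca_eigenvalues` are purely illustrative). It builds $\mathbf{Z}$ with `pandas.get_dummies`, forms $\mathbf{B} = \mathbf{Z}^{\top}\mathbf{Z}$, and checks that the CA eigenvalues of $\mathbf{B}$ are the squares of those of $\mathbf{Z}$.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: J = 2 variables, K = 3 + 2 = 5 categories
toy = pd.DataFrame({
    "Education": ["HighSchool", "Bachelor", "Bachelor", "PhD", "HighSchool", "PhD"],
    "Region":    ["North", "South", "South", "South", "North", "North"],
})
J = toy.shape[1]

Z = pd.get_dummies(toy).to_numpy(dtype=float)   # n x K indicator matrix
B = Z.T @ Z                                     # K x K Burt matrix

def ca_eigenvalues(M):
    """Squared singular values of the standardized residuals of M."""
    M = np.asarray(M, dtype=float)
    P = M / M.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2

n_dims = Z.shape[1] - J                 # K - J nontrivial dimensions
eig_Z = ca_eigenvalues(Z)[:n_dims]
eig_B = ca_eigenvalues(B)[:n_dims]

print("ones per row:", Z.sum(axis=1))
print("indicator eigenvalues:", np.round(eig_Z, 4))
print("Burt eigenvalues:     ", np.round(eig_B, 4))
```

The off-diagonal block `B[:3, 3:]` is the Education by Region cross-tabulation, and the diagonal of `B` holds the category counts.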

Adjusted Inertia (Greenacre Correction)

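The total inertia in MCA is inflated by the diagonal blocks of the Burt matrix, so the raw eigenvalues understate how much each dimension explains. Greenacre (1993) proposed the correction:

Definition: Adjusted Eigenvalues

\tilde{\lambda}_k = \left(\frac{J}{J-1}\right)^2 \left(\hat{\lambda}_k - \frac{1}{J}\right)^2

where $\hat{\lambda}_k$ is the $k$-th eigenvalue from CA of the indicator matrix, i.e. the square root of the corresponding Burt-matrix eigenvalue. Only dimensions with $\hat{\lambda}_k > 1/J$ are adjusted and retained. The adjusted eigenvalues give a more realistic picture of how much inertia each dimension explains.

The total inertia of the indicator matrix is $(K - J)/J$, which depends only on the number of variables and levels, not on the data; the total inertia of the Burt matrix is the sum of the squared indicator eigenvalues.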
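The adjustment is a one-line transformation once the eigenvalues are available. The sketch below assumes its input is the set of eigenvalues from CA of the indicator matrix (equivalently, square roots of the Burt-matrix eigenvalues) and keeps only those above $1/J$. The function name and the example values are illustrative.

```python
import numpy as np

def adjusted_eigenvalues(indicator_eigs, J):
    """Adjusted MCA eigenvalues for J variables.

    indicator_eigs: eigenvalues from CA of the indicator matrix
    (square roots of the Burt-matrix eigenvalues).
    """
    lam = np.asarray(indicator_eigs, dtype=float)
    keep = lam > 1.0 / J                 # dimensions at or below 1/J are dropped
    return (J / (J - 1)) ** 2 * (lam[keep] - 1.0 / J) ** 2

# Hypothetical indicator eigenvalues for J = 2 variables
adj = adjusted_eigenvalues([0.75, 0.5, 0.25], J=2)
print(adj)
```

For $J = 2$ the adjusted values coincide with the eigenvalues of simple CA on the two-way table, which is a useful sanity check on the formula.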

Factor Scores

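\mathbf{F} = \frac{1}{J}\mathbf{Z}\mathbf{G}\boldsymbol{\Lambda}^{-1/2}

where $\mathbf{G}$ holds the principal coordinates of the $K$ categories and $\boldsymbol{\Lambda}$ is the diagonal matrix of indicator-matrix eigenvalues. Each individual is thus placed at the average of the standard coordinates of the $J$ categories it belongs to.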

Connection to Chi-Square Test

Theorem: CA and Chi-Square Independence

The chi-square test of independence for the contingency table $\mathbf{N}$ has:

\chi^2 = n \cdot \text{Inertia}_{\text{total}} = n \sum_{k=1}^{K} \lambda_k

Under $H_0$ (independence), $\chi^2$ is asymptotically distributed as $\chi^2_{(I-1)(J-1)}$. CA decomposes this statistic along orthogonal axes, with axis $k$ contributing $n\lambda_k$ to $\chi^2$ (equivalently, $\lambda_k$ to the total inertia).

Python Implementation

import numpy as np
from prince import CA, MCA
import pandas as pd

# --- Simple Correspondence Analysis ---
# Create a contingency table (e.g., smoking by profession)
data = np.array([
    [4, 2, 3, 2, 3],  # Doctors
    [4, 3, 5, 5, 5],  # Lawyers
    [25, 10, 4, 6, 5], # Engineers
])
row_labels = ["Doctors", "Lawyers", "Engineers"]
col_labels = ["None", "Light", "Medium", "Heavy", "Very Heavy"]
df = pd.DataFrame(data, index=row_labels, columns=col_labels)

# --- Manual CA computation ---
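def correspondence_analysis(N):
    N = np.asarray(N, dtype=float)
    I, J = N.shape
    n = N.sum()

    P = N / n
    r = P.sum(axis=1)   # row masses p_{i+}
    c = P.sum(axis=0)   # column masses p_{+j}

    # Standardized residuals
    E = np.outer(r, c)  # expected under independence
    S = (P - E) / np.sqrt(E)

    # SVD of the residuals; squared singular values are the eigenvalues
    U, D, Vt = np.linalg.svd(S, full_matrices=False)
    # Keep only the min(I, J) - 1 nontrivial dimensions: the last singular
    # value is numerically zero and would break the division by lam below
    K = min(I, J) - 1
    U, D, Vt = U[:, :K], D[:K], Vt[:K, :]
    lam = D**2

    # Principal coordinates
    F_row = np.diag(1.0 / np.sqrt(r)) @ U @ np.diag(np.sqrt(lam))
    F_col = np.diag(1.0 / np.sqrt(c)) @ Vt.T @ np.diag(np.sqrt(lam))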

    # Contributions
    ctr_row = (r[:, None] * F_row**2) / lam[None, :]
    ctr_col = (c[:, None] * F_col**2) / lam[None, :]

    # Cosines (quality)
    cos2_row = F_row**2 / (F_row**2).sum(axis=1, keepdims=True)
    cos2_col = F_col**2 / (F_col**2).sum(axis=1, keepdims=True)

    # Total inertia
    total_inertia = lam.sum()
    chi2 = n * total_inertia

    return {
        'eigenvalues': lam,
        'total_inertia': total_inertia,
        'chi2': chi2,
        'row_coords': F_row,
        'col_coords': F_col,
        'row_contrib': ctr_row,
        'col_contrib': ctr_col,
        'row_cos2': cos2_row,
        'col_cos2': cos2_col,
    }

result = correspondence_analysis(data)
print("Eigenvalues:", result['eigenvalues'])
print("Total inertia:", result['total_inertia'])
print("Chi-square:", result['chi2'])
print("Row principal coords:\n", np.round(result['row_coords'], 4))
print("Column principal coords:\n", np.round(result['col_coords'], 4))
print("Row contributions:\n", np.round(result['row_contrib'], 4))

# --- Using prince library ---
ca = CA(n_components=2, n_iter=10, random_state=42)
ca.fit(df)

print("\n--- prince CA ---")
print("Eigenvalues:", ca.eigenvalues_)
print("Row coordinates:\n", ca.row_coordinates(df))
print("Column coordinates:\n", ca.column_coordinates(df))

# --- Multiple Correspondence Analysis ---
np.random.seed(42)
n = 200
mca_data = pd.DataFrame({
    'Education': np.random.choice(['HighSchool', 'Bachelor', 'Master', 'PhD'], n),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'Occupation': np.random.choice(['Engineer', 'Teacher', 'Doctor', 'Artist'], n),
})

mca = MCA(n_components=2, n_iter=10, random_state=42)
mca.fit(mca_data)
print("\n--- MCA ---")
print("Eigenvalues:", mca.eigenvalues_)  # raw values; no adjustment was requested when constructing MCA
print("Row coordinates shape:", mca.row_coordinates(mca_data).shape)

# --- Symmetric map coordinates ---
def symmetric_map(N):
    N = np.asarray(N, dtype=float)
    I, J = N.shape
    P = N / N.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)

    # Standardized residuals
    S = np.diag(1.0 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1.0 / np.sqrt(c))

    # SVD, keeping the min(I, J) - 1 nontrivial dimensions
    U, D, Vt = np.linalg.svd(S, full_matrices=False)
    K = min(I, J) - 1
    U, D, Vt = U[:, :K], D[:K], Vt[:K, :]
    lam = D**2

    # Symmetric map: rows and columns are both in principal coordinates
    F_sym = np.diag(1.0 / np.sqrt(r)) @ U @ np.diag(D)
    G_sym = np.diag(1.0 / np.sqrt(c)) @ Vt.T @ np.diag(D)

    return F_sym, G_sym, lam

F_sym, G_sym, lam = symmetric_map(data)
print("\nSymmetric row coords:\n", np.round(F_sym, 4))
print("Symmetric col coords:\n", np.round(G_sym, 4))

Interpretation Rules

Interpreting Correspondence Analysis:

Number of dimensions: Examine the eigenvalue scree plot. The total number of dimensions is $\min(I, J) - 1$. The first two often capture most of the inertia, but check the cumulative percentage before relying on a two-dimensional map.
Inertia explained: Each eigenvalue $\lambda_k$ is the inertia on dimension $k$ . Report the cumulative percentage of total inertia.
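Proximity: In the symmetric map, a row point and a column point lying in the same direction from the origin indicate association. Specifically, row $i$ is positively associated with column $j$ if $p_{ij} > p_{i+}p_{+j}$, that is, the observed proportion exceeds its value under independence (in counts, $n_{ij} > n_{i+}n_{+j}/n$).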
Origin: Points near the origin have profiles close to the average — they are not distinctive.
Contribution: A contribution above the average ( $\text{ctr}_{ik} > 1/I$ for rows, $> 1/J$ for columns) indicates that the point helps define the axis. Check both row and column contributions.
Quality ( $\cos^2$ ): Low quality means the point is poorly represented in the low-dimensional display — its configuration may be misleading.
Interpretation flow: First identify the axes (via contributions and profiles), then interpret the spatial configuration (proximity, origin distance), then assess reliability (quality, inertia).
Avoid: Interpreting distances between row and column points in the symmetric map. Use the symmetric map only for relative positioning.
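Two of the rules above lend themselves to a quick numerical check: the share of inertia per dimension, and whether a cell lies above independence. The sketch below is one way to compute both for the smoking-by-profession table from the Python section; the ratio $p_{ij} / (p_{i+}p_{+j})$ used here is a convenience introduced for this illustration.

```python
import numpy as np

N = np.array([[4, 2, 3, 2, 3],       # Doctors
              [4, 3, 5, 5, 5],       # Lawyers
              [25, 10, 4, 6, 5]],    # Engineers
             dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
E = np.outer(r, c)                   # proportions expected under independence

# Inertia per dimension and cumulative percentage
S = (P - E) / np.sqrt(E)
sv = np.linalg.svd(S, compute_uv=False)[: min(N.shape) - 1]
lam = sv ** 2
pct = 100 * lam / lam.sum()
cum_pct = np.cumsum(pct)

# Association ratio: values above 1 mean the cell is over-represented
ratio = P / E
over = ratio > 1

print("percent of inertia:", np.round(pct, 1))
print("cumulative percent:", np.round(cum_pct, 1))
print("over-represented cells:\n", over)
```

Reading the table of ratios alongside the map helps confirm that an apparent row-column association in the plot corresponds to a cell that really exceeds its expected value.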

Extensions

Canonical Correspondence Analysis (CCA) combines CA with constrained ordination, relating community composition to environmental variables. Non-symmetric correspondence analysis treats one variable as the response and the other as the predictor; it decomposes the numerator of the Goodman-Kruskal $\tau$ index rather than the $\chi^2$ statistic. Joint correspondence analysis fits only the off-diagonal blocks of the Burt matrix, which avoids the inflation caused by the diagonal blocks without the algebraic correction.

Correspondence Analysis