
Correspondence Analysis


Introduction

Revealing Structure in Contingency Tables

Correspondence analysis decomposes contingency tables into principal coordinates, visualizing associations between row and column categories in a low-dimensional map. Chi-square distance drives the geometry.

  • Market research — Map relationships between product attributes and consumer preferences
  • Linguistics — Visualize word associations across different text corpora
  • Sociology — Explore associations between demographic categories and survey responses

CA turns cross-tabulated counts into revealing geometric maps of association.


Correspondence Analysis (CA) is a dimension reduction technique for categorical data contained in a contingency table. Developed primarily by Jean-Paul Benzécri (1973) and popularized by Greenacre (1984), CA decomposes the chi-square statistic into orthogonal components, yielding a low-dimensional geometric representation of rows and columns.

Unlike PCA, which operates on continuous data with Euclidean distance, CA uses the chi-square distance — a metric naturally suited to frequency data where the magnitude of counts varies across categories.

Simple Correspondence Analysis

The Contingency Table

Let $\mathbf{N}$ be an $I \times J$ contingency table with entries $n_{ij} \geq 0$, row marginals $n_{i+} = \sum_j n_{ij}$, column marginals $n_{+j} = \sum_i n_{ij}$, and grand total $n = \sum_{ij} n_{ij}$.

Definition: Correspondence Matrix

The correspondence matrix (relative frequency matrix) is:

$$\mathbf{P} = \frac{1}{n}\mathbf{N} = [p_{ij}]$$

where $p_{ij} = n_{ij}/n$ is the cell proportion. Row and column profiles are defined as:

$$\mathbf{r}_i = \frac{1}{p_{i+}}\mathbf{p}_i^{\top} \quad (\text{row profile}), \qquad \mathbf{c}_j = \frac{1}{p_{+j}}\mathbf{p}_j \quad (\text{column profile})$$

where $\mathbf{p}_i$ is the $i$-th row of $\mathbf{P}$, $\mathbf{p}_j$ is its $j$-th column, $p_{i+} = \sum_j p_{ij}$, and $p_{+j} = \sum_i p_{ij}$. Each profile sums to 1.
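As a minimal sketch of these definitions, the snippet below builds the correspondence matrix and both sets of profiles with NumPy. The count table is the small smoking-by-profession example used in this lesson's Python section; the variable names (`row_mass`, `row_profiles`, and so on) are illustrative choices, not part of any library.

```python
import numpy as np

# Example count table (rows: professions, columns: smoking levels)
N = np.array([[4, 2, 3, 2, 3],
              [4, 3, 5, 5, 5],
              [25, 10, 4, 6, 5]], dtype=float)

n = N.sum()
P = N / n                      # correspondence matrix, entries p_ij
row_mass = P.sum(axis=1)       # p_{i+}
col_mass = P.sum(axis=0)       # p_{+j}

# Profiles: each row (resp. column) of P rescaled to sum to 1
row_profiles = P / row_mass[:, None]
col_profiles = P / col_mass[None, :]

print(np.round(row_profiles, 3))
print(np.round(col_profiles, 3))
```

Working with profiles rather than raw counts removes the effect of category size, which is why all later distances are computed between profiles.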

Chi-Square Distance

Definition: Chi-Square Distance

The chi-square distance between rows $i$ and $i'$ is:

$$d_{\chi^2}^2(i, i') = \sum_{j=1}^{J} \frac{1}{p_{+j}} \left(\frac{p_{ij}}{p_{i+}} - \frac{p_{i'j}}{p_{i'+}}\right)^2 = \sum_{j=1}^{J} \frac{(r_{ij} - r_{i'j})^2}{p_{+j}}$$

This is the weighted squared Euclidean distance between row profiles, weighted by the inverse of the column marginals.

The chi-square distance is symmetric and non-negative, and it equals zero if and only if the two row profiles are identical. The weighting by $1/p_{+j}$ ensures that categories with small marginals are upweighted.
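The distance formula can be sketched directly in NumPy. The helper name `chi2_distance_sq` and the small table `M` are hypothetical, introduced only for illustration; the function returns the squared distance between two row profiles of a count table.

```python
import numpy as np

def chi2_distance_sq(N, i, k):
    """Squared chi-square distance between rows i and k of a count table N."""
    P = N / N.sum()
    r = P.sum(axis=1)              # row masses p_{i+}
    c = P.sum(axis=0)              # column masses p_{+j}
    profiles = P / r[:, None]      # row profiles
    diff = profiles[i] - profiles[k]
    return float(np.sum(diff**2 / c))

# Rows 0 and 1 are proportional, so their profiles coincide
M = np.array([[1, 2, 3],
              [2, 4, 6],
              [3, 1, 1]], dtype=float)

d01 = chi2_distance_sq(M, 0, 1)
d02 = chi2_distance_sq(M, 0, 2)
d20 = chi2_distance_sq(M, 2, 0)
print(d01, d02, d20)
```

Rows 0 and 1 of `M` differ in total count but not in shape, so their distance is zero; this illustrates that the metric compares profiles, not magnitudes.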

The Inertia Decomposition

Definition: Total Inertia

The total inertia (chi-square statistic divided by $n$) is:

$$\text{Inertia}_{\text{total}} = \frac{\chi^2}{n} = \frac{1}{n}\sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij} - n_{i+}n_{+j}/n)^2}{n_{i+}n_{+j}/n}$$

This measures the total departure from independence (no association) between rows and columns.
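A short numerical check, assuming the same example table as in the Python section of this lesson: the total inertia can be computed from the counts, from the proportions, or as the mass-weighted sum of squared chi-square distances from each row profile to the average profile. The three variable names are illustrative.

```python
import numpy as np

N = np.array([[4, 2, 3, 2, 3],
              [4, 3, 5, 5, 5],
              [25, 10, 4, 6, 5]], dtype=float)
n = N.sum()

# 1) From counts: Pearson chi-square divided by n
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = ((N - expected) ** 2 / expected).sum()
inertia_from_chi2 = chi2 / n

# 2) From proportions
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)
inertia_from_P = ((P - np.outer(r, c)) ** 2 / np.outer(r, c)).sum()

# 3) As weighted spread of row profiles around the average profile c
row_profiles = P / r[:, None]
d2_to_centroid = ((row_profiles - c) ** 2 / c).sum(axis=1)
inertia_from_distances = (r * d2_to_centroid).sum()

print(inertia_from_chi2, inertia_from_P, inertia_from_distances)
```

The third form is the one that motivates the word "inertia": it is a weighted variance of the profile cloud around its centroid.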

Theorem: Inertia Decomposition

The total inertia decomposes as:

$$\text{Inertia}_{\text{total}} = \sum_{k=1}^{K} \lambda_k$$

where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_K \geq 0$, with $K = \min(I, J) - 1$, are the nontrivial eigenvalues of the matrix $\mathbf{S}_r = \mathbf{D}_r^{-1/2}\mathbf{P}\mathbf{D}_c^{-1}\mathbf{P}^{\top}\mathbf{D}_r^{-1/2}$ (or equivalently $\mathbf{S}_c = \mathbf{D}_c^{-1/2}\mathbf{P}^{\top}\mathbf{D}_r^{-1}\mathbf{P}\mathbf{D}_c^{-1/2}$), where $\mathbf{D}_r = \text{diag}(p_{i+})$ and $\mathbf{D}_c = \text{diag}(p_{+j})$. Because $\mathbf{P}$ is not centered here, both matrices also have a trivial eigenvalue equal to 1, which is discarded; replacing $\mathbf{P}$ by $\mathbf{P} - \mathbf{D}_r\mathbf{1}\mathbf{1}^{\top}\mathbf{D}_c$ removes it. Each $\lambda_k$ represents the inertia explained by the $k$-th dimension.
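The decomposition can be checked numerically. The sketch below forms $\mathbf{S}_r$ from the uncentered $\mathbf{P}$ exactly as written in the theorem, so its spectrum includes one trivial eigenvalue equal to 1 (with eigenvector $\sqrt{p_{i+}}$); the remaining eigenvalues sum to the total inertia. The example table and variable names are illustrative assumptions.

```python
import numpy as np

N = np.array([[4, 2, 3, 2, 3],
              [4, 3, 5, 5, 5],
              [25, 10, 4, 6, 5]], dtype=float)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# S_r = D_r^{-1/2} P D_c^{-1} P^T D_r^{-1/2}, built with broadcasting
A = P / np.sqrt(np.outer(r, c))      # D_r^{-1/2} P D_c^{-1/2}
S_r = A @ A.T

eigvals = np.linalg.eigvalsh(S_r)    # ascending order
trivial = eigvals[-1]                # expected to be 1
nontrivial = eigvals[:-1]            # the lambda_k of the theorem

total_inertia = ((P - np.outer(r, c)) ** 2 / np.outer(r, c)).sum()
print("trivial eigenvalue:", trivial)
print("sum of nontrivial eigenvalues:", nontrivial.sum())
print("total inertia:", total_inertia)
```

In practice most implementations center the matrix first and take an SVD, which avoids the trivial eigenvalue altogether.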

Principal Coordinates

Definition: Principal Coordinates

The principal coordinates of row $i$ on dimension $k$ are:

$$F_{ik} = \frac{1}{\sqrt{\lambda_k}} \sum_{j=1}^{J} \frac{p_{ij}}{p_{i+}} g_{jk}$$

where $g_{jk}$ is the principal coordinate of column $j$ on dimension $k$. This is the transition formula: each row sits at the profile-weighted average of the column points, rescaled by $1/\sqrt{\lambda_k}$. Equivalently, in matrix form:

$$\mathbf{F}_r = \mathbf{D}_r^{-1/2}\mathbf{V}\boldsymbol{\Lambda}^{1/2}$$

where $\mathbf{V}$ contains the eigenvectors of $\mathbf{S}_r$ associated with the nontrivial eigenvalues and $\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \dots, \lambda_K)$.

The standard coordinates are obtained by dividing principal coordinates by the square root of the eigenvalue:

$$G_{ik} = \frac{F_{ik}}{\sqrt{\lambda_k}}$$

Standard coordinates of one set serve as the reference axes for the other set: they are used in asymmetric maps and to project supplementary rows or columns.
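The relationship between principal coordinates, standard coordinates, and the transition formula can be verified with a small sketch. It assumes that $g_{jk}$ in the formula above denotes the principal coordinate of column $j$, and it uses an SVD of the centered, standardized matrix rather than an explicit eigendecomposition; the example table and all variable names are illustrative.

```python
import numpy as np

N = np.array([[4, 2, 3, 2, 3],
              [4, 3, 5, 5, 5],
              [25, 10, 4, 6, 5]], dtype=float)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# Standardized residuals and their SVD (nontrivial dimensions only)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, d, Vt = np.linalg.svd(S, full_matrices=False)
K = min(N.shape) - 1
U, d, Vt = U[:, :K], d[:K], Vt[:K, :]
lam = d ** 2

F = (U / np.sqrt(r)[:, None]) * d       # row principal coordinates
G = (Vt.T / np.sqrt(c)[:, None]) * d    # column principal coordinates

# Transition formula: rows as rescaled weighted averages of column points
row_profiles = P / r[:, None]
F_from_columns = (row_profiles @ G) / d

# Standard coordinates: principal coordinates divided by sqrt(lambda_k)
F_std = F / d

print(np.allclose(F, F_from_columns))
print((r[:, None] * F ** 2).sum(axis=0), lam)
```

The last line shows the scaling that distinguishes the two coordinate systems: the mass-weighted variance of the principal coordinates on axis $k$ is $\lambda_k$, while for standard coordinates it is 1.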

The Symmetric Map

Definition: Symmetric Map

In the symmetric map, rows and columns are plotted together, both in principal coordinates. On dimension $k$:

$$\mathbf{x}_{k,\text{sym}} = \sqrt{\lambda_k}\,\mathbf{D}_r^{-1/2}\mathbf{v}_k, \qquad \mathbf{y}_{k,\text{sym}} = \sqrt{\lambda_k}\,\mathbf{D}_c^{-1/2}\mathbf{u}_k$$

where $\mathbf{v}_k$ and $\mathbf{u}_k$ are the $k$-th eigenvectors of $\mathbf{S}_r$ and $\mathbf{S}_c$, so the $i$-th element of $\mathbf{x}_{k,\text{sym}}$ is the coordinate of row $i$ on dimension $k$. Distances between a row point and a column point cannot be interpreted directly; distances within one set approximate chi-square distances, and origin-to-point distances and the angles between points are meaningful.

In the symmetric map:

  • The squared distance from a point to the origin, weighted by the point's mass, is its contribution to the total inertia (the weighted centroid of each set lies at the origin)
  • Points close to the origin are near the average profile
  • A row point and a column point lying far from the origin in the same direction suggest that the row category is over-represented in that column category
  • Never read the distance between a row point and a column point as a measure of their association

Contributions and Cosines

Definition: Contribution of Point to Axis

The contribution of row $i$ to the $k$-th axis is:

$$\text{ctr}_{ik} = \frac{p_{i+} F_{ik}^2}{\lambda_k}$$

This measures how much row $i$ participates in defining dimension $k$.

Definition: Quality (Cos²)

The quality (squared cosine) of row $i$ on the first $m$ dimensions is:

$$\cos^2_{i(m)} = \frac{\sum_{k=1}^{m} F_{ik}^2}{\sum_{k=1}^{K} F_{ik}^2}$$

This measures the proportion of row $i$'s total inertia that is captured by the $m$-dimensional subspace.

Multiple Correspondence Analysis

The Indicator Matrix

For categorical variables $A_1, \dots, A_J$ with $K_j$ levels each, define the indicator matrix $\mathbf{Z}$ of dimension $n \times K$ where $K = \sum_j K_j$:

Definition: Indicator Matrix

$$Z_{i\ell} = \begin{cases} 1 & \text{if individual } i \text{ falls in category } \ell \\ 0 & \text{otherwise} \end{cases}$$

where $\ell = 1, \dots, K$ runs over the levels of all variables. Each row of $\mathbf{Z}$ has exactly $J$ ones (one per variable).

MCA is CA applied to the indicator matrix $\mathbf{Z}$. CA of the Burt matrix $\mathbf{B} = \mathbf{Z}^{\top}\mathbf{Z}$ gives the same category standard coordinates, with eigenvalues equal to the squares of those obtained from $\mathbf{Z}$:

$$\mathbf{B} = \begin{pmatrix} \mathbf{N}_1^{\top}\mathbf{N}_1 & \cdots & \mathbf{N}_1^{\top}\mathbf{N}_J \\ \vdots & \ddots & \vdots \\ \mathbf{N}_J^{\top}\mathbf{N}_1 & \cdots & \mathbf{N}_J^{\top}\mathbf{N}_J \end{pmatrix}$$

where $\mathbf{N}_j$ is the $n \times K_j$ indicator matrix for variable $j$. Each diagonal block $\mathbf{N}_j^{\top}\mathbf{N}_j$ is a diagonal matrix of category counts, and each off-diagonal block $\mathbf{N}_j^{\top}\mathbf{N}_{j'}$ is the two-way contingency table of variables $j$ and $j'$.
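To make the block structure concrete, here is a minimal NumPy sketch that stacks one-hot blocks into an indicator matrix and forms the Burt matrix as $\mathbf{Z}^{\top}\mathbf{Z}$. The helper `indicator_matrix` and the toy data (six individuals, two variables) are hypothetical and serve only as an illustration.

```python
import numpy as np

def indicator_matrix(columns):
    """Stack one-hot blocks side by side, one block per categorical variable."""
    blocks = []
    for values in columns:
        levels, codes = np.unique(values, return_inverse=True)
        blocks.append(np.eye(len(levels))[codes])   # n x K_j one-hot block
    return np.hstack(blocks)

# Toy data: J = 2 variables observed on n = 6 individuals
education = ["HS", "BA", "BA", "MA", "HS", "MA"]   # levels sorted: BA, HS, MA
region = ["N", "S", "N", "S", "S", "N"]            # levels sorted: N, S

Z = indicator_matrix([education, region])   # shape (6, 5)
B = Z.T @ Z                                  # Burt matrix, shape (5, 5)

print(Z.sum(axis=1))   # every row has J = 2 ones
print(B)
```

The top-left and bottom-right blocks of `B` are diagonal (category counts), while the off-diagonal block is the Education by Region cross-tabulation.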

Adjusted Inertia (Greenacre Correction)

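The eigenvalues from MCA of the Burt matrix are inflated by its diagonal blocks, which cross each variable with itself. Greenacre (1993) proposed the correction:

Definition: Adjusted Eigenvalues

$$\tilde{\lambda}_k = \left(\frac{J}{J-1}\right)^2 \left(\sqrt{\hat{\lambda}_k} - \frac{1}{J}\right)^2 \quad \text{for } \sqrt{\hat{\lambda}_k} > \frac{1}{J}$$

where $\hat{\lambda}_k$ is the raw eigenvalue from the Burt matrix, whose square root is the corresponding eigenvalue of the indicator matrix. Dimensions with $\sqrt{\hat{\lambda}_k} \leq 1/J$ are dropped. The adjusted eigenvalues provide a more accurate decomposition of the total inertia.

The total inertia in MCA equals $K/J - 1$ for the indicator matrix; for the Burt matrix it is the sum of the squared indicator-matrix eigenvalues.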
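A small sketch of the adjustment, under the assumption that the inputs are raw Burt-matrix eigenvalues (so their square roots are the indicator-matrix eigenvalues) and that dimensions at or below the $1/J$ threshold are discarded. The function name `adjusted_inertias` and the input values are made up for illustration.

```python
import numpy as np

def adjusted_inertias(burt_eigenvalues, n_variables):
    """Adjust raw Burt eigenvalues; keep only dimensions with sqrt(eig) > 1/J."""
    J = n_variables
    s = np.sqrt(np.asarray(burt_eigenvalues, dtype=float))
    s = s[s > 1.0 / J]                       # drop dimensions below the threshold
    return (J / (J - 1)) ** 2 * (s - 1.0 / J) ** 2

# Hypothetical raw Burt eigenvalues for an analysis with J = 3 variables
raw = [0.25, 0.16, 0.04]
adj = adjusted_inertias(raw, n_variables=3)
print(adj)
```

With these inputs the third dimension is dropped because $\sqrt{0.04} = 0.2$ does not exceed $1/3$.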

Factor Scores

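$$\mathbf{F} = \frac{1}{J}\,\mathbf{Z}\,\mathbf{G}\,\boldsymbol{\Lambda}^{-1/2}$$

where $\mathbf{G}$ holds the principal coordinates of the $K$ categories and $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues of the indicator matrix. Each individual therefore sits at the average of the $J$ categories it selects, rescaled by $1/\sqrt{\lambda_k}$ on dimension $k$.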

Connection to Chi-Square Test

Theorem: CA and Chi-Square Independence

The chi-square test of independence for the contingency table $\mathbf{N}$ has:

$$\chi^2 = n \cdot \text{Inertia}_{\text{total}} = n \sum_{k=1}^{K} \lambda_k$$

Under $H_0$ (independence), $\chi^2$ follows approximately a $\chi^2_{(I-1)(J-1)}$ distribution for large $n$. CA provides a geometric decomposition of this chi-square into orthogonal axes, with axis $k$ contributing $n\lambda_k$ to the statistic.

Python Implementation

import numpy as np
from prince import CA, MCA
import pandas as pd

# --- Simple Correspondence Analysis ---
# Create a contingency table (e.g., smoking by profession)
data = np.array([
    [4, 2, 3, 2, 3],  # Doctors
    [4, 3, 5, 5, 5],  # Lawyers
    [25, 10, 4, 6, 5], # Engineers
])
row_labels = ["Doctors", "Lawyers", "Engineers"]
col_labels = ["None", "Light", "Medium", "Heavy", "Very Heavy"]
df = pd.DataFrame(data, index=row_labels, columns=col_labels)

# --- Manual CA computation ---
def correspondence_analysis(N):
    I, J = N.shape
    n = N.sum()

    P = N / n
    r = P.sum(axis=1)   # row marginals
    c = P.sum(axis=0)   # column marginals

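    # Standardized residuals
    E = np.outer(r, c)  # expected proportions under independence
    S = (P - E) / np.sqrt(E)

    # Singular value decomposition; keep only the K = min(I, J) - 1
    # nontrivial dimensions (the last singular value is numerically zero
    # and would cause a division by zero in the contributions below)
    K = min(I, J) - 1
    U, D, Vt = np.linalg.svd(S, full_matrices=False)
    U, D, Vt = U[:, :K], D[:K], Vt[:K, :]
    lam = D**2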

    # Principal coordinates
    F_row = np.diag(1.0 / np.sqrt(r)) @ U @ np.diag(np.sqrt(lam))
    F_col = np.diag(1.0 / np.sqrt(c)) @ Vt.T @ np.diag(np.sqrt(lam))

    # Contributions
    ctr_row = (r[:, None] * F_row**2) / lam[None, :]
    ctr_col = (c[:, None] * F_col**2) / lam[None, :]

    # Cosines (quality)
    cos2_row = F_row**2 / (F_row**2).sum(axis=1, keepdims=True)
    cos2_col = F_col**2 / (F_col**2).sum(axis=1, keepdims=True)

    # Total inertia
    total_inertia = lam.sum()
    chi2 = n * total_inertia

    return {
        'eigenvalues': lam,
        'total_inertia': total_inertia,
        'chi2': chi2,
        'row_coords': F_row,
        'col_coords': F_col,
        'row_contrib': ctr_row,
        'col_contrib': ctr_col,
        'row_cos2': cos2_row,
        'col_cos2': cos2_col,
    }

result = correspondence_analysis(data)
print("Eigenvalues:", result['eigenvalues'])
print("Total inertia:", result['total_inertia'])
print("Chi-square:", result['chi2'])
print("Row principal coords:\n", np.round(result['row_coords'], 4))
print("Column principal coords:\n", np.round(result['col_coords'], 4))
print("Row contributions:\n", np.round(result['row_contrib'], 4))

# --- Using prince library ---
ca = CA(n_components=2, n_iter=10, random_state=42)
ca.fit(df)

print("\n--- prince CA ---")
print("Eigenvalues:", ca.eigenvalues_)
print("Row coordinates:\n", ca.row_coordinates(df))
print("Column coordinates:\n", ca.column_coordinates(df))

# --- Multiple Correspondence Analysis ---
np.random.seed(42)
n = 200
mca_data = pd.DataFrame({
    'Education': np.random.choice(['HighSchool', 'Bachelor', 'Master', 'PhD'], n),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'Occupation': np.random.choice(['Engineer', 'Teacher', 'Doctor', 'Artist'], n),
})

mca = MCA(n_components=2, n_iter=10, random_state=42)
mca.fit(mca_data)
print("\n--- MCA ---")
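# Label kept neutral: whether these values are adjusted depends on the
# prince version and the options passed to MCA
print("Eigenvalues:", mca.eigenvalues_)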
print("Row coordinates shape:", mca.row_coordinates(mca_data).shape)

# --- Symmetric map coordinates ---
def symmetric_map(N):
    I, J = N.shape
    n = N.sum()
    P = N / n
    r = P.sum(axis=1)
    c = P.sum(axis=0)

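    # Standardized residuals
    S = np.diag(1.0 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1.0 / np.sqrt(c))

    # SVD, keeping the min(I, J) - 1 nontrivial dimensions
    K = min(I, J) - 1
    U, D, Vt = np.linalg.svd(S, full_matrices=False)
    U, D, Vt = U[:, :K], D[:K], Vt[:K, :]
    lam = D**2

    # Symmetric map: principal coordinates for both rows and columns
    F_sym = np.diag(1.0 / np.sqrt(r)) @ U @ np.diag(D)
    G_sym = np.diag(1.0 / np.sqrt(c)) @ Vt.T @ np.diag(D)

    return F_sym, G_sym, lam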

F_sym, G_sym, lam = symmetric_map(data)
print("\nSymmetric row coords:\n", np.round(F_sym, 4))
print("Symmetric col coords:\n", np.round(G_sym, 4))

Interpretation Rules

Interpreting Correspondence Analysis:

  1. Number of dimensions: Examine the eigenvalue scree plot. The total number of dimensions is $\min(I, J) - 1$. The first two often capture most of the inertia.

  2. Inertia explained: Each eigenvalue $\lambda_k$ is the inertia on dimension $k$. Report the cumulative percentage of total inertia.

  3. Proximity: In the symmetric map, a row point and a column point lying far from the origin in the same direction indicate association. Specifically, row $i$ is positively associated with column $j$ if $p_{ij} > p_{i+}p_{+j}$ (above the value expected under independence).

  4. Origin: Points near the origin have profiles close to the average; they are not distinctive.

  5. Contribution: High contribution ($\text{ctr}_{ik} > 1/I$ for rows, $\text{ctr}_{jk} > 1/J$ for columns, that is, above the average contribution) indicates the point defines the axis. Check both row and column contributions.

  6. Quality ($\cos^2$): Low quality means the point is poorly represented in the low-dimensional display, so its position on the map may be misleading.

  7. Interpretation flow: First identify the axes (via contributions and profiles), then interpret the spatial configuration (proximity, origin distance), then assess reliability (quality, inertia).

  8. Avoid: Interpreting distances between row and column points in the symmetric map. Use the symmetric map only for relative positioning.

Extensions

Canonical Correspondence Analysis (CCA) combines CA with constrained ordination, relating community composition to environmental variables. Non-symmetric correspondence analysis treats one variable as the predictor and the other as the response, decomposing the numerator of the Goodman-Kruskal $\tau$ index rather than the $\chi^2$ statistic. Joint correspondence analysis fits only the off-diagonal blocks of the Burt matrix, reducing the inflation effect without the algebraic correction.
