Introduction

Advanced Statistical Methods

Finding Relationships Between Two Sets of Variables

Canonical correlation analysis identifies linear combinations of two variable sets that are maximally correlated, revealing the deepest relationships between paired multivariate data. Wilks' Lambda tests overall significance.

Psychology — Relate personality test batteries to behavioral outcome measures
Ecology — Link environmental variables to species abundance patterns across sites
Marketing — Connect consumer attitude surveys to purchasing behavior datasets

CCA reveals the hidden threads that bind two multivariate worlds together.

Canonical Correlation Analysis (CCA), introduced by Harold Hotelling in 1936, investigates the relationships between two sets of variables $\mathbf{X} = (X_1, \dots, X_p)^{\top}$ and $\mathbf{Y} = (Y_1, \dots, Y_q)^{\top}$ measured on the same $n$ subjects. Rather than examining individual bivariate correlations, CCA seeks linear combinations of each set that are maximally correlated with each other.

The method answers the question: What are the strongest possible linear relationships between two multidimensional datasets?

Mathematical Formulation

The Canonical Correlation Problem

Let $\boldsymbol{\Sigma}_{XX}$ ( $p \times p$ ), $\boldsymbol{\Sigma}_{YY}$ ( $q \times q$ ), and $\boldsymbol{\Sigma}_{XY}$ ( $p \times q$ ) be the within-set and cross-set covariance matrices. Without loss of generality, assume $p \leq q$ .

DfCanonical Variates

Canonical variates are linear combinations:

U = \mathbf{a}^{\top}\mathbf{X}, \qquad V = \mathbf{b}^{\top}\mathbf{Y}

that maximize the Pearson correlation $\text{corr}(U, V)$ subject to $\text{Var}(U) = 1$ and $\text{Var}(V) = 1$ .

The $k$ -th canonical pair $(\mathbf{a}_k, \mathbf{b}_k)$ solves:

\max_{\mathbf{a}, \mathbf{b}} \quad \mathbf{a}^{\top} \boldsymbol{\Sigma}_{XY} \mathbf{b}

\text{subject to} \quad \mathbf{a}^{\top} \boldsymbol{\Sigma}_{XX} \mathbf{a} = 1, \quad \mathbf{b}^{\top} \boldsymbol{\Sigma}_{YY} \mathbf{b} = 1

Using Lagrange multipliers, this reduces to a generalized eigenvalue problem. The solution proceeds through the matrices:

\mathbf{M}_1 = \boldsymbol{\Sigma}_{XX}^{-1} \boldsymbol{\Sigma}_{XY} \boldsymbol{\Sigma}_{YY}^{-1} \boldsymbol{\Sigma}_{YX}

\mathbf{M}_2 = \boldsymbol{\Sigma}_{YY}^{-1} \boldsymbol{\Sigma}_{YX} \boldsymbol{\Sigma}_{XX}^{-1} \boldsymbol{\Sigma}_{XY}

ThCanonical Correlation Theorem

The canonical correlations are the square roots of the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r$ of $\mathbf{M}_1$ (equivalently $\mathbf{M}_2$ ), where $r = \text{rank}(\boldsymbol{\Sigma}_{XY}) \leq \min(p, q)$ . The canonical weight vectors $\mathbf{a}_k$ and $\mathbf{b}_k$ are the corresponding eigenvectors, scaled to satisfy the unit-variance constraints. The $k$ -th canonical correlation is $\rho_k = \sqrt{\lambda_k}$ .

Canonical Variate Properties

The canonical variates satisfy orthogonality conditions:

\text{Cov}(U_i, U_j) = \mathbf{a}_i^{\top} \boldsymbol{\Sigma}_{XX} \mathbf{a}_j = \delta_{ij}

\text{Cov}(V_i, V_j) = \mathbf{b}_i^{\top} \boldsymbol{\Sigma}_{YY} \mathbf{b}_j = \delta_{ij}

\text{Cov}(U_i, V_j) = \rho_i \, \delta_{ij}

where $\delta_{ij}$ is the Kronecker delta. This means canonical variates within each set are uncorrelated, and the $i$ -th $\mathbf{X}$ -canonical variate is correlated only with the $i$ -th $\mathbf{Y}$ -canonical variate.

Number of Canonical Pairs

Dimensionality Determination

The number of non-trivial canonical correlations equals $r = \text{rank}(\boldsymbol{\Sigma}_{XY})$ . In practice, the effective dimensionality must be assessed:

Bartlett's test for remaining canonical correlations. To test whether canonical correlations $\rho_{k}, \rho_{k+1}, \dots, \rho_r$ are zero, compute:

\Lambda_k = \prod_{j=k}^{r} (1 - \hat{\rho}_j^2)

Under $H_0: \rho_k = \cdots = \rho_r = 0$ , the statistic:

\chi^2 = -\left[(n - k) - \frac{p + q + 1}{2}\right] \log \Lambda_k

follows approximately $\chi^2_{(p-k+1)(q-k+1)}$ . Sequential testing from $k = r$ downward identifies the number of significant pairs.

Wilks' Lambda

Wilks' Lambda provides an overall test of whether the two sets of variables are related:

DfWilks' Lambda for CCA

Wilks' Lambda for testing all canonical correlations simultaneously is:

\Lambda = \prod_{j=1}^{r} (1 - \hat{\rho}_j^2) = |\mathbf{I} - \boldsymbol{\Omega}|

where $\boldsymbol{\Omega} = \boldsymbol{\Sigma}_{XX}^{-1} \boldsymbol{\Sigma}_{XY} \boldsymbol{\Sigma}_{YY}^{-1} \boldsymbol{\Sigma}_{YX}$ . A small $\Lambda$ (close to 0) indicates significant multivariate relationship.

The exact $F$ -approximation to Wilks' Lambda for testing all $r$ canonical correlations simultaneously is:

F = \frac{1 - \sqrt{\Lambda'}}{1 - \Lambda'} \cdot \frac{df_2}{df_1}

where $\Lambda' = \Lambda^{2/(p+q+1-2r)}$ with $df_1 = (p)(q)$ and $df_2 = (p+q+1-2r)(n - 1 - (p+q)/2)$ .

Redundancy Analysis

Canonical correlations measure the relationship between canonical variates, but these may have poor interpretability. Redundancy analysis (Stewart & Love, 1968) quantifies how much of one set's variance is explained by the other set's canonical variates.

DfRedundancy Index

The redundancy of $\mathbf{Y}$ explained by the $k$ -th $\mathbf{X}$ -canonical variate $U_k$ is:

R^2_{Y|U_k} = \frac{\mathbf{s}_Y^{\top} \mathbf{e}_k \mathbf{e}_k^{\top} \mathbf{s}_Y}{\text{tr}(\boldsymbol{\Sigma}_{YY})}

where $\mathbf{s}_Y^{\top}$ is the vector of correlations between $\mathbf{Y}$ and $U_k$ , and $\mathbf{e}_k$ is the $k$ -th eigenvector of $\boldsymbol{\Sigma}_{YY}^{-1}$ . The total redundancy of $\mathbf{Y}$ explained by all $\mathbf{X}$ -canonical variates is:

R^2_{Y|U} = \frac{\sum_{k=1}^{r} \lambda_k \, \mathbf{c}_{Yk}^{\top}\mathbf{c}_{Yk}}{\text{tr}(\boldsymbol{\Sigma}_{YY})}

where $\mathbf{c}_{Yk}$ is the vector of structure correlations between $\mathbf{Y}$ and $U_k$ .

A high canonical correlation does not guarantee high redundancy. If the canonical variates are heavily weighted on a few variables, the overall redundancy of the full set may be low. Always report both canonical correlations and redundancy indices when interpreting CCA.

Structure Correlations

The structure correlations (canonical loadings) measure the relationship between original variables and canonical variates:

\mathbf{s}_{Xk} = \boldsymbol{\Sigma}_{XX} \mathbf{a}_k, \qquad \mathbf{s}_{Yk} = \boldsymbol{\Sigma}_{YY} \mathbf{b}_k

These correlations are often more interpretable than the canonical weights $\mathbf{a}_k, \mathbf{b}_k$ because they are less affected by multicollinearity.

Estimation and Computation

Sample CCA

Given a data matrix $\mathbf{Z} = [\mathbf{X} \; \mathbf{Y}]$ of dimension $n \times (p+q)$ :

\hat{\boldsymbol{\Sigma}} = \frac{1}{n-1}\mathbf{Z}^{\top}\mathbf{Z}, \qquad \hat{\boldsymbol{\Sigma}}_{XY} = \frac{1}{n-1}\mathbf{X}^{\top}\mathbf{Y}

\hat{\rho}_k = \sqrt{\hat{\lambda}_k} \quad \text{where } \hat{\lambda}_k \text{ are eigenvalues of } \hat{\mathbf{M}}_1 = \hat{\boldsymbol{\Sigma}}_{XX}^{-1}\hat{\boldsymbol{\Sigma}}_{XY}\hat{\boldsymbol{\Sigma}}_{YY}^{-1}\hat{\boldsymbol{\Sigma}}_{YX}

Regularized CCA

When $p + q > n$ or covariance matrices are singular, regularization is essential:

DfRegularized CCA

The regularized CCA objective adds penalties to the covariance matrices:

\max_{\mathbf{a}, \mathbf{b}} \quad \mathbf{a}^{\top}\hat{\boldsymbol{\Sigma}}_{XY}\mathbf{b}

\text{s.t.} \quad \mathbf{a}^{\top}(\hat{\boldsymbol{\Sigma}}_{XX} + \lambda_X \mathbf{I})\mathbf{a} = 1, \quad \mathbf{b}^{\top}(\hat{\boldsymbol{\Sigma}}_{YY} + \lambda_Y \mathbf{I})\mathbf{b} = 1

where $\lambda_X, \lambda_Y > 0$ are tuning parameters selected by cross-validation.

Python Implementation

import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import StandardScaler
from scipy import stats

np.random.seed(42)
n, p, q = 200, 5, 4

# Simulate correlated multivariate data
Z = np.random.randn(n, p + q)
L = np.linalg.cholesky(
    np.block([
        [np.eye(p), 0.4 * np.ones((p, q))],
        [0.4 * np.ones((q, p)), np.eye(q)]
    ])
)
X, Y = Z @ L[:, :p], Z @ L[:, p:]

scaler_X, scaler_Y = StandardScaler(), StandardScaler()
X_scaled = scaler_X.fit_transform(X)
Y_scaled = scaler_Y.fit_transform(Y)

# --- sklearn CCA ---
cca = CCA(n_components=min(p, q))
U, V = cca.fit_transform(X_scaled, Y_scaled)

# Canonical correlations
can_corr = np.array([
    np.corrcoef(U[:, k], V[:, k])[0, 1]
    for k in range(min(p, q))
])
print("Canonical correlations:", np.round(can_corr, 4))

# Canonical weights
print("X canonical weights:\n", np.round(cca.x_weights_, 4))
print("Y canonical weights:\n", np.round(cca.y_weights_, 4))

# Structure correlations (loadings)
X_loadings = np.corrcoef(X_scaled.T, U.T)[:p, p:]
Y_loadings = np.corrcoef(Y_scaled.T, V.T)[:q, q:]
print("X structure correlations:\n", np.round(X_loadings, 4))
print("Y structure correlations:\n", np.round(Y_loadings, 4))

# --- Wilks' Lambda ---
def wilks_lambda(can_corr, n, p, q):
    r = len(can_corr)
    Lambda = np.prod(1 - can_corr**2)
    # Bartlett's chi-square approximation
    chi2_stat = -(n - 1 - (p + q + 1) / 2) * np.log(LLambda)
    df = (p) * (q)
    p_value = 1 - stats.chi2.cdf(chi2_stat, df)
    return Lambda, chi2_stat, df, p_value

Lambda, chi2, df, pval = wilks_lambda(can_corr, n, p, q)
print(f"Wilks' Lambda: {Lambda:.4f}, chi2: {chi2:.2f}, df: {df}, p: {pval:.2e}")

# --- Redundancy analysis ---
def redundancy_analysis(X_scaled, Y_scaled, U, V, can_corr):
    p = X_scaled.shape[1]
    q = Y_scaled.shape[1]
    r = len(can_corr)

    # Structure correlations
    S_X = np.corrcoef(X_scaled.T, U.T)[:p, p:]
    S_Y = np.corrcoef(Y_scaled.T, V.T)[:q, q:]

    # Redundancy: proportion of Y variance explained by X canonical variates
    Red_Y = np.sum(S_Y**2, axis=0) / q
    Red_X = np.sum(S_X**2, axis=0) / p

    # Total redundancy
    total_red_Y = np.sum(Red_Y)
    total_red_X = np.sum(Red_X)

    return Red_X, Red_Y, total_red_X, total_red_Y

Red_X, Red_Y, tot_X, tot_Y = redundancy_analysis(X_scaled, Y_scaled, U, V, can_corr)
print(f"Total redundancy of Y explained by X variates: {tot_Y:.4f}")
print(f"Total redundancy of X explained by Y variates: {tot_X:.4f}")

# --- Manual CCA via SVD ---
def cca_svd(X, Y):
    n = X.shape[0]
    X_c = X - X.mean(axis=0)
    Y_c = Y - Y.mean(axis=0)

    S_XX = X_c.T @ X_c / (n - 1)
    S_YY = Y_c.T @ Y_c / (n - 1)
    S_XY = X_c.T @ Y_c / (n - 1)

    # Whitening
    Lx = np.linalg.cholesky(S_XX)
    Ly = np.linalg.cholesky(S_YY)

    K = np.linalg.solve(Lx, S_XY @ np.linalg.inv(Ly.T))
    U_svd, D, Vt = np.linalg.svd(K, full_matrices=False)

    A = np.linalg.solve(Lx.T, U_svd)
    B = np.linalg.solve(Ly.T, Vt.T)

    return A, B, D  # D contains canonical correlations

A, B, rho = cca_svd(X_scaled, Y_scaled)
print("Manual CCA correlations:", np.round(rho, 4))

Interpretation Guidelines

Interpreting Canonical Correlation Analysis:

Number of significant pairs: Use Bartlett's test sequentially or evaluate the scree plot of $\hat{\rho}_k^2$ .
Canonical correlations $\hat{\rho}_k$ measure the strength of the $k$ -th pair of canonical variates. Square them for the proportion of shared variance between variates.
Structure correlations (loadings) identify which original variables contribute most to each canonical variate. Loadings $|r| > 0.30$ are typically considered meaningful.
Redundancy indices quantify how much of one set's total variance is explained by the other set's variates — the most policy-relevant metric.
Canonical weights are analogous to regression coefficients and are sensitive to multicollinearity. Prefer structure correlations for interpretation.
Rotation: Canonical variates are determined up to sign. If sign reversal aids interpretation, flip the canonical weights and loadings.
Sample size: CCA requires $n > p + q$ for stable estimation. With $n \leq p + q$ , use regularized CCA or dimension reduction.

Extensions

Partial Least Squares (PLS) maximizes $\text{Cov}(\mathbf{a}^{\top}\mathbf{X}, \mathbf{b}^{\top}\mathbf{Y})$ without variance constraints, emphasizing covariance over correlation. Kernel CCA handles nonlinear relationships by mapping to reproducing kernel Hilbert spaces. Sparse CCA (Witten & Tibshirani, 2009) imposes $\ell_1$ penalties on canonical weights for interpretability in high-dimensional settings.

Canonical Correlation Analysis