πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Canonical Correlation Analysis

Advanced Statistical MethodsMultivariate Methods🟒 Free Lesson

Advertisement

Introduction

Advanced Statistical Methods

Finding Relationships Between Two Sets of Variables

Canonical correlation analysis identifies linear combinations of two variable sets that are maximally correlated, revealing the deepest relationships between paired multivariate data. Wilks' Lambda tests overall significance.

  • Psychology β€” Relate personality test batteries to behavioral outcome measures
  • Ecology β€” Link environmental variables to species abundance patterns across sites
  • Marketing β€” Connect consumer attitude surveys to purchasing behavior datasets

CCA reveals the hidden threads that bind two multivariate worlds together.


Canonical Correlation Analysis (CCA), introduced by Harold Hotelling in 1936, investigates the relationships between two sets of variables X=(X1,…,Xp)⊀\mathbf{X} = (X_1, \dots, X_p)^{\top} and Y=(Y1,…,Yq)⊀\mathbf{Y} = (Y_1, \dots, Y_q)^{\top} measured on the same nn subjects. Rather than examining individual bivariate correlations, CCA seeks linear combinations of each set that are maximally correlated with each other.

The method answers the question: What are the strongest possible linear relationships between two multidimensional datasets?

Mathematical Formulation

The Canonical Correlation Problem

Let Ξ£XX\boldsymbol{\Sigma}_{XX} (pΓ—pp \times p), Ξ£YY\boldsymbol{\Sigma}_{YY} (qΓ—qq \times q), and Ξ£XY\boldsymbol{\Sigma}_{XY} (pΓ—qp \times q) be the within-set and cross-set covariance matrices. Without loss of generality, assume p≀qp \leq q.

DfCanonical Variates

Canonical variates are linear combinations:

U=a⊀X,V=b⊀YU = \mathbf{a}^{\top}\mathbf{X}, \qquad V = \mathbf{b}^{\top}\mathbf{Y}

that maximize the Pearson correlation corr(U,V)\text{corr}(U, V) subject to Var(U)=1\text{Var}(U) = 1 and Var(V)=1\text{Var}(V) = 1.

The kk-th canonical pair (ak,bk)(\mathbf{a}_k, \mathbf{b}_k) solves:

max⁑a,ba⊀ΣXYb\max_{\mathbf{a}, \mathbf{b}} \quad \mathbf{a}^{\top} \boldsymbol{\Sigma}_{XY} \mathbf{b}
subject toa⊀ΣXXa=1,b⊀ΣYYb=1\text{subject to} \quad \mathbf{a}^{\top} \boldsymbol{\Sigma}_{XX} \mathbf{a} = 1, \quad \mathbf{b}^{\top} \boldsymbol{\Sigma}_{YY} \mathbf{b} = 1

Using Lagrange multipliers, this reduces to a generalized eigenvalue problem. The solution proceeds through the matrices:

M1=Ξ£XXβˆ’1Ξ£XYΞ£YYβˆ’1Ξ£YX\mathbf{M}_1 = \boldsymbol{\Sigma}_{XX}^{-1} \boldsymbol{\Sigma}_{XY} \boldsymbol{\Sigma}_{YY}^{-1} \boldsymbol{\Sigma}_{YX}
M2=Ξ£YYβˆ’1Ξ£YXΞ£XXβˆ’1Ξ£XY\mathbf{M}_2 = \boldsymbol{\Sigma}_{YY}^{-1} \boldsymbol{\Sigma}_{YX} \boldsymbol{\Sigma}_{XX}^{-1} \boldsymbol{\Sigma}_{XY}

ThCanonical Correlation Theorem

The canonical correlations are the square roots of the eigenvalues Ξ»1β‰₯Ξ»2β‰₯β‹―β‰₯Ξ»r\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r of M1\mathbf{M}_1 (equivalently M2\mathbf{M}_2), where r=rank(Ξ£XY)≀min⁑(p,q)r = \text{rank}(\boldsymbol{\Sigma}_{XY}) \leq \min(p, q). The canonical weight vectors ak\mathbf{a}_k and bk\mathbf{b}_k are the corresponding eigenvectors, scaled to satisfy the unit-variance constraints. The kk-th canonical correlation is ρk=Ξ»k\rho_k = \sqrt{\lambda_k}.

Canonical Variate Properties

The canonical variates satisfy orthogonality conditions:

Cov(Ui,Uj)=ai⊀ΣXXaj=δij\text{Cov}(U_i, U_j) = \mathbf{a}_i^{\top} \boldsymbol{\Sigma}_{XX} \mathbf{a}_j = \delta_{ij}
Cov(Vi,Vj)=bi⊀ΣYYbj=δij\text{Cov}(V_i, V_j) = \mathbf{b}_i^{\top} \boldsymbol{\Sigma}_{YY} \mathbf{b}_j = \delta_{ij}
Cov(Ui,Vj)=ρi δij\text{Cov}(U_i, V_j) = \rho_i \, \delta_{ij}

where Ξ΄ij\delta_{ij} is the Kronecker delta. This means canonical variates within each set are uncorrelated, and the ii-th X\mathbf{X}-canonical variate is correlated only with the ii-th Y\mathbf{Y}-canonical variate.

Number of Canonical Pairs

Dimensionality Determination

The number of non-trivial canonical correlations equals r=rank(Ξ£XY)r = \text{rank}(\boldsymbol{\Sigma}_{XY}). In practice, the effective dimensionality must be assessed:

Bartlett's test for remaining canonical correlations. To test whether canonical correlations ρk,ρk+1,…,ρr\rho_{k}, \rho_{k+1}, \dots, \rho_r are zero, compute:

Ξ›k=∏j=kr(1βˆ’Ο^j2)\Lambda_k = \prod_{j=k}^{r} (1 - \hat{\rho}_j^2)

Under H0:ρk=β‹―=ρr=0H_0: \rho_k = \cdots = \rho_r = 0, the statistic:

Ο‡2=βˆ’[(nβˆ’k)βˆ’p+q+12]log⁑Λk\chi^2 = -\left[(n - k) - \frac{p + q + 1}{2}\right] \log \Lambda_k

follows approximately Ο‡(pβˆ’k+1)(qβˆ’k+1)2\chi^2_{(p-k+1)(q-k+1)}. Sequential testing from k=rk = r downward identifies the number of significant pairs.

Wilks' Lambda

Wilks' Lambda provides an overall test of whether the two sets of variables are related:

DfWilks' Lambda for CCA

Wilks' Lambda for testing all canonical correlations simultaneously is:

Ξ›=∏j=1r(1βˆ’Ο^j2)=∣Iβˆ’Ξ©βˆ£\Lambda = \prod_{j=1}^{r} (1 - \hat{\rho}_j^2) = |\mathbf{I} - \boldsymbol{\Omega}|

where Ξ©=Ξ£XXβˆ’1Ξ£XYΞ£YYβˆ’1Ξ£YX\boldsymbol{\Omega} = \boldsymbol{\Sigma}_{XX}^{-1} \boldsymbol{\Sigma}_{XY} \boldsymbol{\Sigma}_{YY}^{-1} \boldsymbol{\Sigma}_{YX}. A small Ξ›\Lambda (close to 0) indicates significant multivariate relationship.

The exact FF-approximation to Wilks' Lambda for testing all rr canonical correlations simultaneously is:

F=1βˆ’Ξ›β€²1βˆ’Ξ›β€²β‹…df2df1F = \frac{1 - \sqrt{\Lambda'}}{1 - \Lambda'} \cdot \frac{df_2}{df_1}

where Ξ›β€²=Ξ›2/(p+q+1βˆ’2r)\Lambda' = \Lambda^{2/(p+q+1-2r)} with df1=(p)(q)df_1 = (p)(q) and df2=(p+q+1βˆ’2r)(nβˆ’1βˆ’(p+q)/2)df_2 = (p+q+1-2r)(n - 1 - (p+q)/2).

Redundancy Analysis

Canonical correlations measure the relationship between canonical variates, but these may have poor interpretability. Redundancy analysis (Stewart & Love, 1968) quantifies how much of one set's variance is explained by the other set's canonical variates.

DfRedundancy Index

The redundancy of Y\mathbf{Y} explained by the kk-th X\mathbf{X}-canonical variate UkU_k is:

RY∣Uk2=sY⊀ekek⊀sYtr(ΣYY)R^2_{Y|U_k} = \frac{\mathbf{s}_Y^{\top} \mathbf{e}_k \mathbf{e}_k^{\top} \mathbf{s}_Y}{\text{tr}(\boldsymbol{\Sigma}_{YY})}

where sY⊀\mathbf{s}_Y^{\top} is the vector of correlations between Y\mathbf{Y} and UkU_k, and ek\mathbf{e}_k is the kk-th eigenvector of Ξ£YYβˆ’1\boldsymbol{\Sigma}_{YY}^{-1}. The total redundancy of Y\mathbf{Y} explained by all X\mathbf{X}-canonical variates is:

RY∣U2=βˆ‘k=1rΞ»k cYk⊀cYktr(Ξ£YY)R^2_{Y|U} = \frac{\sum_{k=1}^{r} \lambda_k \, \mathbf{c}_{Yk}^{\top}\mathbf{c}_{Yk}}{\text{tr}(\boldsymbol{\Sigma}_{YY})}

where cYk\mathbf{c}_{Yk} is the vector of structure correlations between Y\mathbf{Y} and UkU_k.

A high canonical correlation does not guarantee high redundancy. If the canonical variates are heavily weighted on a few variables, the overall redundancy of the full set may be low. Always report both canonical correlations and redundancy indices when interpreting CCA.

Structure Correlations

The structure correlations (canonical loadings) measure the relationship between original variables and canonical variates:

sXk=Ξ£XXak,sYk=Ξ£YYbk\mathbf{s}_{Xk} = \boldsymbol{\Sigma}_{XX} \mathbf{a}_k, \qquad \mathbf{s}_{Yk} = \boldsymbol{\Sigma}_{YY} \mathbf{b}_k

These correlations are often more interpretable than the canonical weights ak,bk\mathbf{a}_k, \mathbf{b}_k because they are less affected by multicollinearity.

Estimation and Computation

Sample CCA

Given a data matrix Z=[Xβ€…β€ŠY]\mathbf{Z} = [\mathbf{X} \; \mathbf{Y}] of dimension nΓ—(p+q)n \times (p+q):

Ξ£^=1nβˆ’1Z⊀Z,Ξ£^XY=1nβˆ’1X⊀Y\hat{\boldsymbol{\Sigma}} = \frac{1}{n-1}\mathbf{Z}^{\top}\mathbf{Z}, \qquad \hat{\boldsymbol{\Sigma}}_{XY} = \frac{1}{n-1}\mathbf{X}^{\top}\mathbf{Y}
ρ^k=Ξ»^kwhereΒ Ξ»^kΒ areΒ eigenvaluesΒ ofΒ M^1=Ξ£^XXβˆ’1Ξ£^XYΞ£^YYβˆ’1Ξ£^YX\hat{\rho}_k = \sqrt{\hat{\lambda}_k} \quad \text{where } \hat{\lambda}_k \text{ are eigenvalues of } \hat{\mathbf{M}}_1 = \hat{\boldsymbol{\Sigma}}_{XX}^{-1}\hat{\boldsymbol{\Sigma}}_{XY}\hat{\boldsymbol{\Sigma}}_{YY}^{-1}\hat{\boldsymbol{\Sigma}}_{YX}

Regularized CCA

When p+q>np + q > n or covariance matrices are singular, regularization is essential:

DfRegularized CCA

The regularized CCA objective adds penalties to the covariance matrices:

max⁑a,ba⊀Σ^XYb\max_{\mathbf{a}, \mathbf{b}} \quad \mathbf{a}^{\top}\hat{\boldsymbol{\Sigma}}_{XY}\mathbf{b}
s.t.a⊀(Σ^XX+λXI)a=1,b⊀(Σ^YY+λYI)b=1\text{s.t.} \quad \mathbf{a}^{\top}(\hat{\boldsymbol{\Sigma}}_{XX} + \lambda_X \mathbf{I})\mathbf{a} = 1, \quad \mathbf{b}^{\top}(\hat{\boldsymbol{\Sigma}}_{YY} + \lambda_Y \mathbf{I})\mathbf{b} = 1

where Ξ»X,Ξ»Y>0\lambda_X, \lambda_Y > 0 are tuning parameters selected by cross-validation.

Python Implementation

import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import StandardScaler
from scipy import stats

np.random.seed(42)
n, p, q = 200, 5, 4

# Simulate correlated multivariate data
Z = np.random.randn(n, p + q)
L = np.linalg.cholesky(
    np.block([
        [np.eye(p), 0.4 * np.ones((p, q))],
        [0.4 * np.ones((q, p)), np.eye(q)]
    ])
)
X, Y = Z @ L[:, :p], Z @ L[:, p:]

scaler_X, scaler_Y = StandardScaler(), StandardScaler()
X_scaled = scaler_X.fit_transform(X)
Y_scaled = scaler_Y.fit_transform(Y)

# --- sklearn CCA ---
cca = CCA(n_components=min(p, q))
U, V = cca.fit_transform(X_scaled, Y_scaled)

# Canonical correlations
can_corr = np.array([
    np.corrcoef(U[:, k], V[:, k])[0, 1]
    for k in range(min(p, q))
])
print("Canonical correlations:", np.round(can_corr, 4))

# Canonical weights
print("X canonical weights:\n", np.round(cca.x_weights_, 4))
print("Y canonical weights:\n", np.round(cca.y_weights_, 4))

# Structure correlations (loadings)
X_loadings = np.corrcoef(X_scaled.T, U.T)[:p, p:]
Y_loadings = np.corrcoef(Y_scaled.T, V.T)[:q, q:]
print("X structure correlations:\n", np.round(X_loadings, 4))
print("Y structure correlations:\n", np.round(Y_loadings, 4))

# --- Wilks' Lambda ---
def wilks_lambda(can_corr, n, p, q):
    r = len(can_corr)
    Lambda = np.prod(1 - can_corr**2)
    # Bartlett's chi-square approximation
    chi2_stat = -(n - 1 - (p + q + 1) / 2) * np.log(LLambda)
    df = (p) * (q)
    p_value = 1 - stats.chi2.cdf(chi2_stat, df)
    return Lambda, chi2_stat, df, p_value

Lambda, chi2, df, pval = wilks_lambda(can_corr, n, p, q)
print(f"Wilks' Lambda: {Lambda:.4f}, chi2: {chi2:.2f}, df: {df}, p: {pval:.2e}")

# --- Redundancy analysis ---
def redundancy_analysis(X_scaled, Y_scaled, U, V, can_corr):
    p = X_scaled.shape[1]
    q = Y_scaled.shape[1]
    r = len(can_corr)

    # Structure correlations
    S_X = np.corrcoef(X_scaled.T, U.T)[:p, p:]
    S_Y = np.corrcoef(Y_scaled.T, V.T)[:q, q:]

    # Redundancy: proportion of Y variance explained by X canonical variates
    Red_Y = np.sum(S_Y**2, axis=0) / q
    Red_X = np.sum(S_X**2, axis=0) / p

    # Total redundancy
    total_red_Y = np.sum(Red_Y)
    total_red_X = np.sum(Red_X)

    return Red_X, Red_Y, total_red_X, total_red_Y

Red_X, Red_Y, tot_X, tot_Y = redundancy_analysis(X_scaled, Y_scaled, U, V, can_corr)
print(f"Total redundancy of Y explained by X variates: {tot_Y:.4f}")
print(f"Total redundancy of X explained by Y variates: {tot_X:.4f}")

# --- Manual CCA via SVD ---
def cca_svd(X, Y):
    n = X.shape[0]
    X_c = X - X.mean(axis=0)
    Y_c = Y - Y.mean(axis=0)

    S_XX = X_c.T @ X_c / (n - 1)
    S_YY = Y_c.T @ Y_c / (n - 1)
    S_XY = X_c.T @ Y_c / (n - 1)

    # Whitening
    Lx = np.linalg.cholesky(S_XX)
    Ly = np.linalg.cholesky(S_YY)

    K = np.linalg.solve(Lx, S_XY @ np.linalg.inv(Ly.T))
    U_svd, D, Vt = np.linalg.svd(K, full_matrices=False)

    A = np.linalg.solve(Lx.T, U_svd)
    B = np.linalg.solve(Ly.T, Vt.T)

    return A, B, D  # D contains canonical correlations

A, B, rho = cca_svd(X_scaled, Y_scaled)
print("Manual CCA correlations:", np.round(rho, 4))

Interpretation Guidelines

Interpreting Canonical Correlation Analysis:

  1. Number of significant pairs: Use Bartlett's test sequentially or evaluate the scree plot of ρ^k2\hat{\rho}_k^2.

  2. Canonical correlations ρ^k\hat{\rho}_k measure the strength of the kk-th pair of canonical variates. Square them for the proportion of shared variance between variates.

  3. Structure correlations (loadings) identify which original variables contribute most to each canonical variate. Loadings ∣r∣>0.30|r| > 0.30 are typically considered meaningful.

  4. Redundancy indices quantify how much of one set's total variance is explained by the other set's variates β€” the most policy-relevant metric.

  5. Canonical weights are analogous to regression coefficients and are sensitive to multicollinearity. Prefer structure correlations for interpretation.

  6. Rotation: Canonical variates are determined up to sign. If sign reversal aids interpretation, flip the canonical weights and loadings.

  7. Sample size: CCA requires n>p+qn > p + q for stable estimation. With n≀p+qn \leq p + q, use regularized CCA or dimension reduction.

Extensions

Partial Least Squares (PLS) maximizes Cov(a⊀X,b⊀Y)\text{Cov}(\mathbf{a}^{\top}\mathbf{X}, \mathbf{b}^{\top}\mathbf{Y}) without variance constraints, emphasizing covariance over correlation. Kernel CCA handles nonlinear relationships by mapping to reproducing kernel Hilbert spaces. Sparse CCA (Witten & Tibshirani, 2009) imposes β„“1\ell_1 penalties on canonical weights for interpretability in high-dimensional settings.

⭐

Premium Content

Canonical Correlation Analysis

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement