Introduction
Advanced Statistical Methods
Finding Relationships Between Two Sets of Variables
Canonical correlation analysis identifies linear combinations of two variable sets that are maximally correlated, revealing the deepest relationships between paired multivariate data. Wilks' Lambda tests overall significance.
- Psychology: Relate personality test batteries to behavioral outcome measures
- Ecology: Link environmental variables to species abundance patterns across sites
- Marketing: Connect consumer attitude surveys to purchasing behavior datasets
CCA reveals the hidden threads that bind two multivariate worlds together.
Canonical Correlation Analysis (CCA), introduced by Harold Hotelling in 1936, investigates the relationships between two sets of variables, $X$ and $Y$, measured on the same subjects. Rather than examining individual bivariate correlations, CCA seeks linear combinations of each set that are maximally correlated with each other.
The method answers the question: What are the strongest possible linear relationships between two multidimensional datasets?
Mathematical Formulation
The Canonical Correlation Problem
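Let $\Sigma_{XX}$ ($p \times p$), $\Sigma_{YY}$ ($q \times q$), and $\Sigma_{XY}$ ($p \times q$) be the within-set and cross-set covariance matrices of $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$. Without loss of generality, assume $p \le q$.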
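Definition: Canonical Variates
Canonical variates are linear combinations:
$$U = a^\top X, \qquad V = b^\top Y$$
that maximize the Pearson correlation $\rho = \operatorname{Corr}(U, V)$ subject to $\operatorname{Var}(U) = a^\top \Sigma_{XX} a = 1$ and $\operatorname{Var}(V) = b^\top \Sigma_{YY} b = 1$.
The $k$-th canonical pair solves:
$$(a_k, b_k) = \arg\max_{a,\, b}\; a^\top \Sigma_{XY}\, b \quad \text{subject to} \quad a^\top \Sigma_{XX} a = b^\top \Sigma_{YY} b = 1, \quad a^\top \Sigma_{XX} a_j = b^\top \Sigma_{YY} b_j = 0 \;\; (j < k)$$
Using Lagrange multipliers, this reduces to a generalized eigenvalue problem. The solution proceeds through the matrices:
$$\Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \qquad \text{and} \qquad \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$$
with $\Sigma_{YX} = \Sigma_{XY}^\top$.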
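Theorem: Canonical Correlation Theorem
The canonical correlations are the square roots of the eigenvalues of $\Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}$ (equivalently $\Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$, which has the same nonzero eigenvalues), where $\Sigma_{YX} = \Sigma_{XY}^\top$. The canonical weight vectors $a_k$ and $b_k$ are the corresponding eigenvectors, scaled to satisfy the unit-variance constraints. The $k$-th canonical correlation is $\rho_k = \sqrt{\lambda_k}$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ are the eigenvalues in decreasing order.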
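The eigenvalue characterization can be checked numerically. The following is a minimal sketch, assuming sample covariance matrices stand in for the population ones; the function name `canonical_correlations_eig` and the simulated one-factor data are illustrative choices, not part of the original text.

```python
import numpy as np

def canonical_correlations_eig(X, Y):
    """Canonical correlations as square roots of the eigenvalues of
    S_XX^{-1} S_XY S_YY^{-1} S_YX (sample covariances replace Sigma)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    S_xx = Xc.T @ Xc / (n - 1)
    S_yy = Yc.T @ Yc / (n - 1)
    S_xy = Xc.T @ Yc / (n - 1)
    # Build the product with linear solves instead of explicit inverses
    M_x = np.linalg.solve(S_xx, S_xy) @ np.linalg.solve(S_yy, S_xy.T)
    # Eigenvalues are real in theory; drop round-off imaginary parts
    eigvals = np.linalg.eigvals(M_x).real
    r = min(X.shape[1], Y.shape[1])
    lam = np.sort(np.clip(eigvals, 0.0, 1.0))[::-1][:r]
    return np.sqrt(lam)

# Illustrative data: both sets are noisy copies of one shared latent factor
rng = np.random.default_rng(0)
n = 500
latent = rng.standard_normal((n, 1))
X = latent + 0.5 * rng.standard_normal((n, 3))
Y = latent + 0.5 * rng.standard_normal((n, 2))

rho = canonical_correlations_eig(X, Y)
rho_swapped = canonical_correlations_eig(Y, X)  # uses the other matrix product
print("From the X-side matrix:", np.round(rho, 4))
print("From the Y-side matrix:", np.round(rho_swapped, 4))
```

Swapping the roles of the two sets switches between the two matrix products, so agreement of the two printed vectors illustrates the "equivalently" clause of the theorem.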
Canonical Variate Properties
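The canonical variates satisfy orthogonality conditions:
$$\operatorname{Corr}(U_j, U_k) = \delta_{jk}, \qquad \operatorname{Corr}(V_j, V_k) = \delta_{jk}, \qquad \operatorname{Corr}(U_j, V_k) = \rho_k\, \delta_{jk}$$
where $\delta_{jk}$ is the Kronecker delta. This means canonical variates within each set are uncorrelated, and the $k$-th $X$-canonical variate is correlated only with the $k$-th $Y$-canonical variate.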
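These conditions hold exactly in-sample when the variates are computed from sample covariances. A small sketch of such a check follows, assuming a Cholesky-whitening plus SVD route; the helper `canonical_variates` and the simulated data are hypothetical and only serve the illustration.

```python
import numpy as np

def canonical_variates(X, Y):
    """Return canonical variates U, V and canonical correlations rho."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    S_xx = Xc.T @ Xc / (n - 1)
    S_yy = Yc.T @ Yc / (n - 1)
    S_xy = Xc.T @ Yc / (n - 1)
    Lx = np.linalg.cholesky(S_xx)          # S_xx = Lx Lx^T
    Ly = np.linalg.cholesky(S_yy)          # S_yy = Ly Ly^T
    # Whitened cross-covariance K = Lx^{-1} S_xy Ly^{-T}
    K = np.linalg.solve(Lx, np.linalg.solve(Ly, S_xy.T).T)
    P, rho, Qt = np.linalg.svd(K, full_matrices=False)
    A = np.linalg.solve(Lx.T, P)           # X-side canonical weights
    B = np.linalg.solve(Ly.T, Qt.T)        # Y-side canonical weights
    return Xc @ A, Yc @ B, rho

rng = np.random.default_rng(1)
n, p, q = 400, 4, 3
X = rng.standard_normal((n, p))
Y = X[:, :q] @ rng.standard_normal((q, q)) + rng.standard_normal((n, q))

U, V, rho = canonical_variates(X, Y)
r = len(rho)
C = np.corrcoef(U.T, V.T)                  # (2r x 2r) correlations of all variates
corr_UU, corr_VV, corr_UV = C[:r, :r], C[r:, r:], C[:r, r:]

print("Corr(U, U) is the identity:", np.allclose(corr_UU, np.eye(r)))
print("Corr(V, V) is the identity:", np.allclose(corr_VV, np.eye(r)))
print("Corr(U, V) is diag(rho):  ", np.allclose(corr_UV, np.diag(rho)))
```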
Number of Canonical Pairs
Dimensionality Determination
The number of non-trivial canonical correlations equals $r = \min(p, q)$. In practice, the effective dimensionality must be assessed:
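Bartlett's test for remaining canonical correlations. To test whether the last $r - k$ canonical correlations are zero, $H_0: \rho_{k+1} = \cdots = \rho_r = 0$ with $r = \min(p, q)$, compute:
$$\Lambda_k = \prod_{i=k+1}^{r} \left(1 - \hat{\rho}_i^2\right)$$
Under $H_0$, the statistic:
$$\chi^2 = -\left(n - 1 - \frac{p + q + 1}{2}\right) \ln \Lambda_k$$
follows approximately $\chi^2_{(p-k)(q-k)}$. Sequential testing from $\rho_1$ downward ($k = 0, 1, \ldots$) identifies the number of significant pairs.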
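The sequential procedure can be sketched as a short loop. This is an illustration under stated assumptions: the function name `bartlett_sequential`, the significance level, and the canonical correlations passed in are made-up values chosen for the example, not results from any dataset.

```python
import numpy as np
from scipy import stats

def bartlett_sequential(can_corr, n, p, q):
    """Bartlett's chi-square test of H0: rho_{k+1} = ... = rho_r = 0 for k = 0..r-1.

    Returns a list of tuples (k, Lambda_k, chi2, df, p_value).
    """
    rho = np.asarray(can_corr, dtype=float)
    r = len(rho)
    scale = n - 1 - (p + q + 1) / 2
    results = []
    for k in range(r):
        lam_k = np.prod(1 - rho[k:] ** 2)      # product over the remaining pairs
        chi2 = -scale * np.log(lam_k)
        df = (p - k) * (q - k)
        p_value = stats.chi2.sf(chi2, df)      # upper-tail probability
        results.append((k, lam_k, chi2, df, p_value))
    return results

# Hypothetical canonical correlations, for illustration only
results = bartlett_sequential([0.85, 0.40, 0.05], n=200, p=4, q=3)

alpha = 0.05
n_significant = 0
for k, lam_k, chi2, df, p_value in results:
    print(f"k={k}: Lambda={lam_k:.4f}, chi2={chi2:.2f}, df={df}, p={p_value:.3g}")
    if p_value >= alpha:
        break                                  # stop at the first non-rejection
    n_significant += 1
print("Pairs retained:", n_significant)
```

Stopping at the first non-rejection is one common convention; because the tests are nested, the per-test level `alpha` does not control the overall error rate of the whole sequence.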
Wilks' Lambda
Wilks' Lambda provides an overall test of whether the two sets of variables are related:
Definition: Wilks' Lambda for CCA
Wilks' Lambda for testing all canonical correlations simultaneously is:
$$\Lambda = \prod_{i=1}^{r} \left(1 - \rho_i^2\right)$$
where $r = \min(p, q)$. A small $\Lambda$ (close to 0) indicates a strong multivariate relationship.
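Rao's $F$-approximation to Wilks' Lambda for testing all canonical correlations simultaneously (exact when $\min(p, q) \le 2$) is:
$$F = \frac{1 - \Lambda^{1/t}}{\Lambda^{1/t}} \cdot \frac{df_2}{df_1} \;\sim\; F_{df_1,\, df_2}$$
where $df_1 = pq$ and $df_2 = wt - \frac{pq}{2} + 1$, with $w = n - 1 - \frac{p + q + 1}{2}$ and $t = \sqrt{\dfrac{p^2 q^2 - 4}{p^2 + q^2 - 5}}$ (taking $t = 1$ when $p^2 + q^2 - 5 \le 0$).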
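As a sketch of how the $F$-approximation might be computed, the function below follows Rao's standard formula with $t$ set to 1 when $p^2 + q^2 - 5 \le 0$. The name `wilks_rao_f` and the input correlations are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def wilks_rao_f(can_corr, n, p, q):
    """Rao's F-approximation for Wilks' Lambda built from canonical correlations.

    Returns (Lambda, F, df1, df2, p_value).
    """
    rho = np.asarray(can_corr, dtype=float)
    lam = np.prod(1 - rho ** 2)
    w = n - 1 - (p + q + 1) / 2
    denom = p ** 2 + q ** 2 - 5
    # t is defined as 1 when the ratio under the square root is not usable
    t = np.sqrt((p ** 2 * q ** 2 - 4) / denom) if denom > 0 else 1.0
    df1 = p * q
    df2 = w * t - p * q / 2 + 1
    root = lam ** (1 / t)
    f_stat = (1 - root) / root * df2 / df1
    p_value = stats.f.sf(f_stat, df1, df2)
    return lam, f_stat, df1, df2, p_value

# Hypothetical canonical correlations, for illustration only
lam, f_stat, df1, df2, p_value = wilks_rao_f([0.85, 0.40, 0.05], n=200, p=4, q=3)
print(f"Lambda={lam:.4f}, F={f_stat:.2f}, df1={df1}, df2={df2:.1f}, p={p_value:.3g}")
```

With $p = q = 1$ the statistic reduces to the familiar $F = r^2 (n - 2) / (1 - r^2)$ test of a single Pearson correlation, which is a convenient sanity check.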
Redundancy Analysis
Canonical correlations measure the relationship between canonical variates, but these may have poor interpretability. Redundancy analysis (Stewart & Love, 1968) quantifies how much of one set's variance is explained by the other set's canonical variates.
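Definition: Redundancy Index
The redundancy of $Y$ explained by the $k$-th $X$-canonical variate is:
$$\operatorname{Rd}(Y \mid U_k) = \frac{1}{q}\, r_{Y U_k}^\top\, r_{Y U_k}$$
where $r_{Y U_k}$ is the vector of correlations between $Y$ and $U_k = a_k^\top X$, and $a_k$ is the $k$-th eigenvector of $\Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}$. The total redundancy of $Y$ explained by all $X$-canonical variates is:
$$\operatorname{Rd}(Y \mid U) = \sum_{k=1}^{r} \rho_k^2 \cdot \frac{1}{q}\, s_{Y,k}^\top\, s_{Y,k}$$
where $s_{Y,k}$ is the vector of structure correlations between $Y$ and $V_k$.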
A high canonical correlation does not guarantee high redundancy. If the canonical variates are heavily weighted on a few variables, the overall redundancy of the full set may be low. Always report both canonical correlations and redundancy indices when interpreting CCA.
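A small simulation makes the warning concrete. In this sketch only one of five $Y$ variables is tied to $X$, so the first canonical correlation is high while the redundancy of the full $Y$ set stays low. The QR-plus-SVD helper `cca_qr` is one of several equivalent ways to compute CCA and is an assumption of this example, as are the simulated data.

```python
import numpy as np

def cca_qr(X, Y):
    """CCA via QR and SVD; returns unit-variance variates U, V and correlations."""
    n = X.shape[0]
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))   # orthonormal basis of centered X
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))   # orthonormal basis of centered Y
    P, rho, Qt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    scale = np.sqrt(n - 1)                     # rescale to unit sample variance
    return Qx @ P * scale, Qy @ Qt.T * scale, rho

rng = np.random.default_rng(2)
n, p, q = 1000, 3, 5
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
Y[:, 0] = X[:, 0] + 0.3 * rng.standard_normal(n)   # only the first Y variable depends on X

U, V, rho = cca_qr(X, Y)

# Redundancy of Y given each X-variate: mean squared correlation of Y_j with U_k
cross = np.corrcoef(Y.T, U.T)[:q, q:]          # q x r cross-loadings
red_Y = (cross ** 2).mean(axis=0)

# Same quantity from squared canonical correlations and Y's own loadings
own = np.corrcoef(Y.T, V.T)[:q, q:]            # q x r structure correlations
red_Y_alt = rho ** 2 * (own ** 2).mean(axis=0)

print("First canonical correlation:", round(float(rho[0]), 3))
print("Redundancy of Y given U_1:  ", round(float(red_Y[0]), 3))
print("Both redundancy formulas agree:", np.allclose(red_Y, red_Y_alt))
```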
Structure Correlations
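The structure correlations (canonical loadings) measure the relationship between original variables and canonical variates:
$$s_{X,k} = \operatorname{Corr}(X, U_k) = D_X^{-1/2}\, \Sigma_{XX}\, a_k, \qquad s_{Y,k} = \operatorname{Corr}(Y, V_k) = D_Y^{-1/2}\, \Sigma_{YY}\, b_k$$
where $D_X = \operatorname{diag}(\Sigma_{XX})$ and $D_Y = \operatorname{diag}(\Sigma_{YY})$.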
These correlations are often more interpretable than the canonical weights because they are less affected by multicollinearity.
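The sketch below contrasts weights and loadings on deliberately collinear data. It assumes the generalized symmetric eigenproblem $S_{XY} S_{YY}^{-1} S_{YX}\, a = \lambda\, S_{XX}\, a$ is solved with `scipy.linalg.eigh`, which normalizes eigenvectors so that $a^\top S_{XX}\, a = 1$; the data-generating choices are illustrative only.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
n, p, q = 300, 3, 2
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.2 * rng.standard_normal(n)    # first two X variables nearly collinear
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2]]) + rng.standard_normal((n, q))

Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
S_xx = Xc.T @ Xc / (n - 1)
S_yy = Yc.T @ Yc / (n - 1)
S_xy = Xc.T @ Yc / (n - 1)

# Generalized eigenproblem G a = lambda S_xx a, with G = S_xy S_yy^{-1} S_yx
G = S_xy @ np.linalg.solve(S_yy, S_xy.T)
lam, A = eigh(G, S_xx)          # eigenvalues in ascending order
a1 = A[:, -1]                   # weights of the first canonical variate
U1 = Xc @ a1                    # has unit sample variance by construction

# Structure correlations from the formula Corr(X, U_1) = D^{-1/2} S_xx a_1
loadings_formula = (S_xx @ a1) / np.sqrt(np.diag(S_xx))
# The same loadings computed directly as correlations
loadings_direct = np.array([np.corrcoef(X[:, j], U1)[0, 1] for j in range(p)])

print("Canonical weights a_1:  ", np.round(a1, 3))
print("Structure correlations: ", np.round(loadings_formula, 3))
print("Formula matches direct: ", np.allclose(loadings_formula, loadings_direct))
```

On data like this the weights of the two collinear variables can differ noticeably from each other while their loadings stay close, which is the behavior the preceding paragraph describes.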
Estimation and Computation
Sample CCA
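Given data matrices $X$ of dimension $n \times p$ and $Y$ of dimension $n \times q$, the population covariances are replaced by their sample estimates computed from the column-centered data $X_c$ and $Y_c$:
$$S_{XX} = \frac{1}{n-1} X_c^\top X_c, \qquad S_{YY} = \frac{1}{n-1} Y_c^\top Y_c, \qquad S_{XY} = \frac{1}{n-1} X_c^\top Y_c$$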
Regularized CCA
When $n < p + q$ or the sample covariance matrices are singular, regularization is essential:
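Definition: Regularized CCA
The regularized CCA objective adds penalties to the covariance matrices:
$$\max_{a,\, b}\; \frac{a^\top S_{XY}\, b}{\sqrt{a^\top \left(S_{XX} + \lambda_X I_p\right) a}\; \sqrt{b^\top \left(S_{YY} + \lambda_Y I_q\right) b}}$$
where $\lambda_X, \lambda_Y \ge 0$ are tuning parameters selected by cross-validation.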
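A minimal sketch of ridge-regularized CCA follows, assuming the penalties are added to the diagonals of the sample covariance matrices before whitening. The function name `regularized_cca` and the default values of `lam_x` and `lam_y` are placeholders; in practice they would be chosen by cross-validation, which is not shown here.

```python
import numpy as np

def regularized_cca(X, Y, lam_x=0.1, lam_y=0.1):
    """Ridge-regularized CCA; returns weights A, B and regularized correlations d."""
    n, p = X.shape
    q = Y.shape[1]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    S_xx = Xc.T @ Xc / (n - 1) + lam_x * np.eye(p)   # S_XX + lambda_X I
    S_yy = Yc.T @ Yc / (n - 1) + lam_y * np.eye(q)   # S_YY + lambda_Y I
    S_xy = Xc.T @ Yc / (n - 1)
    Lx, Ly = np.linalg.cholesky(S_xx), np.linalg.cholesky(S_yy)
    # Whitened cross-covariance K = Lx^{-1} S_xy Ly^{-T}
    K = np.linalg.solve(Lx, np.linalg.solve(Ly, S_xy.T).T)
    P, d, Qt = np.linalg.svd(K, full_matrices=False)
    A = np.linalg.solve(Lx.T, P)
    B = np.linalg.solve(Ly.T, Qt.T)
    return A, B, d

# High-dimensional illustration: fewer observations than variables
rng = np.random.default_rng(4)
n, p, q = 30, 50, 40
X = rng.standard_normal((n, p))
Y = 0.5 * X[:, :q] + rng.standard_normal((n, q))

Xc = X - X.mean(axis=0)
S_xx_plain = Xc.T @ Xc / (n - 1)
print("Rank of unregularized S_XX:", np.linalg.matrix_rank(S_xx_plain), "of", p)

A, B, d = regularized_cca(X, Y, lam_x=0.1, lam_y=0.1)
print("Leading regularized canonical correlations:", np.round(d[:3], 4))
```

The values in `d` are correlations with respect to the penalized variances, so they are shrunk relative to the in-sample correlation of the resulting variates; held-out correlation is the more informative quantity when tuning.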
Python Implementation
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import StandardScaler
from scipy import stats

np.random.seed(42)
n, p, q = 200, 5, 4

# Simulate correlated multivariate data.
# The cross-block value c must satisfy c * sqrt(p * q) < 1 so that the
# block covariance matrix is positive definite, as Cholesky requires.
Z = np.random.randn(n, p + q)
L = np.linalg.cholesky(
    np.block([
        [np.eye(p), 0.2 * np.ones((p, q))],
        [0.2 * np.ones((q, p)), np.eye(q)]
    ])
)
# Rows of Z @ L.T have covariance L @ L.T, the target block matrix
data = Z @ L.T
X, Y = data[:, :p], data[:, p:]

scaler_X, scaler_Y = StandardScaler(), StandardScaler()
X_scaled = scaler_X.fit_transform(X)
Y_scaled = scaler_Y.fit_transform(Y)

# --- sklearn CCA ---
cca = CCA(n_components=min(p, q))
U, V = cca.fit_transform(X_scaled, Y_scaled)

# Canonical correlations
can_corr = np.array([
    np.corrcoef(U[:, k], V[:, k])[0, 1]
    for k in range(min(p, q))
])
print("Canonical correlations:", np.round(can_corr, 4))

# Canonical weights: x_rotations_ / y_rotations_ map the data to the scores U, V
# (x_weights_ / y_weights_ are per-iteration weights on deflated data)
print("X canonical weights:\n", np.round(cca.x_rotations_, 4))
print("Y canonical weights:\n", np.round(cca.y_rotations_, 4))

# Structure correlations (loadings): correlations of each variable with each variate
X_loadings = np.corrcoef(X_scaled.T, U.T)[:p, p:]
Y_loadings = np.corrcoef(Y_scaled.T, V.T)[:q, q:]
print("X structure correlations:\n", np.round(X_loadings, 4))
print("Y structure correlations:\n", np.round(Y_loadings, 4))

# --- Wilks' Lambda ---
def wilks_lambda(can_corr, n, p, q):
    Lambda = np.prod(1 - can_corr**2)
    # Bartlett's chi-square approximation
    chi2_stat = -(n - 1 - (p + q + 1) / 2) * np.log(Lambda)
    df = p * q
    p_value = stats.chi2.sf(chi2_stat, df)  # upper-tail probability
    return Lambda, chi2_stat, df, p_value

Lambda, chi2, df, pval = wilks_lambda(can_corr, n, p, q)
print(f"Wilks' Lambda: {Lambda:.4f}, chi2: {chi2:.2f}, df: {df}, p: {pval:.2e}")

# --- Redundancy analysis ---
def redundancy_analysis(X_scaled, Y_scaled, U, V, can_corr):
    p = X_scaled.shape[1]
    q = Y_scaled.shape[1]
    # Structure correlations of each set with its own canonical variates
    S_X = np.corrcoef(X_scaled.T, U.T)[:p, p:]
    S_Y = np.corrcoef(Y_scaled.T, V.T)[:q, q:]
    # Redundancy per pair: squared canonical correlation times the proportion
    # of the set's variance captured by its own variate
    Red_Y = can_corr**2 * np.sum(S_Y**2, axis=0) / q  # Y explained by X variates
    Red_X = can_corr**2 * np.sum(S_X**2, axis=0) / p  # X explained by Y variates
    # Total redundancy
    total_red_Y = np.sum(Red_Y)
    total_red_X = np.sum(Red_X)
    return Red_X, Red_Y, total_red_X, total_red_Y

Red_X, Red_Y, tot_X, tot_Y = redundancy_analysis(X_scaled, Y_scaled, U, V, can_corr)
print(f"Total redundancy of Y explained by X variates: {tot_Y:.4f}")
print(f"Total redundancy of X explained by Y variates: {tot_X:.4f}")

# --- Manual CCA via SVD ---
def cca_svd(X, Y):
    n = X.shape[0]
    X_c = X - X.mean(axis=0)
    Y_c = Y - Y.mean(axis=0)
    S_XX = X_c.T @ X_c / (n - 1)
    S_YY = Y_c.T @ Y_c / (n - 1)
    S_XY = X_c.T @ Y_c / (n - 1)
    # Whitening with Cholesky factors: S_XX = Lx Lx^T, S_YY = Ly Ly^T
    Lx = np.linalg.cholesky(S_XX)
    Ly = np.linalg.cholesky(S_YY)
    # K = Lx^{-1} S_XY Ly^{-T}; its singular values are the canonical correlations
    K = np.linalg.solve(Lx, S_XY @ np.linalg.inv(Ly.T))
    U_svd, D, Vt = np.linalg.svd(K, full_matrices=False)
    # Map singular vectors back to canonical weights
    A = np.linalg.solve(Lx.T, U_svd)
    B = np.linalg.solve(Ly.T, Vt.T)
    return A, B, D  # D contains canonical correlations

A, B, rho = cca_svd(X_scaled, Y_scaled)
print("Manual CCA correlations:", np.round(rho, 4))
Interpretation Guidelines
Interpreting Canonical Correlation Analysis:
- Number of significant pairs: Use Bartlett's test sequentially or evaluate the scree plot of the canonical correlations $\rho_k$.
- Canonical correlations $\rho_k$ measure the strength of the $k$-th pair of canonical variates. Square them ($\rho_k^2$) for the proportion of shared variance between variates.
- Structure correlations (loadings) identify which original variables contribute most to each canonical variate. Loadings whose absolute value exceeds a chosen cutoff are typically considered meaningful.
- Redundancy indices quantify how much of one set's total variance is explained by the other set's variates, which makes them the most policy-relevant metric.
- Canonical weights are analogous to regression coefficients and are sensitive to multicollinearity. Prefer structure correlations for interpretation.
- Rotation: Canonical variates are determined up to sign. If sign reversal aids interpretation, flip the canonical weights and loadings of both variates in the pair together.
- Sample size: CCA requires $n \gg p + q$ for stable estimation. With $n < p + q$, use regularized CCA or dimension reduction.
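The sign convention mentioned in the list can be applied mechanically. The helper below is a hypothetical sketch: it flips each canonical pair so that its largest-magnitude $X$ loading is positive, and applies the same sign to both sides so the canonical correlation stays positive. The small arrays are made-up numbers for illustration.

```python
import numpy as np

def align_signs(A, B, X_loadings, Y_loadings):
    """Flip each canonical pair so its largest-magnitude X loading is positive."""
    r = X_loadings.shape[1]
    idx = np.argmax(np.abs(X_loadings), axis=0)       # dominant variable per pair
    signs = np.sign(X_loadings[idx, np.arange(r)])
    signs[signs == 0] = 1.0                           # leave all-zero columns unchanged
    # The same sign is applied to both sets, so Corr(U_k, V_k) is unchanged
    return A * signs, B * signs, X_loadings * signs, Y_loadings * signs

# Made-up weights and loadings for two canonical pairs
A = np.array([[-1.2, 0.3], [-0.4, 1.1]])
B = np.array([[-0.9, 0.2], [-0.1, 0.8]])
X_load = np.array([[-0.9, 0.2], [-0.3, 0.8]])
Y_load = np.array([[-0.7, 0.1], [-0.2, 0.6]])

A_f, B_f, X_load_f, Y_load_f = align_signs(A, B, X_load, Y_load)
print("Aligned X loadings:\n", X_load_f)
print("Aligned Y loadings:\n", Y_load_f)
```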
Extensions
Partial Least Squares (PLS) maximizes $\operatorname{Cov}(a^\top X, b^\top Y)$ without variance constraints (the weight vectors are constrained to unit norm instead), emphasizing covariance over correlation. Kernel CCA handles nonlinear relationships by mapping to reproducing kernel Hilbert spaces. Sparse CCA (Witten & Tibshirani, 2009) imposes $\ell_1$ penalties on canonical weights for interpretability in high-dimensional settings.
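To illustrate the PLS criterion, the sketch below uses the fact that the first pair of unit-norm weights maximizing the sample covariance is given by the leading singular vectors of the cross-covariance matrix. The simulated data and variable names are assumptions of the example; later PLS components depend on the deflation scheme and are not covered.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 300, 4, 3
X = rng.standard_normal((n, p))
Y = X[:, :q] + rng.standard_normal((n, q))

Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
S_xy = Xc.T @ Yc / (n - 1)

# First PLS pair: leading left/right singular vectors of S_xy (unit norm)
left, s, right_t = np.linalg.svd(S_xy)
a_pls, b_pls = left[:, 0], right_t[0]
cov_pls = np.cov(Xc @ a_pls, Yc @ b_pls)[0, 1]

# Any other unit-norm weights give a covariance no larger than the top singular value
a_rand = rng.standard_normal(p)
a_rand /= np.linalg.norm(a_rand)
b_rand = rng.standard_normal(q)
b_rand /= np.linalg.norm(b_rand)
cov_rand = np.cov(Xc @ a_rand, Yc @ b_rand)[0, 1]

print("Covariance of first PLS pair:", round(float(cov_pls), 4))
print("Largest singular value:      ", round(float(s[0]), 4))
print("Covariance, random weights:  ", round(float(cov_rand), 4))
```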