Principal Component Analysis (PCA)
Foundations of Statistics
Reducing Dimensionality While Preserving Variance
PCA finds orthogonal directions of maximum variance in high-dimensional data. By projecting onto principal components, you reduce complexity while retaining the most informative structure in the data.
- Genomics — Visualize thousands of gene expressions in 2D plots
- Image Processing — Compress facial recognition features while preserving identity information
- Finance — Extract key risk factors from correlated asset returns
When variables are strongly correlated, the first few components often capture most of the variance that is spread across hundreds of variables.
PCA finds orthogonal directions (principal components) of maximum variance in the data. It is used for dimensionality reduction, visualization, and feature extraction.
Definition: Principal Component
A principal component is a linear combination of the original features. The first component captures the maximum variance in the data; each subsequent component captures the maximum remaining variance, subject to being orthogonal to all previous components.
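One way to see this definition in action is to compute the components by hand. The sketch below assumes the standard result that the principal directions are the eigenvectors of the sample covariance matrix, sorted by eigenvalue. The toy data and names such as `X_toy` and `A` are illustrative and not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 samples, 3 correlated features
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.2]])
X_toy = rng.standard_normal((500, 3)) @ A.T

# Center the data, then form the sample covariance matrix
X_centered = X_toy - X_toy.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder so PC1 comes first
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each column of eigvecs holds the feature weights of one principal component
scores = X_centered @ eigvecs

# The directions are orthonormal, and the variance of each score column
# equals the corresponding eigenvalue
is_orthonormal = np.allclose(eigvecs.T @ eigvecs, np.eye(3))
score_variances = scores.var(axis=0, ddof=1)

print("Orthonormal directions:", is_orthonormal)
print("Eigenvalues:    ", np.round(eigvals, 3))
print("Score variances:", np.round(score_variances, 3))
```

Each column of `eigvecs` is the set of weights for one linear combination of features, which is why the projected columns in `scores` have decreasing variance from left to right.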
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
# High-dimensional data: 10 variables, some correlated
n = 200
true_components = 3
X_latent = np.random.randn(n, true_components)
loading_matrix = np.random.randn(10, true_components)
X = X_latent @ loading_matrix.T + np.random.randn(n, 10) * 0.5
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Scree plot
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].plot(range(1, 11), pca.explained_variance_ratio_*100, 'bo-', markersize=8)
axes[0].bar(range(1, 11), pca.explained_variance_ratio_*100, alpha=0.3, color='steelblue')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Variance Explained (%)')
axes[0].set_title('Scree Plot')
# Cumulative variance
cumvar = np.cumsum(pca.explained_variance_ratio_)*100
axes[1].plot(range(1, 11), cumvar, 'ro-', markersize=8)
axes[1].axhline(80, color='green', linestyle='--', label='80% threshold')
axes[1].axhline(95, color='blue', linestyle='--', label='95% threshold')
axes[1].set_title('Cumulative Variance Explained')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Variance (%)')
axes[1].legend()
n_comp_80 = np.argmax(cumvar >= 80) + 1
n_comp_95 = np.argmax(cumvar >= 95) + 1
print(f"Components needed for 80% variance: {n_comp_80}")
print(f"Components needed for 95% variance: {n_comp_95}")
# 2D visualization using first 2 PCs
axes[2].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5, color='steelblue')
axes[2].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
axes[2].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
axes[2].set_title('Data in PC Space')
plt.tight_layout()
plt.savefig('pca.png', dpi=150)
plt.show()
# Biplot: feature loadings
fig, ax = plt.subplots(figsize=(8, 8))
loadings = pca.components_.T
ax.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.3, s=10, color='gray')
for i, (x, y) in enumerate(loadings[:, :2]):
    ax.arrow(0, 0, x*5, y*5, head_width=0.1, head_length=0.05,
             fc='red', ec='red')
    ax.text(x*5.5, y*5.5, f'X{i+1}', fontsize=9, color='red', ha='center')
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_title('PCA Biplot')
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.tight_layout()
plt.savefig('pca_biplot.png', dpi=150)
plt.show()
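To check how much information a reduced representation keeps, one option is to project onto k components, map back to the original space, and measure the reconstruction error. This is a minimal sketch that regenerates similar synthetic data with a different random generator, so its numbers will not match the plots above. The names `X_demo`, `latent`, and `weights` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Assumed setup mirroring the example: 3 latent factors, 10 observed features
latent = rng.standard_normal((200, 3))
weights = rng.standard_normal((10, 3))
X_demo = latent @ weights.T + 0.5 * rng.standard_normal((200, 10))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

errors = {}
for k in (1, 3, 10):
    pca_k = PCA(n_components=k)
    Z = pca_k.fit_transform(X_demo_scaled)   # project down to k dimensions
    X_back = pca_k.inverse_transform(Z)      # map back to 10 dimensions
    errors[k] = np.mean((X_demo_scaled - X_back) ** 2)
    print(f"k={k:2d}  mean squared reconstruction error = {errors[k]:.4f}")
```

The error shrinks as k grows and is zero up to floating-point precision when all 10 components are kept, since the full set of components is just a rotation of the data.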
Standardize Before PCA
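Standardize features before PCA whenever they are measured on different scales or units. PCA maximizes variance, so without standardization the features with the largest raw variance will dominate the principal components.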
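The effect is easy to demonstrate. The sketch below uses three hypothetical, independent features on very different scales (the feature names and distributions are assumptions made for illustration) and compares the explained variance ratios with and without `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical features on very different scales
height_m = rng.normal(1.7, 0.1, size=300)       # metres
income = rng.normal(50_000, 15_000, size=300)   # currency units
age = rng.normal(40, 12, size=300)              # years
X_mixed = np.column_stack([height_m, income, age])

# PCA on raw data versus PCA on standardized data
raw_ratio = PCA().fit(X_mixed).explained_variance_ratio_
X_mixed_scaled = StandardScaler().fit_transform(X_mixed)
scaled_ratio = PCA().fit(X_mixed_scaled).explained_variance_ratio_

print("Without scaling:", np.round(raw_ratio, 4))
print("With scaling:   ", np.round(scaled_ratio, 4))
```

Without scaling, the first component is essentially the income axis, simply because income has the largest numeric variance. After scaling, the variance is spread much more evenly, as expected for independent features.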
Key Takeaways
Summary: PCA
- PCs are orthogonal (uncorrelated) linear combinations of original features
- Scree plot and cumulative variance guide how many PCs to keep
- Standardize features before PCA — otherwise high-variance features dominate
- Loadings show how original features contribute to each PC
- PCA assumes linearity — use t-SNE or UMAP for nonlinear dimensionality reduction
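As a practical shortcut for the cumulative-variance rule in the summary, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and keeps just enough components to reach that fraction of explained variance. The sketch below assumes synthetic data similar to the earlier example; `X_syn`, `latent`, and `weights` are illustrative names, and the 95% target is only one possible choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Assumed synthetic data: 3 latent factors driving 10 observed features
latent = rng.standard_normal((200, 3))
weights = rng.standard_normal((10, 3))
X_syn = latent @ weights.T + 0.5 * rng.standard_normal((200, 10))
X_syn_scaled = StandardScaler().fit_transform(X_syn)

# A float in (0, 1) requests the smallest number of components whose
# cumulative explained variance reaches that fraction
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_syn_scaled)

kept = pca_95.n_components_
total = pca_95.explained_variance_ratio_.sum()
print(f"Components kept: {kept}")
print(f"Variance retained: {total:.3f}")
```

The fitted attribute `n_components_` reports how many components were actually kept, which can then be compared against the scree plot as a sanity check.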