High-Dimensional Statistics
Advanced Statistical Methods
When Features Outnumber Samples
High-dimensional statistics addresses the challenge of p >> n problems where classical methods break down. Sparse estimation techniques like LASSO, SCAD, and MCP recover signals from high-dimensional noise.
- Genomics — Identify important genes from thousands of expression features with few samples
- Image recognition — Select relevant pixels or features from high-dimensional image data
- Text classification — Build sparse models from tens of thousands of word features
High-dimensional statistics finds needles in haystacks when the haystack has more needles than hay.
When the number of parameters grows with or exceeds the sample size , classical asymptotic theory breaks down. The classical estimator does not exist when , confidence intervals lose coverage, and the maximum likelihood estimator may not be unique. High-dimensional statistics develops the theory and methods for inference under this regime.
The Curse of Dimensionality
DfCurse of Dimensionality
As dimension :
- The volume of the unit ball relative to the unit cube .
- All pairwise distances concentrate: .
- The fraction of data in any -ball around a point vanishes: .
ThDimension Dependence
For nonparametric regression with features and observations, the minimax optimal rate is:
where is the smoothness of . The rate deteriorates exponentially with , rendering nonparametric methods useless unless .
Sparse Estimation
Sparsity assumes that only of the parameters are nonzero. This is the key structural assumption enabling estimation in high dimensions.
DfSparsity Condition
The true parameter vector satisfies where . The "norm" counts nonzero entries (not a true norm).
LASSO (Least Absolute Shrinkage and Selection Operator)
LASSO Estimator
The penalty induces exact zeros in , performing automatic variable selection. The regularization path is piecewise linear and can be computed exactly via the LARS algorithm.
ThLASSO Oracle Inequality
Under the irrepresentable condition (or restricted eigenvalue condition) with :
with probability at least . This matches the oracle rate up to the factor.
SCAD (Smoothly Clipped Absolute Deviation)
SCAD Penalty
The SCAD penalty is quadratic near zero (like ridge) and constant for large coefficients (like hard thresholding). It satisfies three properties: unbiasedness (for large coefficients), sparsity, and continuity.
MCP (Minimax Concave Penalty)
MCP
The MCP interpolates between LASSO () and hard thresholding (). For , MCP is an unbiased, sparse, and continuous penalty.
Oracle Properties
Both SCAD and MCP satisfy the oracle property: with appropriate ,
where is the estimator obtained by knowing the true support.
Compressed Sensing
DfCompressed Sensing
Recover a sparse signal with from linear measurements where is a measurement matrix.
ThRestricted Isometry Property (RIP)
The matrix satisfies the RIP of order with constant if:
for all -sparse vectors . If , then -minimization recovers any -sparse exactly from .
Random matrices with i.i.d. sub-Gaussian entries satisfy the RIP with high probability when . This is the foundation of the compressed sensing revolution.
Random Projections
Johnson-Lindenstrauss Lemma
For any and points in , there exists a linear map with such that for all :
The dimension reduction is independent of the original dimension —distances are approximately preserved after projecting to dimensions.
Concentration of Measures
ThSub-Gaussian Concentration
If are independent sub-Gaussian random variables with , then for :
Implications for High-Dimensional Inference
Concentration inequalities provide the foundation for:
- Talagrand's inequality: Controls the deviation of empirical processes
- Rademacher complexity: Measures the complexity of function classes for uniform convergence
- Bernstein-type bounds: Tighter than Hoeffding when individual contributions are bounded
- Matrix concentration: Controls singular values of random matrices via matrix Bernstein
Double Descent
DfDouble Descent
In overparameterized models (), the test error exhibits a non-monotonic behavior: it first decreases (underfitting regime), increases near (interpolation threshold), then decreases again in the overparameterized regime. This "double descent" contradicts the classical bias–variance tradeoff.
ThDouble Descent for Linear Regression
For the minimum -norm interpolant when , the excess risk decomposes as:
The variance term dominates near (explaining the peak), while for , the variance decreases because the minimum-norm solution spreads mass across many directions.
Benign Overfitting
ThBenign Overfitting (Bartlett et al., 2020)
For minimum-norm interpolating linear regression, the excess risk is:
The key condition for benign overfitting is that the data covariance must have a small effective rank—most eigenvalues must be concentrated near zero, with a few large ones.
When Overfitting is Harmful
Benign overfitting requires:
- The signal must align with the top eigenvectors of the covariance (weak dependence on noise directions)
- The effective rank must be small relative to
- The noise must be sub-Gaussian
In general settings (e.g., heterogeneous signals, heavy-tailed noise), overfitting can still be harmful.
Python Implementation
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
np.random.seed(42)
# --- Generate high-dimensional data ---
n, p, s = 100, 500, 10
X = np.random.randn(n, p)
beta_true = np.zeros(p)
support = np.random.choice(p, s, replace=False)
beta_true[support] = np.random.randn(s) * 3
y = X @ beta_true + np.random.randn(n) * 0.5
print(f"Setup: n={n}, p={p}, s={s}")
print(f"True support: {sorted(support)}")
# --- LASSO with cross-validation ---
lasso_cv = LassoCV(cv=5, random_state=42).fit(X, y)
beta_lasso = lasso_cv.coef_
n_selected = np.sum(beta_lasso != 0)
true_pos = len(set(np.where(beta_lasso != 0)[0]) & set(support))
false_pos = n_selected - true_pos
false_neg = s - true_pos
print(f"\nLASSO Results (λ={lasso_cv.alpha_:.4f}):")
print(f" Selected: {n_selected} variables")
print(f" True positives: {true_pos}/{s}")
print(f" False positives: {false_pos}")
print(f" False negatives: {false_neg}")
print(f" MSE: {mean_squared_error(y, X @ beta_lasso):.4f}")
# --- SCAD penalty (proximal gradient) ---
def scad_penalty(t, lam, a=3.7):
t_abs = np.abs(t)
penalty = np.where(t_abs <= lam, lam * t_abs,
np.where(t_abs <= a * lam,
(-t_abs**2 + 2*a*lam*t_abs - lam**2) / (2*(a-1)),
(a+1)*lam**2 / 2))
return penalty
def scad_prox(v, lam, a=3.7):
"""Proximal operator for SCAD."""
result = np.zeros_like(v)
for i in range(len(v)):
t = v[i]
t_abs = np.abs(t)
if t_abs <= 2 * lam:
result[i] = np.sign(t) * max(t_abs - lam, 0)
elif t_abs <= a * lam:
result[i] = (t - np.sign(t) * a * lam / (a - 1)) / (1 - 1/(a-1))
result[i] = np.sign(t) * max(np.abs(result[i]), 0)
else:
result[i] = t
return result
# Proximal gradient descent for SCAD
def fit_scad(X, y, lam, a=3.7, lr=0.01, max_iter=1000):
n, p = X.shape
beta = np.zeros(p)
L = np.linalg.norm(X.T @ X) / n
for iteration in range(max_iter):
grad = -X.T @ (y - X @ beta) / n
beta_new = scad_prox(beta - lr * grad, lam * lr, a)
if np.linalg.norm(beta_new - beta) < 1e-8:
break
beta = beta_new
return beta
beta_scad = fit_scad(X, y, lam=0.1)
n_scad = np.sum(beta_scad != 0)
true_pos_scad = len(set(np.where(beta_scad != 0)[0]) & set(support))
print(f"\nSCAD Results:")
print(f" Selected: {n_scad} variables")
print(f" True positives: {true_pos_scad}/{s}")
# --- MCP penalty ---
def mcp_penalty(t, lam, gamma=3.0):
t_abs = np.abs(t)
return np.where(t_abs <= gamma * lam,
lam * t_abs - t_abs**2 / (2 * gamma),
gamma * lam**2 / 2)
def mcp_prox(v, lam, gamma=3.0):
result = np.zeros_like(v)
for i in range(len(v)):
t_abs = np.abs(v[i])
if t_abs <= lam:
result[i] = np.sign(v[i]) * max(t_abs - lam, 0)
elif t_abs <= gamma * lam:
result[i] = v[i] * (gamma - 1) / gamma - np.sign(v[i]) * lam / gamma
result[i] = np.sign(v[i]) * max(np.abs(result[i]), 0)
else:
result[i] = v[i]
return result
def fit_mcp(X, y, lam, gamma=3.0, lr=0.01, max_iter=1000):
n, p = X.shape
beta = np.zeros(p)
for _ in range(max_iter):
grad = -X.T @ (y - X @ beta) / n
beta_new = mcp_prox(beta - lr * grad, lam * lr, gamma)
if np.linalg.norm(beta_new - beta) < 1e-8:
break
beta = beta_new
return beta
beta_mcp = fit_mcp(X, y, lam=0.1)
n_mcp = np.sum(beta_mcp != 0)
true_pos_mcp = len(set(np.where(beta_mcp != 0)[0]) & set(support))
print(f"\nMCP Results:")
print(f" Selected: {n_mcp} variables")
print(f" True positives: {true_pos_mcp}/{s}")
# --- Double descent demonstration ---
ratios = np.linspace(0.2, 3.0, 50)
test_errors = []
train_errors = []
for ratio in ratios:
n_i = int(ratio * p) if ratio * p < n * 3 else int(n * 3)
X_sub = X[:n_i, :n_i] if n_i <= p else X[:min(n_i, n), :p]
y_sub = y[:X_sub.shape[0]]
if X_sub.shape[0] >= X_sub.shape[1]:
try:
beta_hat = np.linalg.lstsq(X_sub, y_sub, rcond=None)[0]
except:
beta_hat = np.zeros(X_sub.shape[1])
else:
# Minimum norm solution
beta_hat = X_sub.T @ np.linalg.solve(X_sub @ X_sub.T + 1e-10*np.eye(X_sub.shape[0]), y_sub)
y_pred = X[:X_sub.shape[0], :X_sub.shape[1]] @ beta_hat
test_errors.append(mean_squared_error(y[:len(y_pred)], y_pred))
train_errors.append(np.mean((X_sub @ beta_hat - y_sub)**2))
plt.figure(figsize=(8, 5))
plt.plot(ratios, test_errors, 'b-', label='Test Error', lw=2)
plt.plot(ratios, train_errors, 'r--', label='Train Error', lw=2)
plt.axvline(x=1.0, color='gray', ls=':', label='p/n = 1')
plt.xlabel('Ratio p/n')
plt.ylabel('Mean Squared Error')
plt.title('Double Descent Phenomenon')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("double_descent.png", dpi=150, bbox_inches="tight")
plt.show()
print("\nDouble descent: test error peaks near p/n = 1, then decreases in overparameterized regime")
Summary
Key Takeaways: High-Dimensional Statistics
- The curse of dimensionality makes nonparametric methods rate-infeasible when is large; sparsity (only nonzero coefficients) is the key structural assumption.
- LASSO ( penalty) achieves near-oracle rates with penalty; SCAD and MCP achieve the oracle property (unbiased + sparse) via nonconvex penalties.
- Compressed sensing recovers sparse signals from measurements via RIP; random projections reduce dimension to while preserving distances.
- Concentration of measures provides the probabilistic machinery for high-dimensional guarantees; sub-Gaussian tails are essential.
- Double descent and benign overfitting reveal that overparameterized models can generalize well when the signal aligns with top eigenvectors and the effective rank is small.