High-Dimensional Statistics

Advanced Statistical Methods

When Features Outnumber Samples

High-dimensional statistics addresses the challenge of p >> n problems where classical methods break down. Sparse estimation techniques like LASSO, SCAD, and MCP recover signals from high-dimensional noise.

Genomics — Identify important genes from thousands of expression features with few samples
Image recognition — Select relevant pixels or features from high-dimensional image data
Text classification — Build sparse models from tens of thousands of word features

High-dimensional statistics finds needles in haystacks when the haystack has more needles than hay.

When the number of parameters $p$ grows with or exceeds the sample size $n$ , classical asymptotic theory breaks down. The classical estimator $\hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ does not exist when $p > n$ , confidence intervals lose coverage, and the maximum likelihood estimator may not be unique. High-dimensional statistics develops the theory and methods for inference under this regime.

The Curse of Dimensionality

DfCurse of Dimensionality

As dimension $p \to \infty$ :

The volume of the unit ball $V_p = \pi^{p/2}/\Gamma(p/2 + 1) \to 0$ relative to the unit cube $[-1, 1]^p$ .
All pairwise distances concentrate: $\frac{\max_{i,j}\|\mathbf{x}_i - \mathbf{x}_j\| - \min_{i,j}\|\mathbf{x}_i - \mathbf{x}_j\|}{\min_{i,j}\|\mathbf{x}_i - \mathbf{x}_j\|} \to 0$ .
The fraction of data in any $\epsilon$ -ball around a point vanishes: $V_p(\epsilon)/V_p(1) = \epsilon^p \to 0$ .

ThDimension Dependence

For nonparametric regression with $p$ features and $n$ observations, the minimax optimal rate is:

\inf_{\hat{f}} \sup_{f \in \mathcal{F}} \mathbb{E}\|\hat{f} - f\|^2 \asymp n^{-2\alpha/(2\alpha + p)}

where $\alpha$ is the smoothness of $f$ . The rate deteriorates exponentially with $p$ , rendering nonparametric methods useless unless $n \gg e^p$ .

Sparse Estimation

Sparsity assumes that only $s \ll p$ of the $p$ parameters are nonzero. This is the key structural assumption enabling estimation in high dimensions.

DfSparsity Condition

The true parameter vector satisfies $\|\beta_0\|_0 = |\{j : \beta_{0,j} \neq 0\}| = s$ where $s \ll p$ . The $\ell_0$ "norm" counts nonzero entries (not a true norm).

LASSO (Least Absolute Shrinkage and Selection Operator)

LASSO Estimator

\hat{\beta}_{\text{LASSO}} = \arg\min_{\beta} \left\{\frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|_2^2 + \lambda\|\beta\|_1\right\}

The $\ell_1$ penalty induces exact zeros in $\hat{\beta}$ , performing automatic variable selection. The regularization path $\{\hat{\beta}(\lambda)\}_{\lambda \geq 0}$ is piecewise linear and can be computed exactly via the LARS algorithm.

ThLASSO Oracle Inequality

Under the irrepresentable condition (or restricted eigenvalue condition) with $\lambda \asymp \sigma\sqrt{s\log(p)/n}$ :

\|\hat{\beta} - \beta_0\|_2^2 \leq C_1 \frac{s\sigma^2\log p}{n}

with probability at least $1 - c_1p^{-c_2}$ . This matches the oracle rate up to the $\log p$ factor.

SCAD (Smoothly Clipped Absolute Deviation)

SCAD Penalty

p_\lambda(|t|) = \begin{cases} \lambda|t| & |t| \leq \lambda \\ \frac{-t^2 + 2a\lambda|t| - \lambda^2}{2(a-1)} & \lambda < |t| \leq a\lambda \\ \frac{(a+1)\lambda^2}{2} & |t| > a\lambda \end{cases}

The SCAD penalty is quadratic near zero (like ridge) and constant for large coefficients (like hard thresholding). It satisfies three properties: unbiasedness (for large coefficients), sparsity, and continuity.

MCP (Minimax Concave Penalty)

MCP

p_\lambda(|t|) = \begin{cases} \lambda|t| - \frac{t^2}{2\gamma} & |t| \leq \gamma\lambda \\ \frac{\gamma\lambda^2}{2} & |t| > \gamma\lambda \end{cases}

The MCP interpolates between LASSO ( $\gamma \to \infty$ ) and hard thresholding ( $\gamma \to 1$ ). For $\gamma > 1$ , MCP is an unbiased, sparse, and continuous penalty.

Oracle Properties

Both SCAD and MCP satisfy the oracle property: with appropriate $\lambda_n$ ,

\hat{\beta}_j = \begin{cases} 0 & \text{if } \beta_{0,j} = 0 \quad \text{(selection consistency)} \\ \hat{\beta}_j^{\text{oracle}} + o_p(1) & \text{if } \beta_{0,j} \neq 0 \quad \text{(asymptotic normality)} \end{cases}

where $\hat{\beta}_j^{\text{oracle}}$ is the estimator obtained by knowing the true support.

Compressed Sensing

DfCompressed Sensing

Recover a sparse signal $\mathbf{x} \in \mathbb{R}^p$ with $\|\mathbf{x}\|_0 = s$ from $m \ll p$ linear measurements $\mathbf{y} = \mathbf{A}\mathbf{x}$ where $\mathbf{A} \in \mathbb{R}^{m \times p}$ is a measurement matrix.

ThRestricted Isometry Property (RIP)

The matrix $\mathbf{A}$ satisfies the RIP of order $k$ with constant $\delta_k$ if:

(1 - \delta_k)\|\mathbf{x}\|_2^2 \leq \|\mathbf{A}\mathbf{x}\|_2^2 \leq (1 + \delta_k)\|\mathbf{x}\|_2^2

for all $k$ -sparse vectors $\mathbf{x}$ . If $\delta_{2s} < \sqrt{2} - 1$ , then $\ell_1$ -minimization recovers any $s$ -sparse $\mathbf{x}$ exactly from $\mathbf{y} = \mathbf{A}\mathbf{x}$ .

Random matrices with i.i.d. sub-Gaussian entries satisfy the RIP with high probability when $m \geq C\,s\log(p/s)$ . This is the foundation of the compressed sensing revolution.

Random Projections

Johnson-Lindenstrauss Lemma

For any $\epsilon \in (0, 1)$ and $n$ points in $\mathbb{R}^p$ , there exists a linear map $\mathbf{\Pi}: \mathbb{R}^p \to \mathbb{R}^k$ with $k = O(\epsilon^{-2}\log n)$ such that for all $i, j$ :

(1 - \epsilon)\|\mathbf{x}_i - \mathbf{x}_j\|^2 \leq \|\mathbf{\Pi}\mathbf{x}_i - \mathbf{\Pi}\mathbf{x}_j\|^2 \leq (1 + \epsilon)\|\mathbf{x}_i - \mathbf{x}_j\|^2

The dimension reduction is independent of the original dimension $p$ —distances are approximately preserved after projecting to $O(\log n)$ dimensions.

Concentration of Measures

ThSub-Gaussian Concentration

If $X_1, \dots, X_n$ are independent sub-Gaussian random variables with $\|X_i\|_{\psi_2} \leq K$ , then for $\mathbf{a} \in \mathbb{R}^n$ :

\mathbb{P}\left(\left|\sum_{i=1}^n a_i X_i - \mathbb{E}\left[\sum_{i=1}^n a_i X_i\right]\right| \geq t\right) \leq 2\exp\left(-\frac{ct^2}{K^2\|\mathbf{a}\|_2^2}\right)

Implications for High-Dimensional Inference

Concentration inequalities provide the foundation for:

Talagrand's inequality: Controls the deviation of empirical processes
Rademacher complexity: Measures the complexity of function classes for uniform convergence
Bernstein-type bounds: Tighter than Hoeffding when individual contributions are bounded
Matrix concentration: Controls singular values of random matrices via matrix Bernstein

Double Descent

DfDouble Descent

In overparameterized models ( $p > n$ ), the test error exhibits a non-monotonic behavior: it first decreases (underfitting regime), increases near $p \approx n$ (interpolation threshold), then decreases again in the overparameterized regime. This "double descent" contradicts the classical bias–variance tradeoff.

ThDouble Descent for Linear Regression

For the minimum $\ell_2$ -norm interpolant $\hat{\beta} = \mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{y}$ when $p > n$ , the excess risk decomposes as:

\mathbb{E}\|\hat{\beta} - \beta_0\|_2^2 = \underbrace{\frac{\sigma^2(p - n)}{n - 1}}_{\text{variance}} + \underbrace{\frac{s\sigma^2\log(p/n)}{n}}_{\text{bias from regularization}}

The variance term dominates near $p = n$ (explaining the peak), while for $p \gg n$ , the variance decreases because the minimum-norm solution spreads mass across many directions.

Benign Overfitting

ThBenign Overfitting (Bartlett et al., 2020)

For minimum-norm interpolating linear regression, the excess risk is:

\mathbb{E}\left[\frac{\|\mathbf{X}(\hat{\beta} - \beta_0)\|^2}{n}\right] \leq \underbrace{\frac{\sigma^2}{n}\left(\frac{\text{tr}(\mathbf{X}\mathbf{X}^\top)^{-1}\sigma^2}{\sigma^2}\right)}_{\text{effective noise}} + \text{bias}^2

The key condition for benign overfitting is that the data covariance $\mathbf{X}^\top\mathbf{X}/n$ must have a small effective rank—most eigenvalues must be concentrated near zero, with a few large ones.

When Overfitting is Harmful

Benign overfitting requires:

The signal must align with the top eigenvectors of the covariance (weak dependence on noise directions)
The effective rank $r_{\text{eff}} = \text{tr}(\Sigma)/\|\Sigma\|_{\text{op}}$ must be small relative to $n$
The noise must be sub-Gaussian

In general settings (e.g., heterogeneous signals, heavy-tailed noise), overfitting can still be harmful.

Python Implementation

import numpy as np
import pandas as pd
from scipy.optimize import minimize
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Generate high-dimensional data ---
n, p, s = 100, 500, 10
X = np.random.randn(n, p)
beta_true = np.zeros(p)
support = np.random.choice(p, s, replace=False)
beta_true[support] = np.random.randn(s) * 3
y = X @ beta_true + np.random.randn(n) * 0.5

print(f"Setup: n={n}, p={p}, s={s}")
print(f"True support: {sorted(support)}")

# --- LASSO with cross-validation ---
lasso_cv = LassoCV(cv=5, random_state=42).fit(X, y)
beta_lasso = lasso_cv.coef_

n_selected = np.sum(beta_lasso != 0)
true_pos = len(set(np.where(beta_lasso != 0)[0]) & set(support))
false_pos = n_selected - true_pos
false_neg = s - true_pos

print(f"\nLASSO Results (λ={lasso_cv.alpha_:.4f}):")
print(f"  Selected: {n_selected} variables")
print(f"  True positives: {true_pos}/{s}")
print(f"  False positives: {false_pos}")
print(f"  False negatives: {false_neg}")
print(f"  MSE: {mean_squared_error(y, X @ beta_lasso):.4f}")

# --- SCAD penalty (proximal gradient) ---
def scad_penalty(t, lam, a=3.7):
    t_abs = np.abs(t)
    penalty = np.where(t_abs <= lam, lam * t_abs,
                       np.where(t_abs <= a * lam,
                               (-t_abs**2 + 2*a*lam*t_abs - lam**2) / (2*(a-1)),
                               (a+1)*lam**2 / 2))
    return penalty

def scad_prox(v, lam, a=3.7):
    """Proximal operator for SCAD."""
    result = np.zeros_like(v)
    for i in range(len(v)):
        t = v[i]
        t_abs = np.abs(t)
        if t_abs <= 2 * lam:
            result[i] = np.sign(t) * max(t_abs - lam, 0)
        elif t_abs <= a * lam:
            result[i] = (t - np.sign(t) * a * lam / (a - 1)) / (1 - 1/(a-1))
            result[i] = np.sign(t) * max(np.abs(result[i]), 0)
        else:
            result[i] = t
    return result

# Proximal gradient descent for SCAD
def fit_scad(X, y, lam, a=3.7, lr=0.01, max_iter=1000):
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.norm(X.T @ X) / n

    for iteration in range(max_iter):
        grad = -X.T @ (y - X @ beta) / n
        beta_new = scad_prox(beta - lr * grad, lam * lr, a)
        if np.linalg.norm(beta_new - beta) < 1e-8:
            break
        beta = beta_new
    return beta

beta_scad = fit_scad(X, y, lam=0.1)
n_scad = np.sum(beta_scad != 0)
true_pos_scad = len(set(np.where(beta_scad != 0)[0]) & set(support))
print(f"\nSCAD Results:")
print(f"  Selected: {n_scad} variables")
print(f"  True positives: {true_pos_scad}/{s}")

# --- MCP penalty ---
def mcp_penalty(t, lam, gamma=3.0):
    t_abs = np.abs(t)
    return np.where(t_abs <= gamma * lam,
                    lam * t_abs - t_abs**2 / (2 * gamma),
                    gamma * lam**2 / 2)

def mcp_prox(v, lam, gamma=3.0):
    result = np.zeros_like(v)
    for i in range(len(v)):
        t_abs = np.abs(v[i])
        if t_abs <= lam:
            result[i] = np.sign(v[i]) * max(t_abs - lam, 0)
        elif t_abs <= gamma * lam:
            result[i] = v[i] * (gamma - 1) / gamma - np.sign(v[i]) * lam / gamma
            result[i] = np.sign(v[i]) * max(np.abs(result[i]), 0)
        else:
            result[i] = v[i]
    return result

def fit_mcp(X, y, lam, gamma=3.0, lr=0.01, max_iter=1000):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta) / n
        beta_new = mcp_prox(beta - lr * grad, lam * lr, gamma)
        if np.linalg.norm(beta_new - beta) < 1e-8:
            break
        beta = beta_new
    return beta

beta_mcp = fit_mcp(X, y, lam=0.1)
n_mcp = np.sum(beta_mcp != 0)
true_pos_mcp = len(set(np.where(beta_mcp != 0)[0]) & set(support))
print(f"\nMCP Results:")
print(f"  Selected: {n_mcp} variables")
print(f"  True positives: {true_pos_mcp}/{s}")

# --- Double descent demonstration ---
ratios = np.linspace(0.2, 3.0, 50)
test_errors = []
train_errors = []

for ratio in ratios:
    n_i = int(ratio * p) if ratio * p < n * 3 else int(n * 3)
    X_sub = X[:n_i, :n_i] if n_i <= p else X[:min(n_i, n), :p]
    y_sub = y[:X_sub.shape[0]]

    if X_sub.shape[0] >= X_sub.shape[1]:
        try:
            beta_hat = np.linalg.lstsq(X_sub, y_sub, rcond=None)[0]
        except:
            beta_hat = np.zeros(X_sub.shape[1])
    else:
        # Minimum norm solution
        beta_hat = X_sub.T @ np.linalg.solve(X_sub @ X_sub.T + 1e-10*np.eye(X_sub.shape[0]), y_sub)

    y_pred = X[:X_sub.shape[0], :X_sub.shape[1]] @ beta_hat
    test_errors.append(mean_squared_error(y[:len(y_pred)], y_pred))
    train_errors.append(np.mean((X_sub @ beta_hat - y_sub)**2))

plt.figure(figsize=(8, 5))
plt.plot(ratios, test_errors, 'b-', label='Test Error', lw=2)
plt.plot(ratios, train_errors, 'r--', label='Train Error', lw=2)
plt.axvline(x=1.0, color='gray', ls=':', label='p/n = 1')
plt.xlabel('Ratio p/n')
plt.ylabel('Mean Squared Error')
plt.title('Double Descent Phenomenon')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("double_descent.png", dpi=150, bbox_inches="tight")
plt.show()

print("\nDouble descent: test error peaks near p/n = 1, then decreases in overparameterized regime")

Summary

Key Takeaways: High-Dimensional Statistics

The curse of dimensionality makes nonparametric methods rate-infeasible when $p$ is large; sparsity (only $s \ll p$ nonzero coefficients) is the key structural assumption.
LASSO ( $\ell_1$ penalty) achieves near-oracle rates with $O(\log p)$ penalty; SCAD and MCP achieve the oracle property (unbiased + sparse) via nonconvex penalties.
Compressed sensing recovers sparse signals from $O(s\log(p/s))$ measurements via RIP; random projections reduce dimension to $O(\log n)$ while preserving distances.
Concentration of measures provides the probabilistic machinery for high-dimensional guarantees; sub-Gaussian tails are essential.
Double descent and benign overfitting reveal that overparameterized models can generalize well when the signal aligns with top eigenvectors and the effective rank is small.

High-Dimensional Statistics

High-Dimensional Statistics

When Features Outnumber Samples

The Curse of Dimensionality

DfCurse of Dimensionality

ThDimension Dependence

Sparse Estimation

DfSparsity Condition

LASSO (Least Absolute Shrinkage and Selection Operator)

LASSO Estimator

ThLASSO Oracle Inequality

SCAD (Smoothly Clipped Absolute Deviation)

SCAD Penalty

MCP (Minimax Concave Penalty)

MCP

Compressed Sensing

DfCompressed Sensing

ThRestricted Isometry Property (RIP)

Random Projections

Johnson-Lindenstrauss Lemma

Concentration of Measures

ThSub-Gaussian Concentration

Double Descent

DfDouble Descent

ThDouble Descent for Linear Regression

Benign Overfitting

ThBenign Overfitting (Bartlett et al., 2020)

Python Implementation

Summary

Key Takeaways: High-Dimensional Statistics

Premium Content

Need Expert Statistics Help?