🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

High-Dimensional Statistics

Advanced Statistical MethodsModern Methods🟢 Free Lesson

Advertisement

High-Dimensional Statistics

Advanced Statistical Methods

When Features Outnumber Samples

High-dimensional statistics addresses the challenge of p >> n problems where classical methods break down. Sparse estimation techniques like LASSO, SCAD, and MCP recover signals from high-dimensional noise.

  • Genomics — Identify important genes from thousands of expression features with few samples
  • Image recognition — Select relevant pixels or features from high-dimensional image data
  • Text classification — Build sparse models from tens of thousands of word features

High-dimensional statistics finds needles in haystacks when the haystack has more needles than hay.


When the number of parameters pp grows with or exceeds the sample size nn, classical asymptotic theory breaks down. The classical estimator β^=(XX)1Xy\hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} does not exist when p>np > n, confidence intervals lose coverage, and the maximum likelihood estimator may not be unique. High-dimensional statistics develops the theory and methods for inference under this regime.


The Curse of Dimensionality

DfCurse of Dimensionality

As dimension pp \to \infty:

  1. The volume of the unit ball Vp=πp/2/Γ(p/2+1)0V_p = \pi^{p/2}/\Gamma(p/2 + 1) \to 0 relative to the unit cube [1,1]p[-1, 1]^p.
  2. All pairwise distances concentrate: maxi,jxixjmini,jxixjmini,jxixj0\frac{\max_{i,j}\|\mathbf{x}_i - \mathbf{x}_j\| - \min_{i,j}\|\mathbf{x}_i - \mathbf{x}_j\|}{\min_{i,j}\|\mathbf{x}_i - \mathbf{x}_j\|} \to 0.
  3. The fraction of data in any ϵ\epsilon-ball around a point vanishes: Vp(ϵ)/Vp(1)=ϵp0V_p(\epsilon)/V_p(1) = \epsilon^p \to 0.

ThDimension Dependence

For nonparametric regression with pp features and nn observations, the minimax optimal rate is:

inff^supfFEf^f2n2α/(2α+p)\inf_{\hat{f}} \sup_{f \in \mathcal{F}} \mathbb{E}\|\hat{f} - f\|^2 \asymp n^{-2\alpha/(2\alpha + p)}

where α\alpha is the smoothness of ff. The rate deteriorates exponentially with pp, rendering nonparametric methods useless unless nepn \gg e^p.


Sparse Estimation

Sparsity assumes that only sps \ll p of the pp parameters are nonzero. This is the key structural assumption enabling estimation in high dimensions.

DfSparsity Condition

The true parameter vector satisfies β00={j:β0,j0}=s\|\beta_0\|_0 = |\{j : \beta_{0,j} \neq 0\}| = s where sps \ll p. The 0\ell_0 "norm" counts nonzero entries (not a true norm).

LASSO (Least Absolute Shrinkage and Selection Operator)

LASSO Estimator

β^LASSO=argminβ{12nyXβ22+λβ1}\hat{\beta}_{\text{LASSO}} = \arg\min_{\beta} \left\{\frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|_2^2 + \lambda\|\beta\|_1\right\}

The 1\ell_1 penalty induces exact zeros in β^\hat{\beta}, performing automatic variable selection. The regularization path {β^(λ)}λ0\{\hat{\beta}(\lambda)\}_{\lambda \geq 0} is piecewise linear and can be computed exactly via the LARS algorithm.

ThLASSO Oracle Inequality

Under the irrepresentable condition (or restricted eigenvalue condition) with λσslog(p)/n\lambda \asymp \sigma\sqrt{s\log(p)/n}:

β^β022C1sσ2logpn\|\hat{\beta} - \beta_0\|_2^2 \leq C_1 \frac{s\sigma^2\log p}{n}

with probability at least 1c1pc21 - c_1p^{-c_2}. This matches the oracle rate up to the logp\log p factor.

SCAD (Smoothly Clipped Absolute Deviation)

SCAD Penalty

pλ(t)={λttλt2+2aλtλ22(a1)λ<taλ(a+1)λ22t>aλp_\lambda(|t|) = \begin{cases} \lambda|t| & |t| \leq \lambda \\ \frac{-t^2 + 2a\lambda|t| - \lambda^2}{2(a-1)} & \lambda < |t| \leq a\lambda \\ \frac{(a+1)\lambda^2}{2} & |t| > a\lambda \end{cases}

The SCAD penalty is quadratic near zero (like ridge) and constant for large coefficients (like hard thresholding). It satisfies three properties: unbiasedness (for large coefficients), sparsity, and continuity.

MCP (Minimax Concave Penalty)

MCP

pλ(t)={λtt22γtγλγλ22t>γλp_\lambda(|t|) = \begin{cases} \lambda|t| - \frac{t^2}{2\gamma} & |t| \leq \gamma\lambda \\ \frac{\gamma\lambda^2}{2} & |t| > \gamma\lambda \end{cases}

The MCP interpolates between LASSO (γ\gamma \to \infty) and hard thresholding (γ1\gamma \to 1). For γ>1\gamma > 1, MCP is an unbiased, sparse, and continuous penalty.

Oracle Properties

Both SCAD and MCP satisfy the oracle property: with appropriate λn\lambda_n,

β^j={0if β0,j=0(selection consistency)β^joracle+op(1)if β0,j0(asymptotic normality)\hat{\beta}_j = \begin{cases} 0 & \text{if } \beta_{0,j} = 0 \quad \text{(selection consistency)} \\ \hat{\beta}_j^{\text{oracle}} + o_p(1) & \text{if } \beta_{0,j} \neq 0 \quad \text{(asymptotic normality)} \end{cases}

where β^joracle\hat{\beta}_j^{\text{oracle}} is the estimator obtained by knowing the true support.


Compressed Sensing

DfCompressed Sensing

Recover a sparse signal xRp\mathbf{x} \in \mathbb{R}^p with x0=s\|\mathbf{x}\|_0 = s from mpm \ll p linear measurements y=Ax\mathbf{y} = \mathbf{A}\mathbf{x} where ARm×p\mathbf{A} \in \mathbb{R}^{m \times p} is a measurement matrix.

ThRestricted Isometry Property (RIP)

The matrix A\mathbf{A} satisfies the RIP of order kk with constant δk\delta_k if:

(1δk)x22Ax22(1+δk)x22(1 - \delta_k)\|\mathbf{x}\|_2^2 \leq \|\mathbf{A}\mathbf{x}\|_2^2 \leq (1 + \delta_k)\|\mathbf{x}\|_2^2

for all kk-sparse vectors x\mathbf{x}. If δ2s<21\delta_{2s} < \sqrt{2} - 1, then 1\ell_1-minimization recovers any ss-sparse x\mathbf{x} exactly from y=Ax\mathbf{y} = \mathbf{A}\mathbf{x}.

Random matrices with i.i.d. sub-Gaussian entries satisfy the RIP with high probability when mCslog(p/s)m \geq C\,s\log(p/s). This is the foundation of the compressed sensing revolution.


Random Projections

Johnson-Lindenstrauss Lemma

For any ϵ(0,1)\epsilon \in (0, 1) and nn points in Rp\mathbb{R}^p, there exists a linear map Π:RpRk\mathbf{\Pi}: \mathbb{R}^p \to \mathbb{R}^k with k=O(ϵ2logn)k = O(\epsilon^{-2}\log n) such that for all i,ji, j:

(1ϵ)xixj2ΠxiΠxj2(1+ϵ)xixj2(1 - \epsilon)\|\mathbf{x}_i - \mathbf{x}_j\|^2 \leq \|\mathbf{\Pi}\mathbf{x}_i - \mathbf{\Pi}\mathbf{x}_j\|^2 \leq (1 + \epsilon)\|\mathbf{x}_i - \mathbf{x}_j\|^2

The dimension reduction is independent of the original dimension pp—distances are approximately preserved after projecting to O(logn)O(\log n) dimensions.


Concentration of Measures

ThSub-Gaussian Concentration

If X1,,XnX_1, \dots, X_n are independent sub-Gaussian random variables with Xiψ2K\|X_i\|_{\psi_2} \leq K, then for aRn\mathbf{a} \in \mathbb{R}^n:

P(i=1naiXiE[i=1naiXi]t)2exp(ct2K2a22)\mathbb{P}\left(\left|\sum_{i=1}^n a_i X_i - \mathbb{E}\left[\sum_{i=1}^n a_i X_i\right]\right| \geq t\right) \leq 2\exp\left(-\frac{ct^2}{K^2\|\mathbf{a}\|_2^2}\right)

Implications for High-Dimensional Inference

Concentration inequalities provide the foundation for:

  • Talagrand's inequality: Controls the deviation of empirical processes
  • Rademacher complexity: Measures the complexity of function classes for uniform convergence
  • Bernstein-type bounds: Tighter than Hoeffding when individual contributions are bounded
  • Matrix concentration: Controls singular values of random matrices via matrix Bernstein

Double Descent

DfDouble Descent

In overparameterized models (p>np > n), the test error exhibits a non-monotonic behavior: it first decreases (underfitting regime), increases near pnp \approx n (interpolation threshold), then decreases again in the overparameterized regime. This "double descent" contradicts the classical bias–variance tradeoff.

ThDouble Descent for Linear Regression

For the minimum 2\ell_2-norm interpolant β^=X(XX)1y\hat{\beta} = \mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{y} when p>np > n, the excess risk decomposes as:

Eβ^β022=σ2(pn)n1variance+sσ2log(p/n)nbias from regularization\mathbb{E}\|\hat{\beta} - \beta_0\|_2^2 = \underbrace{\frac{\sigma^2(p - n)}{n - 1}}_{\text{variance}} + \underbrace{\frac{s\sigma^2\log(p/n)}{n}}_{\text{bias from regularization}}

The variance term dominates near p=np = n (explaining the peak), while for pnp \gg n, the variance decreases because the minimum-norm solution spreads mass across many directions.


Benign Overfitting

ThBenign Overfitting (Bartlett et al., 2020)

For minimum-norm interpolating linear regression, the excess risk is:

E[X(β^β0)2n]σ2n(tr(XX)1σ2σ2)effective noise+bias2\mathbb{E}\left[\frac{\|\mathbf{X}(\hat{\beta} - \beta_0)\|^2}{n}\right] \leq \underbrace{\frac{\sigma^2}{n}\left(\frac{\text{tr}(\mathbf{X}\mathbf{X}^\top)^{-1}\sigma^2}{\sigma^2}\right)}_{\text{effective noise}} + \text{bias}^2

The key condition for benign overfitting is that the data covariance XX/n\mathbf{X}^\top\mathbf{X}/n must have a small effective rank—most eigenvalues must be concentrated near zero, with a few large ones.

When Overfitting is Harmful

Benign overfitting requires:

  1. The signal must align with the top eigenvectors of the covariance (weak dependence on noise directions)
  2. The effective rank reff=tr(Σ)/Σopr_{\text{eff}} = \text{tr}(\Sigma)/\|\Sigma\|_{\text{op}} must be small relative to nn
  3. The noise must be sub-Gaussian

In general settings (e.g., heterogeneous signals, heavy-tailed noise), overfitting can still be harmful.


Python Implementation

import numpy as np
import pandas as pd
from scipy.optimize import minimize
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Generate high-dimensional data ---
n, p, s = 100, 500, 10
X = np.random.randn(n, p)
beta_true = np.zeros(p)
support = np.random.choice(p, s, replace=False)
beta_true[support] = np.random.randn(s) * 3
y = X @ beta_true + np.random.randn(n) * 0.5

print(f"Setup: n={n}, p={p}, s={s}")
print(f"True support: {sorted(support)}")

# --- LASSO with cross-validation ---
lasso_cv = LassoCV(cv=5, random_state=42).fit(X, y)
beta_lasso = lasso_cv.coef_

n_selected = np.sum(beta_lasso != 0)
true_pos = len(set(np.where(beta_lasso != 0)[0]) & set(support))
false_pos = n_selected - true_pos
false_neg = s - true_pos

print(f"\nLASSO Results (λ={lasso_cv.alpha_:.4f}):")
print(f"  Selected: {n_selected} variables")
print(f"  True positives: {true_pos}/{s}")
print(f"  False positives: {false_pos}")
print(f"  False negatives: {false_neg}")
print(f"  MSE: {mean_squared_error(y, X @ beta_lasso):.4f}")

# --- SCAD penalty (proximal gradient) ---
def scad_penalty(t, lam, a=3.7):
    t_abs = np.abs(t)
    penalty = np.where(t_abs <= lam, lam * t_abs,
                       np.where(t_abs <= a * lam,
                               (-t_abs**2 + 2*a*lam*t_abs - lam**2) / (2*(a-1)),
                               (a+1)*lam**2 / 2))
    return penalty

def scad_prox(v, lam, a=3.7):
    """Proximal operator for SCAD."""
    result = np.zeros_like(v)
    for i in range(len(v)):
        t = v[i]
        t_abs = np.abs(t)
        if t_abs <= 2 * lam:
            result[i] = np.sign(t) * max(t_abs - lam, 0)
        elif t_abs <= a * lam:
            result[i] = (t - np.sign(t) * a * lam / (a - 1)) / (1 - 1/(a-1))
            result[i] = np.sign(t) * max(np.abs(result[i]), 0)
        else:
            result[i] = t
    return result

# Proximal gradient descent for SCAD
def fit_scad(X, y, lam, a=3.7, lr=0.01, max_iter=1000):
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.norm(X.T @ X) / n

    for iteration in range(max_iter):
        grad = -X.T @ (y - X @ beta) / n
        beta_new = scad_prox(beta - lr * grad, lam * lr, a)
        if np.linalg.norm(beta_new - beta) < 1e-8:
            break
        beta = beta_new
    return beta

beta_scad = fit_scad(X, y, lam=0.1)
n_scad = np.sum(beta_scad != 0)
true_pos_scad = len(set(np.where(beta_scad != 0)[0]) & set(support))
print(f"\nSCAD Results:")
print(f"  Selected: {n_scad} variables")
print(f"  True positives: {true_pos_scad}/{s}")

# --- MCP penalty ---
def mcp_penalty(t, lam, gamma=3.0):
    t_abs = np.abs(t)
    return np.where(t_abs <= gamma * lam,
                    lam * t_abs - t_abs**2 / (2 * gamma),
                    gamma * lam**2 / 2)

def mcp_prox(v, lam, gamma=3.0):
    result = np.zeros_like(v)
    for i in range(len(v)):
        t_abs = np.abs(v[i])
        if t_abs <= lam:
            result[i] = np.sign(v[i]) * max(t_abs - lam, 0)
        elif t_abs <= gamma * lam:
            result[i] = v[i] * (gamma - 1) / gamma - np.sign(v[i]) * lam / gamma
            result[i] = np.sign(v[i]) * max(np.abs(result[i]), 0)
        else:
            result[i] = v[i]
    return result

def fit_mcp(X, y, lam, gamma=3.0, lr=0.01, max_iter=1000):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta) / n
        beta_new = mcp_prox(beta - lr * grad, lam * lr, gamma)
        if np.linalg.norm(beta_new - beta) < 1e-8:
            break
        beta = beta_new
    return beta

beta_mcp = fit_mcp(X, y, lam=0.1)
n_mcp = np.sum(beta_mcp != 0)
true_pos_mcp = len(set(np.where(beta_mcp != 0)[0]) & set(support))
print(f"\nMCP Results:")
print(f"  Selected: {n_mcp} variables")
print(f"  True positives: {true_pos_mcp}/{s}")

# --- Double descent demonstration ---
ratios = np.linspace(0.2, 3.0, 50)
test_errors = []
train_errors = []

for ratio in ratios:
    n_i = int(ratio * p) if ratio * p < n * 3 else int(n * 3)
    X_sub = X[:n_i, :n_i] if n_i <= p else X[:min(n_i, n), :p]
    y_sub = y[:X_sub.shape[0]]

    if X_sub.shape[0] >= X_sub.shape[1]:
        try:
            beta_hat = np.linalg.lstsq(X_sub, y_sub, rcond=None)[0]
        except:
            beta_hat = np.zeros(X_sub.shape[1])
    else:
        # Minimum norm solution
        beta_hat = X_sub.T @ np.linalg.solve(X_sub @ X_sub.T + 1e-10*np.eye(X_sub.shape[0]), y_sub)

    y_pred = X[:X_sub.shape[0], :X_sub.shape[1]] @ beta_hat
    test_errors.append(mean_squared_error(y[:len(y_pred)], y_pred))
    train_errors.append(np.mean((X_sub @ beta_hat - y_sub)**2))

plt.figure(figsize=(8, 5))
plt.plot(ratios, test_errors, 'b-', label='Test Error', lw=2)
plt.plot(ratios, train_errors, 'r--', label='Train Error', lw=2)
plt.axvline(x=1.0, color='gray', ls=':', label='p/n = 1')
plt.xlabel('Ratio p/n')
plt.ylabel('Mean Squared Error')
plt.title('Double Descent Phenomenon')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("double_descent.png", dpi=150, bbox_inches="tight")
plt.show()

print("\nDouble descent: test error peaks near p/n = 1, then decreases in overparameterized regime")

Summary

Key Takeaways: High-Dimensional Statistics

  1. The curse of dimensionality makes nonparametric methods rate-infeasible when pp is large; sparsity (only sps \ll p nonzero coefficients) is the key structural assumption.
  2. LASSO (1\ell_1 penalty) achieves near-oracle rates with O(logp)O(\log p) penalty; SCAD and MCP achieve the oracle property (unbiased + sparse) via nonconvex penalties.
  3. Compressed sensing recovers sparse signals from O(slog(p/s))O(s\log(p/s)) measurements via RIP; random projections reduce dimension to O(logn)O(\log n) while preserving distances.
  4. Concentration of measures provides the probabilistic machinery for high-dimensional guarantees; sub-Gaussian tails are essential.
  5. Double descent and benign overfitting reveal that overparameterized models can generalize well when the signal aligns with top eigenvectors and the effective rank is small.

Premium Content

High-Dimensional Statistics

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement