Cross-Validation in Statistics
Estimating How Well Models Generalize to New Data
Cross-validation partitions data into training and validation sets, repeatedly fitting and evaluating models to estimate out-of-sample performance. It helps detect overfitting by testing models on data they have not seen during training.
- Model Selection: Choose between competing models with honest performance estimates
- Hyperparameter Tuning: Find optimal settings without overfitting to validation data
- Clinical Prediction: Validate risk scores on held-out patient populations
Cross-validation is the closest thing to a crystal ball for predicting model performance.
Cross-validation (CV) estimates how well a model generalizes to unseen data by training and testing on different subsets of the available data.
Definition: Cross-Validation
A resampling method that partitions data into training and validation sets, fits the model on the training set, and evaluates it on the validation set. This process is repeated multiple times.
Why Cross-Validation?
The Overfitting Problem
Models can perform excellently on training data but poorly on new data. Cross-validation provides an honest estimate of predictive performance by simulating how the model would perform on unseen data.
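The gap between training error and cross-validated error can be illustrated with a deliberately over-flexible model. The sketch below is illustrative only: the synthetic sine data, the polynomial degree of 12, and all variable names are assumptions chosen to make the effect visible, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small noisy dataset (assumed for illustration)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=30)).reshape(-1, 1)
y = np.sin(3 * x).ravel() + rng.normal(scale=0.3, size=30)

# A degree-12 polynomial is far more flexible than 30 points can support
model = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())

# Training error: fit and evaluate on the same data
model.fit(x, y)
train_mse = mean_squared_error(y, model.predict(x))

# Cross-validated error: every prediction is made on held-out data
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = -cross_val_score(model, x, y, cv=cv,
                          scoring="neg_mean_squared_error").mean()

print(f"Training MSE: {train_mse:.3f}")
print(f"5-fold CV MSE: {cv_mse:.3f}")
```

In a setup like this the training MSE is typically much smaller than the cross-validated MSE, which is the symptom of overfitting that CV is meant to expose.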
K-Fold Cross-Validation
The most common CV method. Data is split into k roughly equal folds.
K-Fold CV Estimate
$$\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i$$
Here,
- $k$ = Number of folds (typically 5 or 10)
- $\mathrm{MSE}_i$ = Mean squared error on fold $i$
Steps
| Step | Action |
|------|--------|
| 1 | Randomly partition data into k folds |
| 2 | For each fold i: train on k-1 folds, test on fold i |
| 3 | Compute the error metric for each fold |
| 4 | Average the k error estimates |
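The four steps above can be sketched by hand with NumPy. This is a minimal illustration, assuming ordinary least squares as the model and MSE as the error metric; the helper name `k_fold_mse` and the synthetic data are hypothetical.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Average held-out MSE of ordinary least squares over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # step 1: random partition
    folds = np.array_split(indices, k)     # k roughly equal folds
    fold_mse = []
    for i in range(k):                     # step 2: hold out fold i
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit least squares (with intercept) on the k-1 training folds
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Step 3: error metric on the held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        residuals = y[test_idx] - A_test @ coef
        fold_mse.append(float(np.mean(residuals ** 2)))
    # Step 4: average the k fold errors
    return float(np.mean(fold_mse)), fold_mse

# Synthetic data (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=100)

cv_mse, per_fold = k_fold_mse(X_demo, y_demo, k=5)
print(f"5-fold CV MSE: {cv_mse:.3f}")
```

In practice a library routine such as scikit-learn's `cross_val_score` does the same bookkeeping; the manual version is only meant to make each step explicit.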
Common Values of k
| k | Name | Bias | Variance | Cost |
|---|------|------|----------|------|
| n | Leave-One-Out (LOO) | Low | High | Expensive (n fits) |
| 5 | 5-Fold | Moderate | Moderate | Moderate |
| 10 | 10-Fold | Low to moderate | Moderate | Higher than 5-fold |
| N/A | Holdout (single split, not a k-fold scheme) | High | High (depends on the one split) | Cheap |
k = 10 is Standard
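The literature generally recommends k = 10 (or k = 5 when model fitting is expensive) as a good balance between bias and variance. For small datasets (roughly n < 100), LOO-CV is a common alternative because it leaves almost all of the data available for training.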
Leave-One-Out Cross-Validation
Each observation serves as the test set exactly once.
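LOO-CV
$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_{(-i)} \right)^2$$
Here,
- $\hat{y}_{(-i)}$ = Prediction for observation $i$ from the model trained on all data except $i$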
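LOO Properties
- Approximately unbiased estimate of test error (each model is trained on n - 1 observations)
- Can have high variance (the n training sets overlap almost entirely, so the individual errors are highly correlated)
- Computationally expensive (n model fits)
- For linear models, LOO can be computed analytically from a single fit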
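For ordinary least squares, the analytic shortcut divides each full-data residual by one minus its leverage $h_{ii}$ (the $i$-th diagonal entry of the hat matrix), so the LOO mean squared error is $\frac{1}{n}\sum_i \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2$. The sketch below checks this against brute-force refitting on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([3.0, -2.0]) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])   # design matrix with intercept

# Shortcut: one fit on all data, then rescale residuals by leverage
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef
H = A @ np.linalg.solve(A.T @ A, A.T)  # hat matrix
leverage = np.diag(H)
loo_shortcut = np.mean((residuals / (1.0 - leverage)) ** 2)

# Brute force: n separate fits, each leaving one observation out
errors = []
for i in range(n):
    keep = np.arange(n) != i
    coef_i, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
    errors.append((y[i] - A[i] @ coef_i) ** 2)
loo_brute = np.mean(errors)

print(f"shortcut: {loo_shortcut:.6f}, brute force: {loo_brute:.6f}")
```

The two numbers agree up to floating-point error, while the shortcut needs only one model fit instead of n.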
Stratified Cross-Validation
Ensures each fold has approximately the same class proportions as the full dataset.
Class Imbalance
With imbalanced classes, random splits may produce folds with few or no minority-class samples. Stratified CV keeps the class proportions in each fold close to those of the full dataset.
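A small comparison of scikit-learn's `KFold` and `StratifiedKFold` makes the difference concrete. The 90/10 label split below is a hypothetical example, and the placeholder feature matrix is only there because the splitters expect one.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))   # features are irrelevant for the split itself

plain = KFold(n_splits=5, shuffle=True, random_state=0)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count minority-class samples that land in each test fold
plain_counts = [int(y[test].sum()) for _, test in plain.split(X)]
strat_counts = [int(y[test].sum()) for _, test in strat.split(X, y)]

print("positives per test fold, KFold:          ", plain_counts)
print("positives per test fold, StratifiedKFold:", strat_counts)
```

With plain `KFold` the number of positives per fold varies with the random shuffle, whereas `StratifiedKFold` spreads the 10 positives evenly across the 5 folds.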
Repeated Cross-Validation
Repeat k-fold CV multiple times with different random partitions to reduce variance.
| Repetition | Description |
|-----------|-------------|
| 1 × 10-CV | Standard 10-fold |
| 5 × 2-CV | 5 repetitions of 2-fold |
| 10 × 10-CV | 10 repetitions of 10-fold |
Nested Cross-Validation
For simultaneous model selection and performance estimation.
Definition: Nested CV
- Outer loop: Estimates generalization error
- Inner loop: Selects the best model (tunes hyperparameters)
| Loop | Purpose |
|------|---------|
| Outer | Test on held-out fold -> unbiased performance estimate |
| Inner | Tune hyperparameters on training fold -> model selection |
Why Nested CV?
Without nesting, the performance estimate is optimistically biased because the same data is used for both tuning and evaluation.
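One compact way to express nesting in scikit-learn is to pass a `GridSearchCV` object (the inner loop) to `cross_val_score` (the outer loop). This is a sketch under assumed settings: the synthetic data, the Ridge model, the alpha grid, and the fold counts are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

# Inner loop: pick alpha using only the data it is given
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0, 100.0]},
    cv=inner_cv,
    scoring="neg_mean_squared_error",
)

# Outer loop: the search is re-run from scratch on each outer training set,
# so the outer test fold never influences the choice of alpha
nested_scores = cross_val_score(search, X, y, cv=outer_cv,
                                scoring="neg_mean_squared_error")
nested_mse = -nested_scores.mean()
print(f"Nested CV MSE: {nested_mse:.2f}")
```

Because the tuning happens inside each outer training set, the reported error describes the whole procedure (tuning plus fitting) rather than one fixed hyperparameter value.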
Python Implementation
import numpy as np
import pandas as pd
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     StratifiedKFold, RepeatedKFold, cross_val_predict)
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
np.random.seed(42)
# Generate data
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
# 10-Fold CV
kf = KFold(n_splits=10, shuffle=True, random_state=42)
lr = LinearRegression()
scores_10fold = cross_val_score(lr, X, y, cv=kf, scoring='neg_mean_squared_error')
print(f"10-Fold CV MSE: {-scores_10fold.mean():.2f} (+/- {scores_10fold.std():.2f})")
# LOO-CV
loo = LeaveOneOut()
scores_loo = cross_val_score(lr, X, y, cv=loo, scoring='neg_mean_squared_error')
print(f"LOO-CV MSE: {-scores_loo.mean():.2f}")
# Repeated 5-Fold CV
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores_repeated = cross_val_score(lr, X, y, cv=rkf, scoring='neg_mean_squared_error')
print(f"Repeated 5-Fold MSE: {-scores_repeated.mean():.2f} (+/- {scores_repeated.std():.2f})")
# Nested CV for model selection
alphas = [0.1, 1.0, 10.0, 100.0]
outer_scores = []
selected_alphas = []  # alpha chosen by the inner loop in each outer fold
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Inner loop: select alpha using the outer training data only
    best_alpha = None
    best_score = -np.inf
    for alpha in alphas:
        ridge = Ridge(alpha=alpha)
        inner_scores = cross_val_score(ridge, X_train, y_train, cv=3, scoring='neg_mean_squared_error')
        if inner_scores.mean() > best_score:
            best_score = inner_scores.mean()
            best_alpha = alpha
    selected_alphas.append(best_alpha)
    # Outer loop: refit with the selected alpha and evaluate on the held-out fold
    ridge_best = Ridge(alpha=best_alpha)
    ridge_best.fit(X_train, y_train)
    outer_scores.append(mean_squared_error(y_test, ridge_best.predict(X_test)))
print(f"\nNested CV MSE: {np.mean(outer_scores):.2f} (+/- {np.std(outer_scores):.2f})")
print(f"Selected alpha per outer fold: {selected_alphas}")
Worked Example
Example: Comparing Regression Models
Evaluating Linear, Ridge, and Lasso regression using 10-fold CV:
| Model | CV MSE | Std |
|-------|--------|-----|
| Linear Regression | 102.5 | 15.3 |
| Ridge (α=1) | 98.2 | 14.1 |
| Lasso (α=0.1) | 96.8 | 13.7 |
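Ridge and Lasso have lower mean CV MSE than plain linear regression, although the differences are small relative to the fold-to-fold standard deviations. Lasso achieves the lowest MSE with the added benefit of feature selection (it can shrink some coefficients to exactly zero).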
Key Takeaways
Summary: Cross-Validation
- CV provides a far more honest estimate of generalization error than training error does
- 10-fold CV is the standard choice; LOO is a common alternative for small datasets
- Stratified CV keeps class proportions similar in each fold
- Nested CV is needed for simultaneous model selection and evaluation
- Repeated CV reduces variance from a single random partition
- Always shuffle the data before partitioning (except for time series)
- For time series, use forward chaining (expanding or rolling window) CV so that training data always precedes test data
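The forward-chaining idea can be sketched with scikit-learn's `TimeSeriesSplit`, which grows the training window and always tests on later observations. The 12-point series below is a placeholder assumed for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder series of 12 time-ordered observations
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in splits:
    # Every training index comes before every test index
    print(f"train={train} test={test}")
```

No shuffling takes place here, so the model is never evaluated on observations that precede its training data.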
Related Topics
- See AIC and BIC for information criteria model selection
- See Bootstrap Methods for another resampling approach
- See ROC and AUC for classification model evaluation