Cross-Validation in Statistics

Estimating How Well Models Generalize to New Data

Cross-validation partitions data into training and validation sets, repeatedly fitting and evaluating models to estimate out-of-sample performance. It helps detect overfitting by testing models on data they did not see during training.

Model Selection — Choose between competing models with honest performance estimates
Hyperparameter Tuning — Find optimal settings without overfitting to validation data
Clinical Prediction — Validate risk scores on held-out patient populations

Cross-validation is the closest thing to a crystal ball for predicting model performance.

Cross-validation (CV) estimates how well a model generalizes to unseen data by training and testing on different subsets of the available data.

Definition: Cross-Validation

A resampling method that partitions data into training and validation sets, fits the model on the training set, and evaluates it on the validation set. This process is repeated multiple times.

Why Cross-Validation?

The Overfitting Problem

Models can perform excellently on training data but poorly on new data. Cross-validation provides an honest estimate of predictive performance by simulating how the model would perform on unseen data.

K-Fold Cross-Validation

The most common CV method. Data is split into k roughly equal folds.

K-Fold CV Estimate

$$\text{CV}(k) = \frac{1}{k}\sum_{i=1}^{k}\text{MSE}_i$$

Here,

$k$ = Number of folds (typically 5 or 10)
$\text{MSE}_i$ = Mean squared error on fold $i$

Steps

| Step | Action |
|------|--------|
| 1 | Randomly partition data into k folds |
| 2 | For each fold i: train on k-1 folds, test on fold i |
| 3 | Compute the error metric for each fold |
| 4 | Average the k error estimates |
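
The four steps above can be sketched by hand in a few lines of NumPy. This is a minimal illustration, assuming ordinary least squares as the model and synthetic data; the function name `k_fold_mse` and the demo arrays are hypothetical and not part of any library.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Return the k-fold CV estimate (mean of per-fold MSEs) and the per-fold MSEs."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))       # Step 1: random partition
    folds = np.array_split(indices, k)      # k roughly equal folds
    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Step 2: fit least squares (with intercept) on the other k-1 folds
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Step 3: mean squared error on the held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        fold_mse.append(float(np.mean((y[test_idx] - A_test @ beta) ** 2)))
    # Step 4: average the k error estimates
    return float(np.mean(fold_mse)), fold_mse

# Synthetic demo data (assumed for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=100)

cv_estimate, per_fold = k_fold_mse(X_demo, y_demo, k=5)
print(f"5-fold CV MSE: {cv_estimate:.3f}")
```

The returned average corresponds to $\text{CV}(k)$ in the formula above, with each entry of `per_fold` playing the role of $\text{MSE}_i$.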

Common Values of k

k = 10 is Standard

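The literature generally recommends k = 10 as a good balance between bias and variance: a larger k makes each training set closer to the full dataset (lower bias) but increases computational cost and can increase the variance of the estimate. For small datasets (n < 100), consider LOO-CV.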

Leave-One-Out Cross-Validation

Each observation serves as the test set exactly once.

LOO-CV

$$\text{LOO-CV} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_{-i})^2$$

Here,

$\hat{y}_{-i}$ = Prediction for observation $i$ from the model trained on all data except observation $i$

LOO Properties

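Approximately unbiased estimate of test error (each model is trained on n - 1 observations)
Can have high variance (the n training sets are nearly identical, so the per-observation errors are highly correlated)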
Computationally expensive (n model fits)
For linear models, LOO can be computed analytically
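
The last property can be checked numerically. For ordinary least squares, the leave-one-out residual equals the ordinary residual divided by $1 - h_{ii}$, where $h_{ii}$ is the $i$-th diagonal entry of the hat matrix $H = A(A^\top A)^{-1}A^\top$. The sketch below, on assumed synthetic data, compares this single-fit shortcut against refitting the model n times.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)
A = np.column_stack([np.ones(n), X])  # design matrix with intercept column

# Shortcut: one fit on all data, then rescale residuals by the leverages
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta
H = A @ np.linalg.solve(A.T @ A, A.T)   # hat matrix
leverage = np.diag(H)
loo_shortcut = float(np.mean((residuals / (1.0 - leverage)) ** 2))

# Brute force: refit n times, each time leaving one observation out
squared_errors = []
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
    squared_errors.append((y[i] - A[i] @ beta_i) ** 2)
loo_brute = float(np.mean(squared_errors))

print(f"LOO-CV (shortcut):    {loo_shortcut:.6f}")
print(f"LOO-CV (brute force): {loo_brute:.6f}")
```

The two values should agree up to floating-point error. The identity holds for least squares fits; other models generally require the brute-force loop.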

Stratified Cross-Validation

Ensures each fold has approximately the same class proportions as the full dataset.

Class Imbalance

With imbalanced classes, random splits may produce folds with few or no minority class samples. Stratified CV keeps each fold's class proportions as close as possible to those of the full dataset.
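
A small sketch of the difference, assuming a hypothetical label vector with 10% positives, counts the minority samples that land in each test fold under plain and stratified splitting with scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not affect how the folds are formed

def minority_counts(splitter):
    """Count minority-class samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain_counts = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("Minority samples per fold (KFold):          ", plain_counts)
print("Minority samples per fold (StratifiedKFold):", strat_counts)
```

With 10 positives and 5 folds, the stratified splitter places exactly 2 positives in each fold, while the plain splitter's counts depend on the shuffle.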

Repeated Cross-Validation

Repeat k-fold CV multiple times with different random partitions to reduce variance.

| Repetition | Description |
|-----------|-------------|
| 1 × 10-CV | Standard 10-fold |
| 5 × 2-CV | 5 repetitions of 2-fold |
| 10 × 10-CV | 10 repetitions of 10-fold |
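
To see why repetition helps, the following sketch (on assumed synthetic data) runs a single 10-fold CV under ten different random partitions and compares the spread of those estimates with one 10 × 10 repeated estimate.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
model = LinearRegression()

# One 10-fold estimate per random partition
single_estimates = []
for seed in range(10):
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    single_estimates.append(float(-scores.mean()))

# 10 x 10 repeated CV: averages over 100 fold scores
rkf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
repeated_scores = cross_val_score(model, X, y, cv=rkf, scoring="neg_mean_squared_error")
repeated_estimate = float(-repeated_scores.mean())

print(f"Single 10-fold estimates range: {min(single_estimates):.2f} to {max(single_estimates):.2f}")
print(f"Repeated 10 x 10 estimate:      {repeated_estimate:.2f}")
```

The single-partition estimates differ from one seed to the next; averaging over repetitions smooths out that partition-to-partition noise, at ten times the computational cost.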

Nested Cross-Validation

For simultaneous model selection and performance estimation.

Definition: Nested CV

Outer loop: Estimates generalization error
Inner loop: Selects the best model (tunes hyperparameters)

| Loop | Purpose |
|------|---------|
| Outer | Test on held-out fold -> unbiased performance estimate |
| Inner | Tune hyperparameters on training fold -> model selection |

Why Nested CV?

Without nesting, the performance estimate is optimistically biased because the same data is used for both tuning and evaluation.
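
One way to make this concrete is to compare the best tuning score from a single CV loop with a nested estimate. The sketch below assumes synthetic data and a Ridge model, and uses scikit-learn's `GridSearchCV` as the inner loop wrapped in `cross_val_score` as the outer loop; the grid of alpha values is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0, 100.0]},
    cv=inner_cv,
    scoring="neg_mean_squared_error",
)

# Non-nested: the same CV scores are used to pick alpha and to report error
search.fit(X, y)
non_nested_mse = -search.best_score_

# Nested: alpha is re-tuned inside each outer training fold,
# then the tuned model is scored on the untouched outer test fold
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="neg_mean_squared_error")
nested_mse = -nested_scores.mean()

print(f"Non-nested CV MSE: {non_nested_mse:.2f}")
print(f"Nested CV MSE:     {nested_mse:.2f}")
```

On an easy problem like this one the gap can be small, and on any single dataset it may even reverse by chance. The optimistic bias tends to grow with the number of hyperparameter settings tried and shrink with sample size.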

Python Implementation


```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, LeaveOneOut, RepeatedKFold, cross_val_score

# Generate synthetic regression data
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# 10-Fold CV (scikit-learn reports negated MSE, so flip the sign)
kf = KFold(n_splits=10, shuffle=True, random_state=42)
lr = LinearRegression()
scores_10fold = cross_val_score(lr, X, y, cv=kf, scoring='neg_mean_squared_error')
print(f"10-Fold CV MSE: {-scores_10fold.mean():.2f} (+/- {scores_10fold.std():.2f})")

# LOO-CV: n model fits, each tested on a single observation
loo = LeaveOneOut()
scores_loo = cross_val_score(lr, X, y, cv=loo, scoring='neg_mean_squared_error')
print(f"LOO-CV MSE: {-scores_loo.mean():.2f}")

# Repeated 5-Fold CV: 10 repetitions with different random partitions
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores_repeated = cross_val_score(lr, X, y, cv=rkf, scoring='neg_mean_squared_error')
print(f"Repeated 5-Fold MSE: {-scores_repeated.mean():.2f} (+/- {scores_repeated.std():.2f})")

# Nested CV for model selection
alphas = [0.1, 1.0, 10.0, 100.0]
outer_scores = []
selected_alphas = []  # alpha chosen in each outer fold (may differ between folds)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in outer_cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Inner loop: select alpha using only the outer training fold
    best_alpha = None
    best_score = -np.inf
    for alpha in alphas:
        ridge = Ridge(alpha=alpha)
        inner_scores = cross_val_score(ridge, X_train, y_train, cv=3,
                                       scoring='neg_mean_squared_error')
        if inner_scores.mean() > best_score:
            best_score = inner_scores.mean()
            best_alpha = alpha

    # Outer loop: refit with the selected alpha, evaluate on the held-out fold
    ridge_best = Ridge(alpha=best_alpha)
    ridge_best.fit(X_train, y_train)
    outer_scores.append(mean_squared_error(y_test, ridge_best.predict(X_test)))
    selected_alphas.append(best_alpha)

print(f"\nNested CV MSE: {np.mean(outer_scores):.2f} (+/- {np.std(outer_scores):.2f})")
print(f"Selected alpha per outer fold: {selected_alphas}")
```

Worked Example

Example: Comparing Regression Models

Evaluating Linear, Ridge, and Lasso regression using 10-fold CV:

| Model | CV MSE | Std |
|-------|--------|-----|
| Linear Regression | 102.5 | 15.3 |
| Ridge ($\alpha = 1$) | 98.2 | 14.1 |
| Lasso ($\alpha = 0.1$) | 96.8 | 13.7 |

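Ridge and Lasso achieve a lower mean CV MSE than plain linear regression, although the differences (under 6 units) are well within one standard deviation across folds (roughly 14 to 15), so the ranking is suggestive rather than conclusive. Lasso has the lowest mean MSE, with the added benefit of feature selection (it can set some coefficients exactly to zero).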
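
A comparison of this kind could be run as follows. This is a sketch on assumed synthetic data, so it will not reproduce the numbers in the table; the dataset parameters and the `results` dictionary are illustrative choices, not the setup behind the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a few informative features (assumed for illustration)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

# Use the same folds for every model so the comparison is paired
cv = KFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge (alpha=1)": Ridge(alpha=1.0),
    "Lasso (alpha=0.1)": Lasso(alpha=0.1),
}

results = {}
for name, model in models.items():
    fold_mse = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    results[name] = (float(fold_mse.mean()), float(fold_mse.std()))
    print(f"{name:<20} CV MSE: {fold_mse.mean():8.2f}  Std: {fold_mse.std():6.2f}")
```

Reusing one `KFold` object with a fixed `random_state` means every model is scored on identical splits, which removes partition noise from the comparison.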

Key Takeaways

Summary: Cross-Validation

CV provides an approximately unbiased estimate of generalization error
10-fold CV is the standard choice; use LOO for small datasets
Stratified CV preserves the overall class proportions in each fold
Nested CV is needed for simultaneous model selection and evaluation
Repeated CV reduces variance from a single random partition
Always shuffle the data before partitioning (except for time series)
For time series, use forward chaining (rolling window) CV
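
For the time series case, scikit-learn's `TimeSeriesSplit` is one available implementation of forward chaining. The sketch below uses a hypothetical series of 12 ordered observations and prints which indices fall in each training and test window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical ordered series of 12 observations
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every training window ends before its test window begins, so the model is never evaluated on observations that precede its training data. By default the training window expands from fold to fold; passing `max_train_size` caps its length, which gives a fixed-size rolling window instead.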
