
Cross-Validation in Statistics


Estimating How Well Models Generalize to New Data

Cross-validation partitions data into training and validation sets, repeatedly fitting and evaluating models to estimate out-of-sample performance. It does not prevent overfitting by itself, but it exposes it by testing models on data they haven't seen during training.

  • Model Selection: Choose between competing models with honest performance estimates

  • Hyperparameter Tuning: Find good settings without overfitting to the validation data

  • Clinical Prediction: Validate risk scores on held-out patient populations

Cross-validation is the closest thing to a crystal ball for predicting model performance.


Cross-validation (CV) estimates how well a model generalizes to unseen data by training and testing on different subsets of the available data.

Definition: Cross-Validation

A resampling method that partitions data into training and validation sets, fits the model on the training set, and evaluates it on the validation set. This process is repeated multiple times.


Why Cross-Validation?

The Overfitting Problem

Models can perform excellently on training data but poorly on new data. Cross-validation provides an honest estimate of predictive performance by simulating how the model would perform on unseen data.
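
The gap between training error and cross-validated error can be seen in a small sketch. It assumes scikit-learn is available and uses a synthetic dataset and an unpruned decision tree, both chosen here only for illustration because such a tree can memorize its training data.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption, not from the lesson)
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# A fully grown tree can reproduce every training target exactly
tree = DecisionTreeRegressor(random_state=0)
train_mse = mean_squared_error(y, tree.fit(X, y).predict(X))

# 5-fold CV evaluates the same model class on data it was not fitted to
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(tree, X, y, cv=cv, scoring="neg_mean_squared_error")
cv_mse = -cv_scores.mean()

print(f"Training MSE: {train_mse:.2f}")
print(f"5-fold CV MSE: {cv_mse:.2f}")
```

The training MSE is essentially zero while the CV MSE is not, which is the overfitting signal that training error alone would hide.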


K-Fold Cross-Validation

The most common CV method. Data is split into k roughly equal folds.

K-Fold CV Estimate

$$\text{CV}(k) = \frac{1}{k}\sum_{i=1}^{k}\text{MSE}_i$$

Here,

  • $k$ = number of folds (typically 5 or 10)
  • $\text{MSE}_i$ = mean squared error on fold $i$

Steps

| Step | Action |
|------|--------|
| 1 | Randomly partition data into k folds |
| 2 | For each fold i: train on k-1 folds, test on fold i |
| 3 | Compute the error metric for each fold |
| 4 | Average the k error estimates |
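
The four steps can be written out by hand. This is a minimal sketch using only NumPy and ordinary least squares; the function name `k_fold_mse` and the synthetic data are hypothetical, introduced here for illustration.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Return the k-fold CV estimate and the per-fold MSEs for OLS."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly partition the row indices into k roughly equal folds
    folds = np.array_split(rng.permutation(len(y)), k)
    fold_mse = []
    for i in range(k):
        # Step 2: train on the other k-1 folds, test on fold i
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Step 3: compute the error metric (MSE) on the held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        fold_mse.append(float(np.mean((y[test_idx] - A_test @ beta) ** 2)))
    # Step 4: average the k error estimates
    return float(np.mean(fold_mse)), fold_mse

# Synthetic linear data with unit-variance noise (illustrative assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=100)

cv_estimate, per_fold = k_fold_mse(X, y, k=5)
print(f"5-fold CV MSE: {cv_estimate:.3f}")
```

In practice a library routine such as scikit-learn's `cross_val_score` does the same bookkeeping; the hand-written loop is only meant to make each step visible.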


Common Values of k

| k | Name | Bias | Variance | Cost |
|---|------|------|----------|------|
| n | Leave-One-Out (LOO) | Low | High | Expensive (n fits) |
| 5 | 5-Fold | Moderate | Moderate | Moderate |
| 10 | 10-Fold | Lower than 5-fold | Somewhat higher than 5-fold | Higher |
| n/a | Holdout (single split) | High | High (depends on the one split) | Cheap |

k = 10 is Standard

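The literature generally recommends k = 5 or k = 10 as a good balance between bias and variance, with k = 10 the most common default. For small datasets (roughly n < 100), LOO-CV is a common choice because it leaves almost all of the data available for training.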


Leave-One-Out Cross-Validation

Each observation serves as the test set exactly once.

LOO-CV

$$\text{LOO-CV} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_{-i})^2$$

Here,

  • $\hat{y}_{-i}$ = prediction for observation $i$ from the model trained on all data except $i$

LOO Properties

  • Approximately unbiased estimate of test error (each model trains on n - 1 observations)

  • High variance (the n training sets overlap almost entirely, so the n errors are highly correlated)

  • Computationally expensive (n model fits)

  • For linear models, LOO can be computed analytically
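
For ordinary least squares, the analytic shortcut uses the leverages $h_{ii}$, the diagonal of the hat matrix $H = X(X^\top X)^{-1}X^\top$, together with the residuals from a single fit on all the data:

$$\text{LOO-CV} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2$$

The sketch below checks this identity against the brute-force n-fit loop. It uses only NumPy, and the synthetic data are an assumption made for illustration.

```python
import numpy as np

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([1.5, -2.0, 0.7]) + rng.normal(size=n)
A = np.column_stack([np.ones(n), X])  # design matrix with intercept column

# Shortcut: one fit on all data, then rescale residuals by 1 - leverage
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta
H = A @ np.linalg.solve(A.T @ A, A.T)  # hat matrix
leverage = np.diag(H)
loo_shortcut = np.mean((residuals / (1.0 - leverage)) ** 2)

# Brute force: refit n times, each time leaving out one observation
errors = []
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
    errors.append((y[i] - A[i] @ b) ** 2)
loo_brute = np.mean(errors)

print(f"Shortcut: {loo_shortcut:.6f}  Brute force: {loo_brute:.6f}")
```

The two numbers agree up to floating-point error, so for OLS the n refits can be replaced by a single fit.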


Stratified Cross-Validation

Ensures each fold has approximately the same class proportions as the full dataset.

Class Imbalance

With imbalanced classes, random splits may produce folds with few or no minority class samples. Stratified CV keeps the class proportions in each fold as close as possible to those of the full dataset.
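
A short sketch of the difference, assuming scikit-learn and a toy label vector with 90 majority and 10 minority samples (made up here for illustration). It counts minority samples in each test fold under plain and stratified splitting.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 of class 0, 10 of class 1 (illustrative assumption)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

# Plain k-fold ignores the labels when assigning samples to folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)
plain_counts = [int(y[test].sum()) for _, test in kf.split(X)]

# Stratified k-fold needs y so it can balance the classes across folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
stratified_counts = [int(y[test].sum()) for _, test in skf.split(X, y)]

print("Minority samples per fold (plain):     ", plain_counts)
print("Minority samples per fold (stratified):", stratified_counts)
```

With 10 minority samples and 5 folds, the stratified split places 2 in every fold, while the plain split can be uneven.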


Repeated Cross-Validation

Repeat k-fold CV multiple times with different random partitions to reduce variance.

| Repetition | Description |
|-----------|-------------|
| 1 × 10-CV | Standard 10-fold |
| 5 × 2-CV | 5 repetitions of 2-fold |
| 10 × 10-CV | 10 repetitions of 10-fold |
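
The variance reduction can be checked empirically. The sketch below assumes scikit-learn and synthetic data, with a hypothetical helper named `cv_mse`. It recomputes the CV estimate under 15 different random seeds, once with a single 5-fold partition and once with 10 repetitions, and compares how much the estimates move.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, RepeatedKFold, cross_val_score

# Synthetic data (illustrative assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
model = LinearRegression()

def cv_mse(splitter):
    """Mean CV MSE for the given splitter (hypothetical helper)."""
    scores = cross_val_score(model, X, y, cv=splitter,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

seeds = range(15)
# One 5-fold partition per seed
single = [cv_mse(KFold(n_splits=5, shuffle=True, random_state=s)) for s in seeds]
# Ten 5-fold partitions averaged per seed
repeated = [cv_mse(RepeatedKFold(n_splits=5, n_repeats=10, random_state=s))
            for s in seeds]

single_spread = np.std(single)
repeated_spread = np.std(repeated)
print(f"Spread across seeds, 1 x 5-fold:  {single_spread:.3f}")
print(f"Spread across seeds, 10 x 5-fold: {repeated_spread:.3f}")
```

The repeated estimate depends less on which partition happened to be drawn, at ten times the computational cost.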


Nested Cross-Validation

Nested cross-validation is used when model selection and performance estimation must be done on the same dataset.

Definition: Nested CV

  • Outer loop: Estimates generalization error

  • Inner loop: Selects the best model (tunes hyperparameters)

| Loop | Purpose |
|------|---------|
| Outer | Test on held-out fold -> unbiased performance estimate |
| Inner | Tune hyperparameters on training fold -> model selection |

Why Nested CV?

Without nesting, the performance estimate is optimistically biased because the same data is used for both tuning and evaluation.
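
One compact way to express the two loops, assuming scikit-learn, is to pass a `GridSearchCV` object (the inner loop) to `cross_val_score` (the outer loop). The data, the Ridge model, and the alpha grid below are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic data (illustrative assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes alpha
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates error

# Inner loop: grid search over alpha, refit on each outer training fold
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0, 100.0]},
    cv=inner_cv,
    scoring="neg_mean_squared_error",
)

# Outer loop: each held-out fold is never seen by the search that it evaluates
nested_scores = cross_val_score(search, X, y, cv=outer_cv,
                                scoring="neg_mean_squared_error")
nested_mse = -nested_scores.mean()
print(f"Nested CV MSE: {nested_mse:.2f}")
```

Because the search is rerun inside every outer training fold, the selected alpha may differ from fold to fold; the outer score evaluates the whole tuning procedure rather than one fixed alpha.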


Python Implementation


```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (KFold, LeaveOneOut, RepeatedKFold,
                                     cross_val_score)

# Generate data
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# 10-Fold CV (scikit-learn returns negated MSE, so flip the sign)
kf = KFold(n_splits=10, shuffle=True, random_state=42)
lr = LinearRegression()
scores_10fold = cross_val_score(lr, X, y, cv=kf, scoring='neg_mean_squared_error')
print(f"10-Fold CV MSE: {-scores_10fold.mean():.2f} (+/- {scores_10fold.std():.2f})")

# LOO-CV: each test set is one observation, so each score is one squared error
loo = LeaveOneOut()
scores_loo = cross_val_score(lr, X, y, cv=loo, scoring='neg_mean_squared_error')
print(f"LOO-CV MSE: {-scores_loo.mean():.2f}")

# Repeated 5-Fold CV (10 repetitions, 50 fits in total)
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores_repeated = cross_val_score(lr, X, y, cv=rkf, scoring='neg_mean_squared_error')
print(f"Repeated 5-Fold MSE: {-scores_repeated.mean():.2f} (+/- {scores_repeated.std():.2f})")

# Nested CV for model selection
alphas = [0.1, 1.0, 10.0, 100.0]
outer_scores = []
selected_alphas = []  # alpha chosen in each outer fold (may differ across folds)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in outer_cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Inner loop: select alpha using the outer training fold only
    best_alpha = None
    best_score = -np.inf
    for alpha in alphas:
        ridge = Ridge(alpha=alpha)
        inner_scores = cross_val_score(ridge, X_train, y_train, cv=3,
                                       scoring='neg_mean_squared_error')
        if inner_scores.mean() > best_score:
            best_score = inner_scores.mean()
            best_alpha = alpha

    # Outer loop: refit with the chosen alpha and evaluate on the held-out fold
    ridge_best = Ridge(alpha=best_alpha)
    ridge_best.fit(X_train, y_train)
    outer_scores.append(mean_squared_error(y_test, ridge_best.predict(X_test)))
    selected_alphas.append(best_alpha)

print(f"\nNested CV MSE: {np.mean(outer_scores):.2f} (+/- {np.std(outer_scores):.2f})")
print(f"Selected alpha per outer fold: {selected_alphas}")
```


Worked Example

Example: Comparing Regression Models

Evaluating Linear, Ridge, and Lasso regression using 10-fold CV:

| Model | CV MSE | Std |
|-------|--------|-----|
| Linear Regression | 102.5 | 15.3 |
| Ridge (α = 1) | 98.2 | 14.1 |
| Lasso (α = 0.1) | 96.8 | 13.7 |

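Ridge and Lasso achieve a lower CV MSE than plain linear regression, but the gaps (about 4 to 6 units) are well within one fold-to-fold standard deviation (about 14 to 15), so the ranking is suggestive rather than conclusive. Lasso has the lowest MSE, with the added benefit of feature selection (it can shrink some coefficients to exactly zero).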
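
A comparison of this kind could be run as sketched below, assuming scikit-learn. The dataset is synthetic and chosen here for illustration, so the printed numbers will not match the table above, which comes from the lesson's own data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with some uninformative features (illustrative assumption)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

# Use the same folds for every model so the comparison is paired
cv = KFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge (alpha=1)": Ridge(alpha=1.0),
    "Lasso (alpha=0.1)": Lasso(alpha=0.1),
}

results = {}
for name, model in models.items():
    fold_mse = -cross_val_score(model, X, y, cv=cv,
                                scoring="neg_mean_squared_error")
    results[name] = (fold_mse.mean(), fold_mse.std())
    print(f"{name:<20} CV MSE: {fold_mse.mean():8.2f}  Std: {fold_mse.std():6.2f}")
```

Reusing one `KFold` object with a fixed `random_state` means every model is scored on identical splits, which removes partition noise from the comparison.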


Key Takeaways

Summary: Cross-Validation

  • CV provides an approximately unbiased estimate of generalization error (slightly pessimistic, since each model trains on less than the full dataset)

  • 10-fold CV is the standard choice; use LOO for small datasets

  • Stratified CV preserves the overall class proportions in each fold

  • Nested CV is needed for simultaneous model selection and evaluation

  • Repeated CV reduces variance from a single random partition

  • Always shuffle the data before partitioning (except for time series)

  • For time series, use forward chaining (expanding- or rolling-window) CV, which always trains on the past and tests on the future
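
Forward chaining can be sketched with scikit-learn's `TimeSeriesSplit`. The 12-point series below is a toy assumption, used only to make the index pattern easy to read.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (toy data, illustrative assumption)
X = np.arange(12).reshape(-1, 1)

# Each split trains on an initial segment and tests on the block that follows
tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in splits:
    print(f"train={train}  test={test}")
```

The training window grows with each split and never includes observations later than the test block. Passing `max_train_size` to `TimeSeriesSplit` caps the window length, which gives a rolling rather than expanding window.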

