Cross-Validation and Bias-Variance Tradeoff

Why Cross-Validation?

The Holdout Method Limitation

The naive holdout approach splits the data into training and test sets exactly once, so every conclusion rests on that one split.

Key limitations:

Performance estimate has high variance (depends on single split)
Wastes data (test set never used for training)
Can't assess model stability
A lucky or unlucky split can make the estimate optimistic or pessimistic
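The first limitation can be made concrete with a small experiment. The sketch below is illustrative only: the synthetic dataset, the logistic regression model, and the 20 seeds are all assumptions chosen for the demonstration. It repeats the same 80/20 holdout with different random splits and reports how far the accuracy estimate moves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not a real problem)
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=10, random_state=0)

holdout_scores = []
for seed in range(20):
    # Same data and same model; only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

holdout_scores = np.array(holdout_scores)
print(f"min={holdout_scores.min():.3f}, max={holdout_scores.max():.3f}, "
      f"std={holdout_scores.std():.3f}")
```

The spread between the minimum and maximum is the uncertainty a single holdout split hides.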

K-Fold Cross-Validation

The gold standard for model evaluation. Split the data into $k$ folds, train on $k-1$ of them, test on the remaining one, and rotate until every fold has served as the test set exactly once.

Mathematical Formulation

For dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ partitioned into $k$ folds $\{F_1, F_2, \ldots, F_k\}$:

$$\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\left(\hat{f}^{-F_i}, F_i\right)$$

where $\hat{f}^{-F_i}$ is the model trained on all data except fold $F_i$, and $\mathcal{L}$ is the loss function.
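The formula above can be written out directly. This is a minimal NumPy sketch, not a replacement for a library splitter: the helper names `k_fold_cv`, `fit_least_squares`, and `mse` are hypothetical, the data is synthetic, and the model is plain least squares with squared-error loss.

```python
import numpy as np

def k_fold_cv(X, y, k, fit, loss, seed=0):
    """Average held-out loss over k folds, i.e. CV_(k) in the formula above."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)              # F_1, ..., F_k
    fold_losses = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        predict = fit(X[train_idx], y[train_idx])   # model trained without F_i
        fold_losses.append(loss(y[test_idx], predict(X[test_idx])))
    return float(np.mean(fold_losses)), fold_losses

def fit_least_squares(X_train, y_train):
    """Ordinary least squares with an intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Synthetic regression data (assumed for illustration)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

cv_score, fold_losses = k_fold_cv(X, y, k=5, fit=fit_least_squares, loss=mse)
print(f"CV_(5) = {cv_score:.4f}")
```

Because the noise standard deviation is 0.5, the printed value should sit near the noise variance of 0.25, which is the best any model can achieve on this data.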

Choice of k

k	Pros	Cons
5	Good bias-variance tradeoff, cheap to run; the standard choice	Slightly pessimistic estimate (each model trains on 80% of the data)
10	Lower bias in the estimate	Higher computational cost
$n$ (LOO)	Nearly unbiased	High variance, expensive

Stratified K-Fold

Ensures each fold maintains the same class distribution as the full dataset, which is critical for imbalanced problems.
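To see what stratification changes, the sketch below compares the positive-class rate in each test fold under `KFold` and `StratifiedKFold`. The 10% positive rate and the helper `positive_rate_per_fold` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 10% positives (an assumed ratio for illustration)
y = np.array([1] * 20 + [0] * 180)
X = np.zeros((len(y), 1))   # features do not affect how the folds are built

def positive_rate_per_fold(splitter):
    """Fraction of positive labels in each test fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]

plain = positive_rate_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = positive_rate_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold          :", np.round(plain, 3))
print("StratifiedKFold:", np.round(strat, 3))
```

With plain `KFold` the positive rate drifts from fold to fold, while the stratified folds each hold the dataset-wide rate.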

Leave-One-Out (LOO) Cross-Validation

A special case of K-Fold where $k = n$ (number of samples):

$$\text{LOO-CV} = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(\hat{f}^{-x_i}, (x_i, y_i)\right)$$

Characteristics:

Nearly unbiased estimate of generalization error
High variance (the training sets overlap almost completely, so the $n$ fitted models and their errors are highly correlated)
Computationally expensive: $n$ model fits
Asymptotically equivalent to AIC for linear models
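The cost of $n$ separate fits is easy to see in code. The sketch below runs explicit LOO for ordinary least squares on synthetic data (an assumption for illustration) and, as a side note, checks it against the well-known leverage shortcut for linear least squares, which recovers the same value from a single fit. The shortcut applies to this specific model class and is not a general substitute for LOO.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=n)

# Explicit LOO: n separate least-squares fits, each leaving out one sample
loo_sq_errors = []
for i in range(n):
    keep = np.arange(n) != i
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo_sq_errors.append((y[i] - X[i] @ coef) ** 2)
loo_explicit = float(np.mean(loo_sq_errors))

# Shortcut for ordinary least squares: one fit, residuals rescaled by leverage
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
residuals = y - H @ y
loo_shortcut = float(np.mean((residuals / (1.0 - np.diag(H))) ** 2))

print(f"explicit: {loo_explicit:.6f}, shortcut: {loo_shortcut:.6f}")
```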

Time Series Cross-Validation

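Standard K-Fold violates temporal ordering: a model can end up trained on observations that come after the ones it is tested on. Use expanding or sliding windows instead, so that training data always precedes test data.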
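Both window types can be sketched with scikit-learn's `TimeSeriesSplit`: the default behaves as an expanding window, and setting `max_train_size` caps the training window so that it slides. The 24-point toy series and the window size of 8 are assumptions for the illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)   # 24 time-ordered observations (toy data)

expanding = TimeSeriesSplit(n_splits=4)                   # training window grows
sliding = TimeSeriesSplit(n_splits=4, max_train_size=8)   # training window is capped

for name, splitter in [("expanding", expanding), ("sliding", sliding)]:
    print(name)
    for train_idx, test_idx in splitter.split(X):
        # Every training index precedes every test index
        print(f"  train=[{train_idx[0]}..{train_idx[-1]}] "
              f"test=[{test_idx[0]}..{test_idx[-1]}]")
```

A sliding window is usually the better fit when older observations stop being representative; an expanding window uses all available history.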

Bias-Variance Tradeoff

Mathematical Decomposition

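Assume $y = f(x) + \epsilon$, where $f$ is the true function and $\epsilon$ is noise with mean zero and variance $\sigma^2_{\epsilon}$. For a model $\hat{f}$ trained on dataset $\mathcal{D}$, the expected squared prediction error at point $x$, taken over training sets and noise, decomposes as: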

$$\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\text{Bias}^2\left(\hat{f}(x)\right)}_{\text{Systematic error}} + \underbrace{\text{Var}\left(\hat{f}(x)\right)}_{\text{Sensitivity to data}} + \underbrace{\sigma^2_{\epsilon}}_{\text{Irreducible noise}}$$

where:

$$\text{Bias}\left(\hat{f}(x)\right) = \mathbb{E}\left[\hat{f}(x)\right] - f(x)$$

$$\text{Var}\left(\hat{f}(x)\right) = \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]$$
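The decomposition can be estimated numerically by refitting the same model class on many independently drawn training sets. The sketch below makes several assumptions for illustration: a true function $f(x) = \sin(2\pi x)$, Gaussian noise with standard deviation 0.3, a fixed grid of 30 training inputs (so only the noise changes between datasets), and polynomial regression of degree 1, 3, and 9 as the models.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                                 # assumed noise std (sigma_eps)
x_train = np.linspace(0.0, 1.0, 30)         # fixed design: only the noise varies
x_test = np.linspace(0.05, 0.95, 50)

def true_f(x):
    """Assumed true function f."""
    return np.sin(2 * np.pi * x)

def bias2_and_variance(degree, n_datasets=300):
    """Estimate Bias^2 and Var of a degree-`degree` polynomial fit, averaged over x_test."""
    preds = np.empty((n_datasets, len(x_test)))
    for d in range(n_datasets):
        y_train = true_f(x_train) + rng.normal(scale=sigma, size=len(x_train))
        coefs = np.polyfit(x_train, y_train, degree)
        preds[d] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)                     # estimate of E[f_hat(x)]
    bias2 = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return float(bias2), float(variance)

results = {deg: bias2_and_variance(deg) for deg in (1, 3, 9)}
for deg, (b2, var) in results.items():
    print(f"degree={deg}: bias^2={b2:.4f}, variance={var:.4f}, "
          f"expected error={b2 + var + sigma**2:.4f}")
```

The straight line has large bias and small variance, the degree-9 polynomial the reverse, and the expected error is the sum of both plus the fixed noise term.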

Intuition

Underfitting vs Overfitting Diagnosis

Diagnostic Summary

Symptom	Diagnosis	Remedy
Train acc ≫ Val acc	Overfitting	Regularization, more data, simpler model
Train acc ≈ Val acc (both low)	Underfitting	More features, more complex model
High CV variance	Unstable model	More data, simpler model, ensemble
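The table can be turned into a rough programmatic check. The helper below is hypothetical, and its thresholds (`gap_tol`, `low_score`, `std_tol`) are arbitrary assumptions that would need tuning for any real metric and problem; it only illustrates how the three symptoms map onto per-fold scores.

```python
import numpy as np

def diagnose(train_scores, val_scores, gap_tol=0.05, low_score=0.70, std_tol=0.05):
    """Map per-fold accuracies to the symptoms in the table (heuristic thresholds)."""
    train_mean = float(np.mean(train_scores))
    val_mean = float(np.mean(val_scores))
    val_std = float(np.std(val_scores))
    findings = []
    if train_mean - val_mean > gap_tol:
        findings.append("overfitting")       # train acc far above val acc
    elif train_mean < low_score and val_mean < low_score:
        findings.append("underfitting")      # both scores low and close together
    if val_std > std_tol:
        findings.append("unstable")          # validation score varies widely across folds
    return findings or ["no obvious problem"]

print(diagnose([0.99, 0.98, 0.99], [0.81, 0.80, 0.82]))
print(diagnose([0.62, 0.61, 0.63], [0.60, 0.61, 0.59]))
print(diagnose([0.90, 0.91, 0.89], [0.95, 0.78, 0.88]))
```

Per-fold train and validation scores of this kind can be obtained from scikit-learn's `cross_validate` with `return_train_score=True`.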

Model Selection with Cross-Validation

Use nested cross-validation to avoid optimistic bias when selecting hyperparameters:

Architecture Diagram

Outer Loop: Evaluate generalization
┌─────────────────────────────────────────────┐
│  Split into Train/Test                      │
│                                             │
│  Inner Loop: Hyperparameter Tuning          │
│  ┌───────────────────────────────────────┐  │
│  │  Split Train into Train/Val           │  │
│  │  Try all hyperparameter combinations  │  │
│  │  Select best by validation score      │  │
│  └───────────────────────────────────────┘  │
│                                             │
│  Train final model with best params on Train│
│  Evaluate on outer Test                     │
└─────────────────────────────────────────────┘
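The diagram can be followed step by step with an explicit outer loop. This is a sketch under assumed settings: a synthetic dataset, logistic regression as the model, and a small illustrative grid over `C`. `GridSearchCV` plays the role of the inner loop, and its default `refit=True` performs the "train final model with best params on Train" step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}   # illustrative grid

outer_scores, chosen_params = [], []
for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: tune hyperparameters on the outer-train portion only
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                          cv=inner_cv, scoring="accuracy")
    search.fit(X[train_idx], y[train_idx])   # refits the best model on all of outer-train
    # Outer loop: score the tuned model on data the search never saw
    outer_scores.append(search.score(X[test_idx], y[test_idx]))
    chosen_params.append(search.best_params_["C"])

print(f"Nested CV accuracy: {np.mean(outer_scores):.3f} ± {np.std(outer_scores):.3f}")
print("C chosen per outer fold:", chosen_params)
```

The selected `C` may differ between outer folds. That is expected: nested CV evaluates the whole tuning procedure rather than one fixed hyperparameter setting.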

Implementation in Python

import numpy as np
from sklearn.model_selection import (
    KFold, StratifiedKFold, TimeSeriesSplit,
    cross_val_score, GridSearchCV, learning_curve
)
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=2,
                           random_state=42)

# --- Basic K-Fold ---
# RidgeClassifier is the classification counterpart of Ridge; a regressor's
# continuous predictions cannot be scored with 'accuracy'.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RidgeClassifier(alpha=1.0), X, y, cv=kfold, scoring='accuracy')
print(f"K-Fold CV: {scores.mean():.4f} ± {scores.std():.4f}")

# --- Stratified K-Fold ---
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                         X, y, cv=skfold, scoring='accuracy')
print(f"Stratified CV: {scores.mean():.4f} ± {scores.std():.4f}")

# --- Nested CV for model selection ---
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, 20, None]
}

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

grid_search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='accuracy')
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

# --- Time Series CV ---
tscv = TimeSeriesSplit(n_splits=5)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Split {i+1}: Train={len(train_idx)}, Test={len(test_idx)}")

# --- Learning Curves ---
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='accuracy', n_jobs=-1
)

print("\nLearning Curve (training-set sizes):")
for size, train, val in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"  n={size:4d}: train={train:.3f}, val={val:.3f}, gap={train-val:.3f}")

Key Takeaways

Always use cross-validation; single holdout estimates are unreliable
Stratified K-Fold is essential for classification (especially with imbalanced classes)
Time series require temporal ordering; never shuffle
Bias-variance tradeoff is fundamental: minimize the total error, not just bias
Learning curves reveal whether you need more data, more features, or regularization
Nested CV avoids optimistic bias in model selection
Variance of CV scores matters: high variance signals instability