Model Selection

Why It Matters

Choosing the right model complexity guards against both underfitting and overfitting. A model that is too simple underfits (high bias); one that is too complex overfits (high variance). Model selection methods such as cross-validation, AIC, and BIC look for the level of complexity that balances goodness of fit against model size, so that predictions on unseen data are reliable.

Overview

The bias-variance tradeoff decomposes expected prediction error into bias² (error from incorrect assumptions), variance (error from sensitivity to the training data), and irreducible noise. K-fold cross-validation estimates out-of-sample performance by training on $K-1$ folds, testing on the held-out fold, and repeating until every fold has been held out once. AIC ( $-2\ell + 2k$, with $\ell$ the maximized log-likelihood and $k$ the number of parameters) targets predictive accuracy and tends to favor larger models. BIC ( $-2\ell + k\log n$ ) penalizes each parameter more heavily than AIC in all but very small samples, so it tends to favor simpler models. Regularization (Ridge/Lasso) controls complexity implicitly by shrinking coefficients; Lasso drives some coefficients exactly to zero, which amounts to automatic feature selection. In every case the goal is to minimize expected prediction error on new data.

Key Concepts

Bias-Variance Decomposition

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$

Here,

$\text{Bias}^2$ = Error from incorrect assumptions (underfitting)
$\text{Variance}$ = Error from sensitivity to training data (overfitting)
$\text{Noise}$ = Irreducible error from random variation
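
As a minimal sketch of the decomposition, consider a deliberately simple setting that is an assumption of this example rather than part of the text above: observations $y_i = \mu + \varepsilon_i$ with noise variance $\sigma^2$, and a new observation predicted by the shrunken sample mean $c\,\bar{y}$. For this predictor the three terms have closed forms, so the tradeoff can be read off directly. The function and parameter names are illustrative.

```python
def error_terms(c, mu, sigma, n):
    """Bias^2, variance and noise for the predictor c * (sample mean of n points).

    Assumed setup (illustrative): y = mu + eps, with Var(eps) = sigma**2.
    """
    bias_sq = ((c - 1.0) * mu) ** 2        # E[c * ybar] - mu = (c - 1) * mu
    variance = (c ** 2) * sigma ** 2 / n   # Var(c * ybar) = c^2 * sigma^2 / n
    noise = sigma ** 2                     # variance of the new observation itself
    return bias_sq, variance, noise


def expected_error(c, mu, sigma, n):
    """Expected squared prediction error = bias^2 + variance + noise."""
    return sum(error_terms(c, mu, sigma, n))


if __name__ == "__main__":
    # Hypothetical values chosen only to make the tradeoff visible.
    for c in (1.0, 0.75, 0.5, 0.25):
        b, v, s = error_terms(c, mu=1.0, sigma=3.0, n=10)
        print(f"c={c:4.2f}  bias^2={b:.3f}  variance={v:.3f}  "
              f"noise={s:.3f}  total={b + v + s:.3f}")
```

With these assumed values, shrinking toward zero first lowers the total, because variance falls faster than bias² rises, and then raises it again. The noise term is the same for every choice of $c$.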

AIC (Akaike Information Criterion)

$$\text{AIC} = -2\ell + 2k$$

Here,

$\ell$ = Maximized log-likelihood
$k$ = Number of parameters

BIC (Bayesian Information Criterion)

$$\text{BIC} = -2\ell + k\log n$$

Here,

$\ell$ = Maximized log-likelihood
$n$ = Sample size ( $\log$ is the natural logarithm)
$k$ = Number of parameters
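
Both criteria are simple functions of the maximized log-likelihood. The sketch below assumes a Gaussian linear model whose log-likelihood is computed from the residual sum of squares. That likelihood, the helper names, the decision to count the error variance as a parameter, and the two example fits are all assumptions of this example.

```python
import math


def gaussian_loglik(rss, n):
    """Maximized Gaussian log-likelihood given the residual sum of squares."""
    sigma2_hat = rss / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2_hat) + 1.0)


def aic(loglik, k):
    """AIC = -2 * loglik + 2 * k."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """BIC = -2 * loglik + k * log(n), with the natural logarithm."""
    return -2.0 * loglik + k * math.log(n)


if __name__ == "__main__":
    n = 100
    # Hypothetical fits: (name, residual sum of squares, parameter count
    # including the error variance).
    for name, rss, k in [("small", 120.0, 3), ("large", 115.0, 6)]:
        ll = gaussian_loglik(rss, n)
        print(f"{name}: AIC={aic(ll, k):.2f}  BIC={bic(ll, k, n):.2f}")
```

Lower values are better for both criteria. Only differences between models fitted to the same data are meaningful, not the absolute values.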

K-Fold Cross-Validation

$$\text{CV}_{(K)} = \frac{1}{K}\sum_{i=1}^{K} \text{MSE}_i$$

Here,

$K$ = Number of folds
$\text{MSE}_i$ = Mean squared error on held-out fold $i$
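
The formula can be sketched in plain Python as follows. The single-predictor least-squares line, the contiguous unshuffled folds, and all function names are assumptions made to keep the example self-contained. In practice the data would normally be shuffled before splitting.

```python
def k_fold_indices(n, n_folds):
    """Split indices 0..n-1 into n_folds contiguous, nearly equal folds."""
    base, extra = divmod(n, n_folds)
    folds, start = [], 0
    for f in range(n_folds):
        size = base + (1 if f < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def cv_mse(x, y, n_folds):
    """Average of the per-fold test MSEs, i.e. CV_(K) for the fitted line."""
    fold_mses = []
    for test_idx in k_fold_indices(len(x), n_folds):
        held_out = set(test_idx)
        x_tr = [x[i] for i in range(len(x)) if i not in held_out]
        y_tr = [y[i] for i in range(len(y)) if i not in held_out]
        intercept, slope = fit_line(x_tr, y_tr)  # train on the other K-1 folds
        errs = [(y[i] - (intercept + slope * x[i])) ** 2 for i in test_idx]
        fold_mses.append(sum(errs) / len(errs))  # MSE_i on the held-out fold
    return sum(fold_mses) / n_folds
```

Calling `cv_mse(x, y, 5)` on two equal-length lists returns the 5-fold estimate. Every observation is used for testing exactly once and for training $K-1$ times.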

Ridge Regression (L2)

$$\hat{\beta}_{\text{Ridge}} = \arg\min_\beta \left[\sum_i (y_i - X_i\beta)^2 + \alpha\sum_j \beta_j^2\right]$$

Here,

$\alpha$ = Regularization strength ( $\alpha \ge 0$ )
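
For intuition, the Ridge objective has a closed-form minimizer when there is a single predictor and no intercept. That restriction is an assumption of this sketch, as is the function name. Setting the derivative of the objective to zero gives $\hat{\beta} = \sum x_i y_i / (\sum x_i^2 + \alpha)$.

```python
def ridge_1d(x, y, alpha):
    """Minimizer of sum((y_i - x_i * b)**2) + alpha * b**2 (one predictor, no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)  # alpha inflates the denominator, shrinking b


if __name__ == "__main__":
    x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # toy data with y = 2x
    for alpha in (0.0, 1.0, 14.0, 1000.0):
        print(f"alpha={alpha:7.1f}  beta={ridge_1d(x, y, alpha):.4f}")
```

The coefficient shrinks toward zero as $\alpha$ grows, but for finite $\alpha$ it never reaches zero exactly. That is why Ridge alone does not select features.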

Lasso Regression (L1)

$$\hat{\beta}_{\text{Lasso}} = \arg\min_\beta \left[\sum_i (y_i - X_i\beta)^2 + \alpha\sum_j |\beta_j|\right]$$

Here,

$\alpha$ = Regularization strength ( $\alpha \ge 0$ )
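
With a single predictor and no intercept (an assumption of this sketch), the Lasso objective as written above is minimized by soft-thresholding $\sum x_i y_i$ at $\alpha/2$. The threshold depends on how the loss is scaled: a formulation with a factor of $1/2$ or $1/(2n)$ in front of the squared error would use a different value. The function names are illustrative.

```python
def soft_threshold(z, t):
    """Move z toward zero by t, and return exactly 0.0 if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_1d(x, y, alpha):
    """Minimizer of sum((y_i - x_i * b)**2) + alpha * |b| (one predictor, no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, alpha / 2.0) / sxx


if __name__ == "__main__":
    x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # toy data with y = 2x
    for alpha in (0.0, 28.0, 56.0, 100.0):
        print(f"alpha={alpha:6.1f}  beta={lasso_1d(x, y, alpha):.4f}")
```

Unlike the squared penalty, the absolute-value penalty sets the coefficient to exactly zero once $\alpha$ is large enough. This is the mechanism behind Lasso's feature selection.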

AIC vs BIC

Criterion	Penalty	Favors	Best For	Consistent?
AIC	$2k$	Larger models	Prediction accuracy	No
BIC	$k\log n$	Simpler models	Interpretability	Yes
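
The "Favors" column follows from comparing the two penalties directly. For one and the same model (same $\ell$ and $k$), a short calculation shows when BIC charges more per parameter than AIC:

```latex
\begin{aligned}
\text{BIC} - \text{AIC} &= (-2\ell + k\log n) - (-2\ell + 2k) = k(\log n - 2), \\
k(\log n - 2) > 0 &\iff \log n > 2 \iff n > e^2 \approx 7.39 .
\end{aligned}
```

So for any sample with $n \ge 8$, each additional parameter costs more under BIC than under AIC, and the gap grows with $\log n$.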

Regularization Comparison

Method	Penalty	Effect	Feature Selection?
Ridge (L2)	$\alpha\sum\beta_j^2$	Shrinks all coefficients	No
Lasso (L1)	$\alpha\sum|\beta_j|$	Drives some to zero	Yes
Elastic Net	L1 + L2	Shrinks all coefficients and drives some to zero	Yes (via the L1 term)

Quick Example

Choosing Between Models

Model A: AIC = 100, BIC = 110. Model B: AIC = 105, BIC = 105.

If prediction is the goal: prefer Model A (lower AIC).
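If interpretability matters or the true model is believed to be sparse: prefer Model B (lower BIC).
In large samples, BIC is consistent: it selects the true model with probability approaching 1, provided that model is in the candidate set. AIC instead targets the candidate with the smallest expected KL divergence from the truth, i.e. the best predictive approximation.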
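
The comparison above amounts to taking the minimum of each criterion across the candidates. A small sketch using the example's own numbers (the dictionary layout and function names are illustrative):

```python
# Criterion values from the example above.
scores = {
    "A": {"AIC": 100, "BIC": 110},
    "B": {"AIC": 105, "BIC": 105},
}


def best_by(criterion, scores):
    """Name of the model with the lowest value of the given criterion."""
    return min(scores, key=lambda name: scores[name][criterion])


def deltas(criterion, scores):
    """Difference between each model's criterion value and the best one."""
    best = min(s[criterion] for s in scores.values())
    return {name: s[criterion] - best for name, s in scores.items()}


if __name__ == "__main__":
    for crit in ("AIC", "BIC"):
        print(crit, "prefers", best_by(crit, scores), deltas(crit, scores))
```

Reporting the differences as well as the winner shows how decisive the preference is. Here both criteria separate the two models by 5 units, in opposite directions.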

Cross-Validation for Regularization

Ridge regression with $\alpha = 0.01$: CV MSE = 45.2. With $\alpha = 1$: CV MSE = 38.7. With $\alpha = 100$: CV MSE = 52.1.

The best choice is $\alpha = 1$, which has the lowest CV MSE and therefore the best balance of bias and variance among the three. Too small an $\alpha$ overfits; too large an $\alpha$ underfits.
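
The selection procedure can be sketched end to end as follows. This is an illustration under stated assumptions: one predictor with no intercept (so Ridge has a closed form), interleaved folds, and synthetic data. It will not reproduce the CV MSE values quoted above, and all names are hypothetical.

```python
import random


def ridge_1d(x, y, alpha):
    """Closed-form Ridge coefficient for one predictor, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


def cv_mse_ridge(x, y, alpha, n_folds):
    """K-fold CV estimate of test MSE for a given alpha."""
    n = len(x)
    fold_mses = []
    for f in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == f]   # interleaved folds
        train_idx = [i for i in range(n) if i % n_folds != f]
        b = ridge_1d([x[i] for i in train_idx], [y[i] for i in train_idx], alpha)
        errs = [(y[i] - b * x[i]) ** 2 for i in test_idx]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / n_folds


def select_alpha(x, y, alphas, n_folds=5):
    """Return the alpha with the lowest CV MSE, plus all scores."""
    scores = {a: cv_mse_ridge(x, y, a, n_folds) for a in alphas}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data (assumed): y = 3x + Gaussian noise.
    x = [random.uniform(-1.0, 1.0) for _ in range(40)]
    y = [3.0 * xi + random.gauss(0.0, 1.0) for xi in x]
    best, scores = select_alpha(x, y, [0.01, 1.0, 100.0])
    for a, mse in scores.items():
        print(f"alpha={a:6.2f}  CV MSE={mse:.3f}")
    print("selected alpha:", best)
```

The grid is evaluated on held-out folds only, so the selected $\alpha$ reflects estimated out-of-sample error rather than training fit.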

Key Takeaways

Summary: Model Selection

Bias-Variance Tradeoff: Increasing complexity reduces bias but increases variance. The optimal model minimizes the sum of squared bias and variance.
Cross-Validation: K-fold CV estimates generalization error. Use it to compare models and tune hyperparameters.
AIC vs BIC: AIC targets predictive accuracy and tends to favor larger models. BIC penalizes complexity more heavily and tends to favor simpler models.
Regularization: Ridge ( $\ell_2$ ) shrinks all coefficients. Lasso ( $\ell_1$ ) drives some to zero, performing feature selection.
Underfitting vs Overfitting: High bias = too simple; high variance = too complex. Monitor training vs. validation error curves.
Workflow: Hold out a test set -> use CV on the remaining data to compare models and select hyperparameters -> evaluate once on the held-out test set.
Adjusted R²: Penalizes for adding predictors. Use it to compare models with different numbers of features.
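
Adjusted R² is the one quantity in this summary without a formula earlier on the page. The standard definition for a linear model with an intercept, $n$ observations, and $p$ predictors is $1 - (1 - R^2)\frac{n-1}{n-p-1}$. A minimal sketch, with a hypothetical function name and hypothetical example values:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a linear model with an intercept and p predictors, fitted to n rows."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


if __name__ == "__main__":
    n = 50
    # Hypothetical fits: five extra predictors raise R^2 only slightly.
    for p, r2 in [(3, 0.800), (8, 0.805)]:
        print(f"p={p}  R^2={r2:.3f}  adjusted R^2={adjusted_r2(r2, n, p):.3f}")
```

Plain R² never decreases when predictors are added. The adjusted version falls whenever the gain in fit is too small to justify the extra predictors.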

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Cross-Validation

Cross-Validation — K-fold, stratified, leave-one-out, and nested cross-validation

Information Criteria

AIC and BIC — Derivation, interpretation, model averaging, and when each is appropriate

ROC and AUC

ROC and AUC — Threshold-independent evaluation, ROC curves, AUC interpretation, and trade-offs
