Model Selection

Why It Matters

Choosing the right model complexity prevents overfitting and improves generalization. A model that is too simple underfits (high bias); a model that is too complex overfits (high variance). Model selection methods such as cross-validation, AIC, and BIC look for the point that balances fit against complexity, so that predictions on unseen data are reliable.


Overview

The bias-variance tradeoff decomposes expected prediction error into squared bias (error from incorrect assumptions), variance (error from sensitivity to the training data), and irreducible noise. K-fold cross-validation estimates out-of-sample performance by training on $K-1$ folds and testing on the held-out fold, repeating so that each fold is held out once. AIC ($-2\ell + 2k$) targets predictive accuracy and tends to favor larger models. BIC ($-2\ell + k\log n$) penalizes complexity more heavily than AIC whenever $\log n > 2$ (that is, $n \ge 8$) and tends to favor simpler models. Regularization (Ridge/Lasso) controls complexity implicitly by shrinking coefficients; Lasso drives some coefficients exactly to zero, which performs automatic feature selection. The goal is always to minimize expected prediction error on new data.


Key Concepts

Bias-Variance Decomposition

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$

Here,

  • $\text{Bias}^2$ = Error from incorrect assumptions (underfitting)
  • $\text{Variance}$ = Error from sensitivity to training data (overfitting)
  • $\text{Noise}$ = Irreducible error from random variation
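The decomposition can be checked numerically. The sketch below is illustrative only: the true function `true_f`, the noise level `sigma`, the fixed design of ten points, and the two toy predictors (`constant_model` and `nearest_neighbor`) are all assumptions made for the example, not part of the lesson. It repeatedly redraws the training responses, records each model's prediction at one test point, and estimates squared bias and variance from those predictions.

```python
import random
import statistics

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return x ** 2

def constant_model(xs, ys, x0):
    # Very simple model: ignore x and predict the training mean (high bias).
    return statistics.fmean(ys)

def nearest_neighbor(xs, ys, x0):
    # Very flexible model: copy the response of the closest training point (high variance).
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

def simulate(predict, x0, sigma=0.3, n_reps=2000, seed=0):
    """Estimate (bias^2, variance, noise) of `predict` at the test point x0."""
    rng = random.Random(seed)
    xs = [i / 9 for i in range(10)]  # fixed design on [0, 1]
    preds = []
    for _ in range(n_reps):
        # Redraw the noisy training responses, then predict at x0.
        ys = [true_f(x) + rng.gauss(0, sigma) for x in xs]
        preds.append(predict(xs, ys, x0))
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    noise = sigma ** 2  # irreducible: variance of a fresh observation at x0
    return bias_sq, variance, noise

for name, model in [("constant", constant_model), ("1-NN", nearest_neighbor)]:
    b, v, n = simulate(model, x0=1.0)
    print(f"{name:>8}: bias^2={b:.4f} variance={v:.4f} noise={n:.4f} total={b + v + n:.4f}")
```

Under these assumptions the constant model should show the larger squared bias and the nearest-neighbor model the larger variance, which mirrors the underfitting and overfitting labels above.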

AIC (Akaike Information Criterion)

$$\text{AIC} = -2\ell + 2k$$

Here,

  • $\ell$ = Maximized log-likelihood
  • $k$ = Number of parameters

BIC (Bayesian Information Criterion)

$$\text{BIC} = -2\ell + k\log n$$

Here,

  • $n$ = Sample size
  • $k$ = Number of parameters
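Both criteria are simple functions of the maximized log-likelihood. The following is a minimal sketch of the two formulas above; the helper `gaussian_log_likelihood` is an added assumption that covers the common case of a model with Gaussian errors, where $\ell$ can be computed from the residual sum of squares, and the numbers in the usage lines are made up.

```python
import math

def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a Gaussian-error model from its residual sum of squares."""
    sigma2_hat = rss / n  # maximum-likelihood estimate of the noise variance
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2_hat) + 1)

def aic(log_lik, k):
    # AIC = -2*l + 2k
    return -2 * log_lik + 2 * k

def bic(log_lik, k, n):
    # BIC = -2*l + k*log(n), using the natural logarithm
    return -2 * log_lik + k * math.log(n)

# Illustrative values only: n = 100 observations, k = 5 parameters, RSS = 250.
ll = gaussian_log_likelihood(rss=250.0, n=100)
print("AIC:", aic(ll, k=5))
print("BIC:", bic(ll, k=5, n=100))
```

Because the two criteria differ only in the penalty, BIC exceeds AIC for the same model exactly when $\log n > 2$. When counting $k$ for a Gaussian model, conventions differ on whether the noise variance counts as a parameter; what matters is using the same convention for every candidate.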

K-Fold Cross-Validation

$$CV_{(K)} = \frac{1}{K}\sum_{i=1}^{K} \text{MSE}_i$$

Here,

  • $K$ = Number of folds
  • $\text{MSE}_i$ = Mean squared error on fold $i$
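The formula averages one held-out error per fold. As a sketch of the procedure, assuming a one-predictor least-squares line as the model (the helpers `k_fold_indices`, `fit_line`, and `cv_mse` are names invented for this example), K-fold CV can be written with the standard library alone:

```python
import random

def k_fold_indices(n, K, seed=0):
    """Shuffle the indices 0..n-1 and deal them into K disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::K] for i in range(K)]

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b1 * mx, b1

def cv_mse(xs, ys, K=5, seed=0):
    """CV_(K): the average over folds of the MSE on each held-out fold."""
    fold_mses = []
    for held_out in k_fold_indices(len(xs), K, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the other K-1 folds only.
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score on the fold that was left out.
        errs = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / K

# Made-up data: a noisy line.
rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
print("5-fold CV MSE:", cv_mse(xs, ys, K=5))
```

Each observation is used for testing exactly once and for training K-1 times, which is what lets the average stand in for out-of-sample error.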

Ridge Regression (L2)

$$\hat{\beta}_{\text{Ridge}} = \arg\min_\beta \left[\sum_i (y_i - X_i\beta)^2 + \alpha\sum_j \beta_j^2\right]$$

Here,

  • $\alpha$ = Regularization strength
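To see how the penalty shrinks a coefficient, it helps to look at the simplest case. The sketch below assumes a single predictor and no intercept, which is a simplification made for this example; in that case setting the derivative of the objective above to zero gives $\hat{\beta} = \sum x_i y_i / (\sum x_i^2 + \alpha)$. The function name `ridge_beta` and the data are made up.

```python
def ridge_beta(xs, ys, alpha):
    """Ridge estimate for one predictor, no intercept.

    Minimizes sum((y - x*beta)^2) + alpha * beta^2, whose solution is
    beta = sum(x*y) / (sum(x^2) + alpha).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for alpha in [0.0, 1.0, 14.0, 1000.0]:
    print(f"alpha={alpha:>7}: beta={ridge_beta(xs, ys, alpha):.4f}")
```

With $\alpha = 0$ this is ordinary least squares. As $\alpha$ grows the estimate moves toward zero but, for nonzero $\sum x_i y_i$, never reaches it, which is why Ridge shrinks without selecting features.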

Lasso Regression (L1)

$$\hat{\beta}_{\text{Lasso}} = \arg\min_\beta \left[\sum_i (y_i - X_i\beta)^2 + \alpha\sum_j |\beta_j|\right]$$

Here,

  • $\alpha$ = Regularization strength
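The L1 penalty behaves differently because it can set a coefficient exactly to zero. As a sketch under the same simplifying assumption of one predictor and no intercept (an assumption of this example, not of the lesson), the objective above has a closed-form minimizer given by soft-thresholding: $\hat{\beta} = \operatorname{sign}(z)\max(|z| - \alpha/2,\, 0) / \sum x_i^2$ with $z = \sum x_i y_i$. The threshold is $\alpha/2$ because the squared-error term here carries no factor of $1/2$; libraries that scale the loss differently use a different threshold. The names `soft_threshold` and `lasso_beta` are made up.

```python
import math

def soft_threshold(z, t):
    """Shrink z toward zero by t, and return exactly 0 when |z| <= t."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def lasso_beta(xs, ys, alpha):
    """Lasso estimate for one predictor, no intercept.

    Minimizes sum((y - x*beta)^2) + alpha * |beta|.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, alpha / 2) / sxx

# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for alpha in [0.0, 28.0, 56.0, 100.0]:
    print(f"alpha={alpha:>6}: beta={lasso_beta(xs, ys, alpha):.4f}")
```

Once $\alpha/2$ reaches $|\sum x_i y_i|$ the coefficient is exactly zero, which is the mechanism behind Lasso's feature selection.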

AIC vs BIC

| Criterion | Penalty | Favors | Best For | Consistent? |
|---|---|---|---|---|
| AIC | $2k$ | Larger models | Prediction accuracy | No |
| BIC | $k\log n$ | Simpler models | Interpretability | Yes |

Regularization Comparison

| Method | Penalty | Effect | Feature Selection? |
|---|---|---|---|
| Ridge (L2) | $\alpha\sum\beta_j^2$ | Shrinks all coefficients | No |
| Lasso (L1) | $\alpha\sum\lvert\beta_j\rvert$ | Drives some to zero | Yes |
| Elastic Net | L1 + L2 | Combines both | Partially |

Quick Example

Choosing Between Models

Model A: AIC = 100, BIC = 110. Model B: AIC = 105, BIC = 105.

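  • If prediction is the goal: prefer Model A (lower AIC).
  • If interpretability matters or the true model is believed to be sparse: prefer Model B (lower BIC).
  • In large samples, BIC is consistent: it selects the true model with probability approaching 1, provided the true model is in the candidate set. AIC instead aims to minimize the expected Kullback-Leibler divergence from the true distribution, which targets the best predictive model.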
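The comparison above amounts to taking the minimum of each criterion across candidates. A small sketch using the example's numbers follows; the dictionary layout and the helper name `best_by` are choices made for this illustration.

```python
# Criterion values from the example above.
models = {
    "Model A": {"AIC": 100, "BIC": 110},
    "Model B": {"AIC": 105, "BIC": 105},
}

def best_by(criterion):
    """Return the model name with the lowest value of the given criterion."""
    return min(models, key=lambda name: models[name][criterion])

for criterion in ("AIC", "BIC"):
    winner = best_by(criterion)
    lowest = models[winner][criterion]
    # Differences from the best model are often easier to read than raw values.
    deltas = {name: vals[criterion] - lowest for name, vals in models.items()}
    print(f"{criterion}: best = {winner}, differences = {deltas}")
```

Only differences between models fit to the same data are meaningful; the absolute AIC or BIC value of a single model carries no information on its own.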

Cross-Validation for Regularization

Ridge regression with $\alpha = 0.01$: CV MSE = 45.2. With $\alpha = 1$: CV MSE = 38.7. With $\alpha = 100$: CV MSE = 52.1.

The best of the three values is $\alpha = 1$: it has the lowest CV MSE and therefore the best balance of bias and variance among the candidates. Too small an $\alpha$ overfits; too large an $\alpha$ underfits.
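The selection step can be sketched end to end: compute a CV MSE for each candidate $\alpha$ and keep the smallest. The example below is an illustration under stated assumptions, not the computation behind the numbers above: it uses synthetic data, a single predictor with no intercept (so the ridge fit has the closed form $\sum x_i y_i / (\sum x_i^2 + \alpha)$), and made-up helper names `ridge_fit` and `cv_mse`.

```python
import random

def ridge_fit(xs, ys, alpha):
    # One-predictor, no-intercept ridge solution.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def cv_mse(xs, ys, alpha, K=5):
    """K-fold CV MSE for a given alpha; fold f holds out every K-th point starting at f."""
    n = len(xs)
    fold_mses = []
    for f in range(K):
        test = range(f, n, K)
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        beta = ridge_fit([xs[i] for i in train], [ys[i] for i in train], alpha)
        fold_mses.append(sum((ys[i] - beta * xs[i]) ** 2 for i in test) / len(test))
    return sum(fold_mses) / K

# Synthetic data (assumption): y = 1.5*x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [1.5 * x + rng.gauss(0, 1.0) for x in xs]

alphas = [0.01, 1, 100]
scores = {a: cv_mse(xs, ys, a) for a in alphas}
best_alpha = min(scores, key=scores.get)

for a in alphas:
    print(f"alpha={a:>6}: CV MSE={scores[a]:.3f}")
print("selected alpha:", best_alpha)
```

In practice the grid is usually much finer and spaced on a log scale, and the selected $\alpha$ is then used to refit the model on all of the training data.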


Key Takeaways

Summary: Model Selection

  • Bias-Variance Tradeoff: Increasing complexity reduces bias but increases variance. The best model minimizes the sum of squared bias and variance.
  • Cross-Validation: K-fold CV estimates generalization error. Use it to compare models and tune hyperparameters.
  • AIC vs BIC: AIC targets prediction error (tends to favor larger models). BIC penalizes complexity more heavily once $n \ge 8$ (tends to favor simpler models).
  • Regularization: Ridge ($\ell_2$) shrinks all coefficients. Lasso ($\ell_1$) drives some to zero, performing feature selection.
  • Underfitting vs Overfitting: High bias = too simple; high variance = too complex. Monitor training vs. validation error curves.
  • Workflow: Hold out a test set -> use CV on the remaining data to select hyperparameters -> evaluate once on the held-out test set.
  • Adjusted R²: Penalizes for adding predictors. Use it to compare regression models with different numbers of features.
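The adjusted R² takeaway can be made concrete with the standard formula $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the sample size and $p$ is the number of predictors excluding the intercept. The lesson does not state this formula, so the sketch below, including the function name `adjusted_r2` and the illustrative numbers, is an addition.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a linear model with n observations and p predictors (intercept not counted)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative numbers: a sixth predictor raises R^2 only slightly.
smaller = adjusted_r2(r2=0.800, n=50, p=5)
larger = adjusted_r2(r2=0.801, n=50, p=6)
print(f"5 predictors: {smaller:.4f}")
print(f"6 predictors: {larger:.4f}")
```

Plain R² never decreases when a predictor is added, so it cannot be used to compare models of different sizes; the adjustment lowers the score when the gain in fit is too small to justify the extra parameter.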

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Cross-Validation

  • Cross-Validation: K-fold, stratified, leave-one-out, and nested cross-validation

Information Criteria

  • AIC and BIC: Derivation, interpretation, model averaging, and when each is appropriate

ROC and AUC

  • ROC and AUC: Threshold-independent evaluation, ROC curves, AUC interpretation, and trade-offs
