ML Foundations
Preventing Overfitting — Ridge, Lasso, and Elastic Net
Regularization constrains model complexity by adding penalty terms to the loss function, helping models generalize better to unseen data.
- Ridge (L2) — shrinks weights toward zero to prevent overfitting when all features are potentially useful
- Lasso (L1) — zeros out irrelevant features, performing automatic feature selection
- Elastic Net — combines both penalties for the best of Ridge and Lasso
"Simplicity is the ultimate sophistication." — Leonardo da Vinci
Regularization — Complete Guide
Regularization prevents overfitting by adding a penalty term to the loss function, constraining model complexity.
The Problem
Definition: Overfitting
Overfitting occurs when a model learns the training data too well, including noise and random fluctuations, resulting in poor performance on new, unseen data.
[Figure: L1 vs L2 constraint regions]
Without regularization:
- Model fits training data perfectly
- Complex models with large weights
- High variance (overfitting)
- Poor generalization to new data
With regularization:
- Model balances fit and simplicity
- Smaller weights
- Lower variance (less overfitting)
- Better generalization
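The contrast above can be sketched on toy data. This is a minimal illustration, not a benchmark: the noisy sine data, the degree-9 polynomial features, the penalty value 0.01, and the helper names `poly_features`, `fit`, and `mse` are all assumptions made for this example. It fits the same model with and without an L2 penalty and reports the weight norm alongside training and test error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy sine curve, 15 training points, 200 test points.
x_train = np.sort(rng.uniform(-1, 1, 15))
x_test = np.linspace(-1, 1, 200)
y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
y_test = np.sin(np.pi * x_test)

def poly_features(x, degree=9):
    # Columns x^1 ... x^degree (the bias is handled by centering in fit()).
    return np.vander(x, degree + 1, increasing=True)[:, 1:]

def fit(X, y, lam):
    # Minimizes ||y - b - Xw||^2 + lam * ||w||^2; lam = 0 is plain least squares.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    if lam == 0:
        w = np.linalg.lstsq(Xc, yc, rcond=None)[0]
    else:
        w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def mse(X, y, w, b):
    return float(np.mean((X @ w + b - y) ** 2))

X_train, X_test = poly_features(x_train), poly_features(x_test)
results = {}
for lam in (0.0, 0.01):
    w, b = fit(X_train, y_train, lam)
    results[lam] = {
        "weight_norm": float(np.linalg.norm(w)),
        "train_mse": mse(X_train, y_train, w, b),
        "test_mse": mse(X_test, y_test, w, b),
    }
    print(lam, results[lam])
```

The unpenalized fit always has the lower training error and the larger weight norm. Whether its test error is worse depends on the data and on the penalty strength, which is why the strength is tuned by cross-validation.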
Ridge Regression (L2)
Definition: Ridge Regression (L2 Regularization)
Adds the squared magnitude of weights as penalty to the loss function. Shrinks weights toward zero but never exactly to zero.
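Ridge Loss
$$L_{\text{Ridge}} = \text{MSE} + \lambda \sum_{j=1}^{p} w_j^2$$
Here,
- $L_{\text{Ridge}}$ = Ridge loss
- $\text{MSE}$ = Mean Squared Error
- $\lambda$ = Regularization strength
- $w_j$ = Model weights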
[Figure: Coefficient shrinkage diagram]
When to Use Ridge
Use Ridge when you have many features that are all potentially useful. It prevents overfitting by shrinking all weights toward zero.
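The shrinking behaviour can be checked with the closed-form ridge solution. The sketch below is a minimal NumPy version under stated assumptions: it omits the intercept, uses the sum of squared errors rather than the mean (which only rescales the penalty strength), and runs on made-up data. The function name `ridge_closed_form` is hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve min_w ||y - Xw||^2 + lam * ||w||^2 (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical data: 4 features, the last one has no true effect.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 4))
true_w = np.array([3.0, -2.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for lam in lams:
    w = ridge_closed_form(X, y, lam)
    norms.append(float(np.linalg.norm(w)))
    print(f"lambda={lam:>7}: w={np.round(w, 3)}")
```

As the penalty grows, every weight moves toward zero and the overall norm falls, but no weight lands exactly on zero, including the one for the irrelevant feature.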
Lasso Regression (L1)
Definition: Lasso Regression (L1 Regularization)
Adds the absolute magnitude of weights as penalty. Can shrink weights to exactly zero, performing automatic feature selection.
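Lasso Loss
$$L_{\text{Lasso}} = \text{MSE} + \lambda \sum_{j=1}^{p} |w_j|$$
Here,
- $L_{\text{Lasso}}$ = Lasso loss
- $\text{MSE}$ = Mean Squared Error
- $\lambda$ = Regularization strength
- $w_j$ = Model weights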
Feature Selection
Lasso can shrink some weights to exactly zero, so the features left with non-zero weights form a selected subset. The fitted model is sparser and easier to interpret as a result.
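Why L1 produces exact zeros is easiest to see in a special case. Assuming orthonormal features (X^T X = I) and the objectives 0.5 * ||y - Xw||^2 + lam * sum|w| for Lasso and 0.5 * ||y - Xw||^2 + 0.5 * lam * sum(w^2) for Ridge, both solutions are simple functions of the least-squares weights. The sketch below applies them to a made-up weight vector; `soft_threshold` and `ridge_shrink` are illustrative names.

```python
import numpy as np

def soft_threshold(w, lam):
    """Lasso solution for orthonormal features: shrink by lam, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def ridge_shrink(w, lam):
    """Ridge solution for orthonormal features: uniform rescaling."""
    return w / (1.0 + lam)

# Hypothetical least-squares weights: two large, two small.
w_ols = np.array([3.0, -0.5, 0.2, -2.0])

w_lasso = soft_threshold(w_ols, lam=1.0)
w_ridge = ridge_shrink(w_ols, lam=1.0)

print("lasso:", w_lasso)
print("ridge:", w_ridge)
print("non-zero lasso weights:", np.count_nonzero(w_lasso))
```

Any weight whose magnitude is below the threshold is set to zero by the L1 rule, while the L2 rule only divides every weight by the same factor. Outside the orthonormal case there is no closed form for Lasso, but coordinate-descent solvers apply this same thresholding step one weight at a time.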
Elastic Net
Definition: Elastic Net
Combines both L1 (Lasso) and L2 (Ridge) penalties. Provides a balance between feature selection and weight shrinkage.
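Elastic Net Loss
$$L_{\text{ElasticNet}} = \text{MSE} + \lambda_1 \sum_{j=1}^{p} |w_j| + \lambda_2 \sum_{j=1}^{p} w_j^2$$
Here,
- $L_{\text{ElasticNet}}$ = Elastic Net loss
- $\text{MSE}$ = Mean Squared Error
- $\lambda_1$ = L1 regularization strength
- $\lambda_2$ = L2 regularization strength
- $w_j$ = Model weights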
[Figure: Regularization path]
When to Use Elastic Net
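Use Elastic Net when you have correlated features and need feature selection. Lasso alone tends to keep one feature from a correlated group and drop the others somewhat arbitrarily, while the added L2 penalty encourages correlated features to be kept or dropped together.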
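scikit-learn's `ElasticNet` does not take separate L1 and L2 strengths. It takes a total strength `alpha` and a mixing weight `l1_ratio`. The sketch below converts between the two views. It assumes the loss is written as MSE + l1 * sum|w| + l2 * sum(w^2), and that scikit-learn minimizes 1/(2n) * RSS + alpha * l1_ratio * sum|w| + 0.5 * alpha * (1 - l1_ratio) * sum(w^2), as its documentation states. The helper name `to_sklearn_params` is hypothetical.

```python
def to_sklearn_params(l1_strength, l2_strength):
    """Map separate L1/L2 strengths to ElasticNet's (alpha, l1_ratio).

    Assumes the loss MSE + l1 * sum|w| + l2 * sum(w^2). Halving it gives
    scikit-learn's form 1/(2n) * RSS + a * sum|w| + 0.5 * b * sum(w^2)
    with a = l1 / 2 and b = l2, where alpha = a + b and l1_ratio = a / (a + b).
    """
    a = l1_strength / 2.0
    b = l2_strength
    alpha = a + b
    if alpha <= 0:
        raise ValueError("at least one penalty strength must be positive")
    return alpha, a / alpha

alpha, l1_ratio = to_sklearn_params(0.2, 0.1)
print(f"alpha={alpha:.3f}, l1_ratio={l1_ratio:.3f}")

# The result would be passed on as:
#   ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
```

With `l1_ratio=1.0` the model reduces to Lasso, and as `l1_ratio` approaches 0 it approaches Ridge, so the mixing weight is a second hyperparameter to tune alongside `alpha`.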
Python Implementation
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
import numpy as np

# X (feature matrix) and y (target vector) are assumed to be defined already.

# Ridge: 5-fold cross-validated R^2
ridge = Ridge(alpha=1.0)
ridge_scores = cross_val_score(ridge, X, y, cv=5, scoring='r2')

# Lasso
lasso = Lasso(alpha=0.1)
lasso_scores = cross_val_score(lasso, X, y, cv=5, scoring='r2')
# cross_val_score fits clones of the estimator, so fit lasso itself before reading coef_
lasso.fit(X, y)
print(f"Lasso selected {np.sum(lasso.coef_ != 0)} features")

# Elastic Net
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet_scores = cross_val_score(enet, X, y, cv=5, scoring='r2')
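The snippet above leaves `X` and `y` undefined and does not scale the features. A self-contained variant might look like the following sketch, which uses synthetic data from `make_regression` as a stand-in (the sample counts and noise level are arbitrary choices) and wraps the model in a pipeline so that scaling is refit inside each cross-validation fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 200 samples, 20 features, only 5 informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Scaling lives inside the pipeline so each CV fold fits its own scaler.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean R^2: {scores.mean():.3f}")

# cross_val_score fits clones, so fit the pipeline itself before reading coef_.
model.fit(X, y)
coef = model.named_steps["lasso"].coef_
print(f"Lasso kept {np.count_nonzero(coef)} of {coef.size} features")
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training, which would happen if the whole dataset were scaled once up front.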
Choosing Alpha
- α = 0: No regularization (original model)
- α → ∞: All weights shrink to 0 (trivial model)
Use cross-validation to find the optimal α:
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
for a in alphas:
    model = Ridge(alpha=a)
    # Mean 5-fold score; for regressors the default score is R^2
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"α={a}: {score:.3f}")
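The manual loop can also be handed to scikit-learn's `GridSearchCV`, which runs the same search and records the best setting. The sketch below is one possible arrangement, assuming synthetic data from `make_regression` and a scaling step in the pipeline. The step names `scale` and `ridge` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

alphas = [0.001, 0.01, 0.1, 1, 10, 100]

# Scale, then fit Ridge; "ridge__alpha" addresses the alpha of the "ridge" step.
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
search = GridSearchCV(pipe, {"ridge__alpha": alphas}, cv=5, scoring="r2")
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
print(f"best alpha: {best_alpha}, mean CV R^2: {search.best_score_:.3f}")
```

A logarithmic grid like this one is a common starting point. If the best value sits at either end of the grid, it is usually worth extending the grid in that direction and searching again.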
Key Takeaways
Summary: Regularization
- Regularization prevents overfitting by penalizing complexity
- Ridge (L2) shrinks weights — good when all features matter
- Lasso (L1) performs feature selection — zeros out irrelevant features
- Elastic Net combines both — good default choice
- Cross-validation is essential for choosing alpha
- Scale features before regularization (penalty is scale-dependent)
- Regularization matters most for high-dimensional data, where the number of features is large relative to the number of samples
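- Tree-based models have no linear coefficients to penalize, so they control complexity in other ways (for example depth limits, pruning, or minimum leaf sizes)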
What to Learn Next
- Linear Regression: Understand the foundational model where Ridge and Lasso regularization are applied.
- Logistic Regression: Extend regularization to classification problems with penalized logistic models.
- Model Evaluation: Learn cross-validation techniques for selecting the optimal regularization strength.
- Model Selection: Compare algorithms and tune hyperparameters including regularization parameters.
- Training Deep Networks: Apply dropout, weight decay, and batch normalization as regularization in deep learning.
- SVM: Explore maximum margin classifiers that implicitly use L2 regularization.