Preventing Overfitting — Ridge, Lasso, and Elastic Net

Regularization constrains model complexity by adding penalty terms to the loss function, helping models generalize better to unseen data.

Ridge (L2) — shrinks weights toward zero to prevent overfitting when all features are potentially useful
Lasso (L1) — zeros out irrelevant features, performing automatic feature selection
Elastic Net — combines both penalties for the best of Ridge and Lasso

"Simplicity is the ultimate sophistication." — Leonardo da Vinci

Regularization — Complete Guide

Regularization prevents overfitting by adding a penalty term to the loss function, constraining model complexity.

The Problem

Definition: Overfitting

Overfitting occurs when a model learns the training data too well, including noise and random fluctuations, resulting in poor performance on new, unseen data.

[Figure: L1 vs L2 constraint regions]

Without regularization:
  Model fits training data perfectly
  Complex models with large weights
  High variance (overfitting)
  Poor generalization to new data

With regularization:
  Model balances fit and simplicity
  Smaller weights
  Lower variance (less overfitting)
  Better generalization
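
The contrast above can be seen in a small experiment. This is a minimal sketch on synthetic data, not a benchmark: the data-generating setup (80 samples, 30 features, 5 of them informative) and the choice of `alpha=10.0` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 30 features, only the first 5 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 30))
true_w = np.zeros(30)
true_w[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ true_w + rng.normal(scale=1.0, size=80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

def fit_and_report(model):
    """Fit on the training split and collect train/test error and weight size."""
    model.fit(X_train, y_train)
    return {
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
        "weight_norm": float(np.linalg.norm(model.coef_)),
    }

ols = fit_and_report(LinearRegression())   # no regularization
ridge = fit_and_report(Ridge(alpha=10.0))  # L2 penalty, arbitrary strength

for name, r in [("OLS", ols), ("Ridge", ridge)]:
    print(f"{name:<6} train MSE={r['train_mse']:.3f} "
          f"test MSE={r['test_mse']:.3f} ||w||={r['weight_norm']:.3f}")
```

The unregularized fit always has the lower training error and the larger weight norm. Whether the penalized model also wins on the test split depends on the data and on the penalty strength, which is why the strength is tuned rather than fixed.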

Ridge Regression (L2)

Definition: Ridge Regression (L2 Regularization)

Adds the squared magnitude of weights as penalty to the loss function. Shrinks weights toward zero but never exactly to zero.

Ridge Loss

$$L_{\text{Ridge}} = MSE + \alpha \sum_{i=1}^{n} w_i^2$$

Here,

$L_{\text{Ridge}}$ = Ridge loss
$MSE$ = Mean Squared Error
$\alpha$ = Regularization strength
$w_i$ = Model weights
$n$ = Number of weights (one per feature)
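
For a linear model without an intercept, the Ridge loss above has a closed-form minimizer: setting the gradient to zero gives $(X^T X + m \alpha I) w = X^T y$, where $m$ is the number of samples (the factor $m$ appears because MSE averages over samples). The sketch below implements that under those assumptions; the helper names `ridge_loss` and `ridge_closed_form` are illustrative, not part of any library.

```python
import numpy as np

def ridge_loss(w, X, y, alpha):
    """MSE + alpha * sum(w_i^2), matching the Ridge loss above (no intercept)."""
    residual = y - X @ w
    return np.mean(residual ** 2) + alpha * np.sum(w ** 2)

def ridge_closed_form(X, y, alpha):
    """Minimizer of ridge_loss: solve (X^T X + m * alpha * I) w = X^T y."""
    n_samples, n_features = X.shape
    A = X.T @ X + n_samples * alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (assumption): 3 features with known weights plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

for alpha in [0.0, 0.1, 1.0, 10.0]:
    w = ridge_closed_form(X, y, alpha)
    print(f"alpha={alpha:<5} weights={np.round(w, 3)} "
          f"loss={ridge_loss(w, X, y, alpha):.4f}")
```

As `alpha` grows, the printed weights move toward zero together, but none of them lands on exactly zero.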

[Figure: Coefficient shrinkage diagram]

When to Use Ridge

Use Ridge when you have many features that are all potentially useful. It prevents overfitting by shrinking all weights toward zero.

Lasso Regression (L1)

Definition: Lasso Regression (L1 Regularization)

Adds the absolute magnitude of weights as penalty. Can shrink weights to exactly zero, performing automatic feature selection.

Lasso Loss

$$L_{\text{Lasso}} = MSE + \alpha \sum_{i=1}^{n} |w_i|$$

Here,

$L_{\text{Lasso}}$ = Lasso loss
$MSE$ = Mean Squared Error
$\alpha$ = Regularization strength
$w_i$ = Model weights
$n$ = Number of weights (one per feature)

Feature Selection

Lasso can shrink some weights to exactly zero, effectively selecting a subset of features. This makes it useful for feature selection.
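
One way to see why the L1 penalty produces exact zeros while the L2 penalty does not is to look at a single weight in isolation. The sketch below compares the minimizers of two one-dimensional problems; `z` stands for a hypothetical unpenalized weight and `lam` for the penalty strength, both chosen only for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * |w| (the L1 case)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * w**2 (the L2 case)."""
    return z / (1.0 + 2.0 * lam)

z = np.array([3.0, -0.5, 0.2, -2.0])  # hypothetical unpenalized weights
lam = 1.0

print("L1 (soft threshold):", soft_threshold(z, lam))
print("L2 (proportional):  ", ridge_shrink(z, lam))
```

The L1 rule subtracts a fixed amount and clips at zero, so any weight whose magnitude is below `lam` is removed entirely. The L2 rule divides every weight by the same factor, so small weights get smaller but stay nonzero.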

Elastic Net

Definition: Elastic Net

Combines both L1 (Lasso) and L2 (Ridge) penalties. Provides a balance between feature selection and weight shrinkage.

Elastic Net Loss

$$L_{\text{Elastic Net}} = MSE + \alpha_1 \sum_{i=1}^{n} |w_i| + \alpha_2 \sum_{i=1}^{n} w_i^2$$

Here,

$L_{\text{Elastic Net}}$ = Elastic Net loss
$MSE$ = Mean Squared Error
$\alpha_1$ = L1 regularization strength
$\alpha_2$ = L2 regularization strength
$w_i$ = Model weights
$n$ = Number of weights (one per feature)
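
scikit-learn's `ElasticNet` is parameterized by `alpha` and `l1_ratio` rather than by $\alpha_1$ and $\alpha_2$. Its documented objective is `1 / (2 * n_samples) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2`. Multiplying by 2 (which does not change the minimizer) and matching terms with the loss above gives $\alpha_1 = 2 \cdot \text{alpha} \cdot \text{l1\_ratio}$ and $\alpha_2 = \text{alpha} \cdot (1 - \text{l1\_ratio})$. The two helper functions below are illustrative names for that conversion, not library functions.

```python
def to_sklearn_params(alpha_1, alpha_2):
    """Map (alpha_1, alpha_2) from the loss above to (alpha, l1_ratio)."""
    if alpha_1 < 0 or alpha_2 < 0 or (alpha_1 == 0 and alpha_2 == 0):
        raise ValueError("need non-negative strengths, at least one positive")
    alpha = alpha_1 / 2.0 + alpha_2
    l1_ratio = (alpha_1 / 2.0) / alpha
    return alpha, l1_ratio

def from_sklearn_params(alpha, l1_ratio):
    """Map scikit-learn's (alpha, l1_ratio) back to (alpha_1, alpha_2)."""
    return 2.0 * alpha * l1_ratio, alpha * (1.0 - l1_ratio)

alpha, l1_ratio = to_sklearn_params(alpha_1=0.1, alpha_2=0.05)
print(f"alpha={alpha:.3f}, l1_ratio={l1_ratio:.3f}")
print(from_sklearn_params(alpha, l1_ratio))
```

With `l1_ratio=1.0` the L2 term vanishes and the model reduces to Lasso. Note that scikit-learn's `Ridge` scales its loss differently (it uses the sum of squared errors, not a mean), so an `alpha` value is not directly comparable between `Ridge` and `Lasso` or `ElasticNet`.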

[Figure: Regularization path]

When to Use Elastic Net

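Use Elastic Net when you have correlated features and need feature selection. Lasso alone tends to keep one feature from a correlated group and drop the others, while the added L2 term lets correlated features share weight, so Elastic Net keeps sparsity without that instability.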

Python Implementation

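from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
import numpy as np

# Synthetic stand-in data; replace X and y with your own features and target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Ridge: 5-fold cross-validated R^2
ridge = Ridge(alpha=1.0)
ridge_scores = cross_val_score(ridge, X, y, cv=5, scoring='r2')
print(f"Ridge mean R^2: {ridge_scores.mean():.3f}")

# Lasso
lasso = Lasso(alpha=0.1)
lasso_scores = cross_val_score(lasso, X, y, cv=5, scoring='r2')
print(f"Lasso mean R^2: {lasso_scores.mean():.3f}")
# cross_val_score fits clones, so fit the estimator itself before reading coef_
lasso.fit(X, y)
print(f"Lasso selected {np.sum(lasso.coef_ != 0)} features")

# Elastic Net: l1_ratio=0.5 weights the L1 and L2 penalties equally
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet_scores = cross_val_score(enet, X, y, cv=5, scoring='r2')
print(f"Elastic Net mean R^2: {enet_scores.mean():.3f}")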

Choosing Alpha

α = 0: No regularization (the original, unpenalized model)
α → ∞: All weights → 0 (trivial model)

Use cross-validation to find the optimal α:

# Assumes X, y, Ridge, and cross_val_score from the Python Implementation section
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
for a in alphas:
    model = Ridge(alpha=a)
    # Default scoring for regressors is R^2; average it over the 5 folds
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"α={a}: {score:.3f}")
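
The manual loop above can also be expressed with scikit-learn's built-in `RidgeCV`, which runs the same search internally. The sketch below additionally wraps the model in a pipeline with `StandardScaler` so that scaling is refit inside each fold. The synthetic dataset from `make_regression` is an assumption used only to make the example self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with your own X and y.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

alphas = [0.001, 0.01, 0.1, 1, 10, 100]

# Scale features, then pick alpha by 5-fold cross-validation over the grid.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X, y)

# make_pipeline names each step after its lowercased class name.
best_alpha = model.named_steps["ridgecv"].alpha_
print(f"Selected alpha: {best_alpha}")
print(f"Training R^2 at the selected alpha: {model.score(X, y):.3f}")
```

`LassoCV` and `ElasticNetCV` play the same role for the other two penalties. The grid itself is a judgment call; a logarithmic spacing like the one above is a common starting point.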

Key Takeaways

Summary: Regularization

Regularization prevents overfitting by penalizing complexity
Ridge (L2) shrinks weights — good when all features matter
Lasso (L1) performs feature selection — zeros out irrelevant features
Elastic Net combines both — good default choice
Cross-validation is essential for choosing alpha
Scale features before regularization (penalty is scale-dependent)
Regularization is crucial for high-dimensional data
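Tree-based models have no coefficients to shrink, so Ridge- and Lasso-style penalties do not apply to them directly; they control complexity through depth limits, minimum leaf sizes, and pruning instead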
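
The takeaway about scaling can be checked directly. In this minimal sketch, one feature is multiplied by 100 (as if its unit had changed) and the fitted values are compared before and after. The closed-form solver and the synthetic data are assumptions for illustration; the model has no intercept.

```python
import numpy as np

def fit_predict(X, y, alpha):
    """Fit MSE + alpha * ||w||^2 in closed form (no intercept); return fitted values."""
    n_samples, n_features = X.shape
    A = X.T @ X + n_samples * alpha * np.eye(n_features)
    w = np.linalg.solve(A, X.T @ y)
    return X @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(scale=0.1, size=100)

X_rescaled = X.copy()
X_rescaled[:, 0] *= 100.0  # same information, expressed in a different unit

# Without a penalty, rescaling a column leaves the fitted values unchanged.
unpenalized_same = np.allclose(fit_predict(X, y, 0.0),
                               fit_predict(X_rescaled, y, 0.0))
# With a penalty, the rescaled column is effectively penalized far less.
ridge_same = np.allclose(fit_predict(X, y, 1.0),
                         fit_predict(X_rescaled, y, 1.0))

print(f"Unpenalized fit unchanged by rescaling: {unpenalized_same}")
print(f"Ridge fit unchanged by rescaling:       {ridge_same}")
```

A feature measured in large units needs only a small weight, so the penalty barely touches it. Standardizing features first puts every weight on the same footing.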

What to Learn Next

-> Linear Regression: Understand the foundational model where Ridge and Lasso regularization are applied.

-> Logistic Regression: Extend regularization to classification problems with penalized logistic models.

-> Model Evaluation: Learn cross-validation techniques for selecting the optimal regularization strength.

-> Model Selection: Compare algorithms and tune hyperparameters including regularization parameters.

-> Training Deep Networks: Apply dropout, weight decay, and batch normalization as regularization in deep learning.

-> SVM: Explore maximum margin classifiers that implicitly use L2 regularization.
