Ridge Regression (L2 Regularization)


Shrinking Coefficients to Prevent Overfitting

Ridge regression adds an L2 penalty to the OLS objective, shrinking coefficients toward zero. This reduces variance at the cost of small bias, improving generalization when multicollinearity exists or predictors outnumber observations.

Genomics — Handle thousands of gene predictors with limited samples
Finance — Stabilize portfolio weight estimates with correlated assets
Text Mining — Regularize high-dimensional term frequency features

The tuning parameter lambda controls the bias-variance tradeoff along a continuous path.
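The bias-variance tradeoff described above can be illustrated with a small simulation. This is a minimal sketch, not part of the original guide: the helper `ridge_fit`, the two-column correlated design, and the value `lam = 5.0` are all assumptions chosen for illustration. It refits OLS (lambda = 0) and ridge on many noisy copies of the same response and compares how much the coefficient estimates vary.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, lam = 40, 5.0                 # illustrative sample size and penalty
true_beta = np.array([1.0, 1.0])     # made-up "true" coefficients

# Fixed design with two highly correlated columns
x0 = rng.normal(size=n_obs)
X_demo = np.column_stack([x0, x0 + 0.1 * rng.normal(size=n_obs)])

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate; lam = 0 gives OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols_fits, ridge_fits = [], []
for _ in range(2000):
    # New noise each round, same design and same true coefficients
    y_sim = X_demo @ true_beta + rng.normal(size=n_obs)
    ols_fits.append(ridge_fit(X_demo, y_sim, 0.0))
    ridge_fits.append(ridge_fit(X_demo, y_sim, lam))
ols_fits, ridge_fits = np.array(ols_fits), np.array(ridge_fits)

# Total variance of the estimates across simulations
ols_var = ols_fits.var(axis=0).sum()
ridge_var = ridge_fits.var(axis=0).sum()
# Distance between the average estimate and the truth (empirical bias)
ols_bias = np.linalg.norm(ols_fits.mean(axis=0) - true_beta)
ridge_bias = np.linalg.norm(ridge_fits.mean(axis=0) - true_beta)

print(f"OLS   variance: {ols_var:.3f}, bias: {ols_bias:.3f}")
print(f"Ridge variance: {ridge_var:.3f}, bias: {ridge_bias:.3f}")
```

With correlated columns the OLS estimates swing widely from one simulated sample to the next, while the ridge estimates stay much closer together. How much bias ridge pays for this depends on the true coefficients and on lambda.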

Formally, ridge regression minimizes the sum of squared residuals plus an L2 penalty on the coefficients:

Ridge Regression Objective

$$\text{Minimize: } \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2$$

Here,

$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ = Sum of squared residuals
$\lambda\|\boldsymbol{\beta}\|^2$ = L2 penalty term
$\lambda$ = Regularization parameter (tuning parameter)
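To make the two terms of the objective concrete, here is a minimal sketch that evaluates them on a tiny hand-checkable example. The function name `ridge_objective` and the numbers are made up for illustration.

```python
import numpy as np

def ridge_objective(X, y, beta, lam):
    """Sum of squared residuals plus the L2 penalty lam * ||beta||^2."""
    residuals = y - X @ beta
    rss = residuals @ residuals        # ||y - X beta||^2
    penalty = lam * (beta @ beta)      # lam * ||beta||^2
    return rss + penalty

# Made-up data: identity design so the arithmetic is easy to follow
X_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
y_demo = np.array([1.0, 2.0])
beta_demo = np.array([1.0, 1.0])

# Residuals are [0, 1], so the squared-error term is 1; ||beta||^2 is 2
value_no_penalty = ridge_objective(X_demo, y_demo, beta_demo, lam=0.0)    # 1.0
value_with_penalty = ridge_objective(X_demo, y_demo, beta_demo, lam=0.5)  # 1.0 + 0.5 * 2 = 2.0
print(value_no_penalty, value_with_penalty)
```

The same coefficient vector costs more as lambda grows, which is what pushes the minimizer toward smaller coefficients.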

Ridge Estimator

$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$

Here,

$\hat{\boldsymbol{\beta}}_{\text{ridge}}$ = Ridge estimator
$\lambda\mathbf{I}$ = Regularization term added to the diagonal of $\mathbf{X}^T\mathbf{X}$
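The closed-form estimator can be checked numerically against scikit-learn. The sketch below uses made-up data and an arbitrary `lam = 3.0`. It assumes no intercept (`fit_intercept=False`), so that scikit-learn's `Ridge` minimizes exactly the objective written above, with `alpha` playing the role of lambda.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)
lam = 3.0  # arbitrary illustrative penalty

# Closed form: solve (X^T X + lam I) beta = X^T y instead of forming the inverse
A = X_demo.T @ X_demo + lam * np.eye(X_demo.shape[1])
beta_closed = np.linalg.solve(A, X_demo.T @ y_demo)

# scikit-learn solves the same problem when the intercept is switched off
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X_demo, y_demo).coef_

print(np.allclose(beta_closed, beta_sklearn))
```

Solving the linear system with `np.linalg.solve` is generally preferred over computing the inverse explicitly, since it is cheaper and numerically more stable.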


import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

np.random.seed(42)
n, p = 100, 20  # 20 features; features 0-2 are strongly correlated
X = np.random.randn(n, p)
# Introduce correlations: features 1 and 2 are noisy copies of feature 0
X[:, 1] = X[:, 0] + np.random.randn(n)*0.3
X[:, 2] = X[:, 0] + np.random.randn(n)*0.3
# True model: only first 5 features matter
true_beta = np.array([2,-1.5,1,0.5,-0.8] + [0]*15)
y = X @ true_beta + np.random.randn(n)*2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Ridge path: refit over a log-spaced grid of λ (called alpha in scikit-learn)
lambdas = np.logspace(-3, 4, 100)
coef_path = []
for lam in lambdas:
    ridge = Pipeline([('scaler', StandardScaler()),
                      ('ridge', Ridge(alpha=lam))])
    ridge.fit(X_train, y_train)
    coef_path.append(ridge.named_steps['ridge'].coef_)

coef_path = np.array(coef_path)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for j in range(p):
    axes[0].plot(np.log10(lambdas), coef_path[:, j],
                 alpha=0.7, linewidth=1.5,
                 color='red' if j < 5 else 'lightblue')
axes[0].set_xlabel('log10(λ)')
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('Ridge Coefficient Path\n(Red = true predictors)')
axes[0].axvline(0, color='black', linestyle='--', alpha=0.5)  # reference line at λ = 1

# Cross-validation to select λ
ridge_cv = Pipeline([('scaler', StandardScaler()),
                     ('ridge', RidgeCV(alphas=lambdas, cv=5, scoring='neg_mean_squared_error'))])
ridge_cv.fit(X_train, y_train)
best_lambda = ridge_cv.named_steps['ridge'].alpha_

# Recompute the CV error curve over the same grid for plotting
cv_scores = []
for lam in lambdas:
    model = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge(alpha=lam))])
    # The scorer name is 'neg_mean_squared_error'; the sign is flipped below
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()
    cv_scores.append(-score)

best_idx = np.argmin(cv_scores)  # grid index of the minimum of the plotted curve
axes[1].plot(np.log10(lambdas), cv_scores, 'b-', linewidth=2)
axes[1].axvline(np.log10(best_lambda), color='red', linestyle='--',
               label=f'Best λ={best_lambda:.3f}')
axes[1].set_xlabel('log10(λ)')
axes[1].set_ylabel('CV MSE')
axes[1].set_title('Ridge CV: Selecting Optimal λ')
axes[1].legend()

plt.tight_layout()
plt.savefig('ridge_regression.png', dpi=150)
plt.show()

# Compare OLS vs Ridge
ols = Pipeline([('scaler', StandardScaler()), ('ols', LinearRegression())])
best_ridge = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge(alpha=best_lambda))])

ols.fit(X_train, y_train)
best_ridge.fit(X_train, y_train)

print(f"Best λ: {best_lambda:.4f}")
print(f"OLS   - Train MSE: {np.mean((y_train - ols.predict(X_train))**2):.3f}, Test MSE: {np.mean((y_test - ols.predict(X_test))**2):.3f}")
print(f"Ridge - Train MSE: {np.mean((y_train - best_ridge.predict(X_train))**2):.3f}, Test MSE: {np.mean((y_test - best_ridge.predict(X_test))**2):.3f}")

λ = 0 vs λ → ∞

When λ = 0, Ridge reduces to OLS. As λ → ∞, all coefficients shrink toward zero.
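Both limits can be checked directly with the closed-form estimator. This is a minimal sketch on made-up data: the helper `ridge_closed_form` is a hypothetical name, and `1e8` simply stands in for "very large".

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X^T X + lam I)^{-1} X^T y, computed via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=60)

# Lower limit: lam = 0 reproduces the ordinary least squares solution
beta_ols, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
beta_zero = ridge_closed_form(X_demo, y_demo, lam=0.0)
print(np.allclose(beta_zero, beta_ols))

# Upper limit: the coefficient norm shrinks steadily as lam grows
lams = [0.0, 1.0, 100.0, 1e8]
norms = [np.linalg.norm(ridge_closed_form(X_demo, y_demo, lam)) for lam in lams]
for lam, norm in zip(lams, norms):
    print(f"lam = {lam:>11.1f}  ||beta|| = {norm:.6f}")
```

The coefficients approach zero but never reach it exactly for any finite lambda, which is why ridge does not perform feature selection.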

Key Takeaways

Summary: Ridge Regression

Ridge adds the L2 penalty λΣβⱼ², which shrinks all coefficients toward zero but rarely to exactly zero
λ = 0: OLS; λ → ∞: all coefficients → 0
Handles multicollinearity: (XᵀX + λI) is invertible for any λ > 0
Select λ via cross-validation; RidgeCV does this efficiently
Standardize features before fitting ridge: the penalty treats all coefficients equally, so it is only fair when features share a common scale
Ridge for multicollinearity; Lasso for feature selection
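The invertibility point in the summary can be seen in a case with more predictors than observations. This is a minimal sketch with made-up dimensions (10 observations, 25 features) and an arbitrary `lam = 0.1`. Here XᵀX is rank-deficient, yet adding λ to its diagonal lifts every eigenvalue by λ and restores full rank.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_feat = 10, 25          # more predictors than observations
X_demo = rng.normal(size=(n_obs, n_feat))
gram = X_demo.T @ X_demo        # 25 x 25 matrix with rank at most 10

lam = 0.1                       # arbitrary positive penalty
ridge_matrix = gram + lam * np.eye(n_feat)

rank_gram = np.linalg.matrix_rank(gram)
rank_ridge = np.linalg.matrix_rank(ridge_matrix)
# Eigenvalues of a symmetric matrix; the zero eigenvalues of gram become lam
smallest_eig = np.linalg.eigvalsh(ridge_matrix).min()

print(f"rank(X^T X)         = {rank_gram} of {n_feat}")
print(f"rank(X^T X + lam I) = {rank_ridge} of {n_feat}")
print(f"smallest eigenvalue of X^T X + lam I = {smallest_eig:.6f}")
```

Because XᵀX is positive semidefinite, its eigenvalues are never negative, so the shifted matrix has all eigenvalues at least λ and the ridge system always has a unique solution when λ > 0.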
