
Ridge Regression (L2 Regularization) — Complete Guide

Shrinking Coefficients to Prevent Overfitting

Ridge regression adds an L2 penalty to the ordinary least squares (OLS) objective, shrinking coefficients toward zero. This reduces variance at the cost of a small bias, which tends to improve generalization when predictors are highly correlated (multicollinearity) or when predictors outnumber observations.

  • Genomics — Handle thousands of gene predictors with limited samples

  • Finance — Stabilize portfolio weight estimates with correlated assets

  • Text Mining — Regularize high-dimensional term frequency features

The tuning parameter λ controls the bias-variance tradeoff: larger values shrink the coefficients more, and the estimates change continuously as λ varies.
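To make the tradeoff concrete, here is a small simulation sketch. It borrows the closed-form ridge estimator stated later in this lesson and fits it without an intercept. The sample size, noise level, coefficient vector, and λ grid are illustrative assumptions, not values taken from the lesson's main example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 5, 2.0
true_beta = np.array([2.0, -1.5, 1.0, 0.5, -0.8])

# Fixed design: columns 1 and 2 are noisy copies of column 0 (multicollinearity)
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.3 * rng.standard_normal(n)
X[:, 2] = X[:, 0] + 0.3 * rng.standard_normal(n)

# 2000 simulated responses from the same true model, reused for every lambda
Y = X @ true_beta + sigma * rng.standard_normal((2000, n))

results = {}
for lam in [0.0, 1.0, 10.0, 100.0]:
    # Closed-form ridge fit for all simulated responses at once (no intercept)
    A = X.T @ X + lam * np.eye(p)
    estimates = np.linalg.solve(A, X.T @ Y.T).T  # shape (2000, p)
    sq_bias = np.sum((estimates.mean(axis=0) - true_beta) ** 2)
    variance = np.sum(estimates.var(axis=0))
    results[lam] = (sq_bias, variance)
    print(f"lambda={lam:6.1f}  squared bias={sq_bias:.4f}  variance={variance:.4f}")
```

In this setup the total variance of the estimates falls as λ grows, while the squared bias at large λ is well above its near-zero value at λ = 0.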


Formally, ridge regression minimizes the sum of squared residuals plus an L2 penalty on the coefficients:

Ridge Regression Objective

$$\text{Minimize: } \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2$$

Here,

  • $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ = Sum of squared residuals
  • $\lambda\|\boldsymbol{\beta}\|^2$ = L2 penalty term
  • $\lambda$ = Regularization parameter (tuning parameter)
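As a minimal sketch of how the two terms combine, the objective can be evaluated directly for a candidate coefficient vector. The helper name `ridge_objective` and the tiny dataset are hypothetical, chosen so the arithmetic can be checked by hand.

```python
import numpy as np

def ridge_objective(X, y, beta, lam):
    """Return (rss, penalty, total) for the ridge objective."""
    residuals = y - X @ beta
    rss = float(residuals @ residuals)    # ||y - X beta||^2
    penalty = float(lam * (beta @ beta))  # lambda * ||beta||^2
    return rss, penalty, rss + penalty

# Illustrative data: beta = (1, 2) reproduces y exactly, so its RSS is 0
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])

rss, penalty, total = ridge_objective(X, y, beta, lam=0.5)
print(rss, penalty, total)  # 0.0 2.5 2.5

# Shrinking beta by 10% adds a little RSS but removes more penalty
rss_s, penalty_s, total_s = ridge_objective(X, y, 0.9 * beta, lam=0.5)
print(total_s < total)  # True
```

The exact-fit vector pays the full penalty, while the slightly shrunken vector trades a small increase in residual error for a larger drop in the penalty. That is the mechanism by which ridge pulls coefficients toward zero.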

Ridge Estimator

$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$

Here,

  • $\hat{\boldsymbol{\beta}}_{\text{ridge}}$ = Ridge estimator
  • $\lambda\mathbf{I}$ = Regularization term added to the diagonal of $\mathbf{X}^T\mathbf{X}$
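The estimator can be computed in a few lines of NumPy. The sketch below assumes no intercept (or data that have already been centered), and the function name `ridge_closed_form` is hypothetical. It solves the linear system rather than forming the inverse, then checks the first-order condition of the objective, $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \lambda\hat{\boldsymbol{\beta}}$.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y without an explicit inverse."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data (illustrative values only)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.standard_normal(50)

lam = 3.0
beta_ridge = ridge_closed_form(X, y, lam)

# At the minimizer the gradient vanishes: X^T (y - X beta) - lam * beta = 0
gradient_gap = X.T @ (y - X @ beta_ridge) - lam * beta_ridge
print(np.max(np.abs(gradient_gap)))  # numerically zero
```

With scikit-learn, `Ridge(alpha=lam, fit_intercept=False)` should give the same coefficients up to solver tolerance, since its `alpha` plays the role of λ in this objective.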

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline



np.random.seed(42)
n, p = 100, 20  # 100 samples, 20 features
X = np.random.randn(n, p)
# Introduce correlations: features 1 and 2 are noisy copies of feature 0
X[:, 1] = X[:, 0] + np.random.randn(n)*0.3
X[:, 2] = X[:, 0] + np.random.randn(n)*0.3
# True model: only first 5 features matter
true_beta = np.array([2,-1.5,1,0.5,-0.8] + [0]*15)
y = X @ true_beta + np.random.randn(n)*2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



# Ridge path: refit over a log-spaced grid of λ and record the coefficients
lambdas = np.logspace(-3, 4, 100)
coef_path = []
for lam in lambdas:
    ridge = Pipeline([('scaler', StandardScaler()),
                      ('ridge', Ridge(alpha=lam))])
    ridge.fit(X_train, y_train)
    coef_path.append(ridge.named_steps['ridge'].coef_)

coef_path = np.array(coef_path)  # shape: (len(lambdas), p)



fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for j in range(p):
    axes[0].plot(np.log10(lambdas), coef_path[:, j],
                 alpha=0.7, linewidth=1.5,
                 color='red' if j < 5 else 'lightblue')
axes[0].set_xlabel('log10(λ)')
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('Ridge Coefficient Path\n(Red = true predictors)')
axes[0].axvline(0, color='black', linestyle='--', alpha=0.5)  # reference line at λ = 1



# Cross-validation to select λ
ridge_cv = Pipeline([('scaler', StandardScaler()),
                     ('ridge', RidgeCV(alphas=lambdas, cv=5, scoring='neg_mean_squared_error'))])
ridge_cv.fit(X_train, y_train)
best_lambda = ridge_cv.named_steps['ridge'].alpha_

# Recompute the CV curve manually so it can be plotted
cv_scores = []
for lam in lambdas:
    model = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge(alpha=lam))])
    # The scorer name is 'neg_mean_squared_error'; 'neg_mse' is not a valid scorer
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()
    cv_scores.append(-score)  # flip the sign to get a positive MSE

# Index of the lowest manual CV MSE (may differ slightly from RidgeCV's pick,
# because that pipeline scales once before RidgeCV's internal CV)
best_idx = np.argmin(cv_scores)
axes[1].plot(np.log10(lambdas), cv_scores, 'b-', linewidth=2)
axes[1].axvline(np.log10(best_lambda), color='red', linestyle='--',
               label=f'Best λ={best_lambda:.3f}')
axes[1].set_xlabel('log10(λ)')
axes[1].set_ylabel('CV MSE')
axes[1].set_title('Ridge CV: Selecting Optimal λ')
axes[1].legend()



plt.tight_layout()
plt.savefig('ridge_regression.png', dpi=150)
plt.show()

# Compare OLS vs Ridge
ols = Pipeline([('scaler', StandardScaler()), ('ols', LinearRegression())])
best_ridge = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge(alpha=best_lambda))])

ols.fit(X_train, y_train)
best_ridge.fit(X_train, y_train)

print(f"Best λ: {best_lambda:.4f}")
print(f"OLS   - Train MSE: {np.mean((y_train - ols.predict(X_train))**2):.3f}, Test MSE: {np.mean((y_test - ols.predict(X_test))**2):.3f}")
print(f"Ridge - Train MSE: {np.mean((y_train - best_ridge.predict(X_train))**2):.3f}, Test MSE: {np.mean((y_test - best_ridge.predict(X_test))**2):.3f}")

λ = 0 vs λ → ∞

When λ = 0, ridge reduces to OLS. As λ → ∞, all coefficients shrink toward zero.
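Both limits can be checked numerically. The following sketch uses the closed-form estimator without an intercept on synthetic data. The helper name `ridge_closed_form`, the data, and the λ grid are illustrative assumptions.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate from the normal equations (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5))
y = X @ np.array([2.0, -1.5, 1.0, 0.5, -0.8]) + rng.standard_normal(60)

# lambda = 0 reproduces the least-squares solution
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_zero = ridge_closed_form(X, y, 0.0)
print(np.allclose(beta_ols, beta_zero))  # True

# The coefficient norm shrinks steadily as lambda grows
lams = [0.0, 1.0, 100.0, 1e4, 1e8]
norms = [np.linalg.norm(ridge_closed_form(X, y, lam)) for lam in lams]
for lam, norm in zip(lams, norms):
    print(f"lambda={lam:g}  ||beta||={norm:.6f}")
```

The coefficients approach zero only in the limit. For any finite λ they stay nonzero, which is why ridge shrinks coefficients but does not select features.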


Key Takeaways

Summary: Ridge Regression

  • Ridge adds the L2 penalty λΣβⱼ², which shrinks all coefficients toward zero but rarely to exactly zero

  • λ = 0: OLS; λ → ∞: all coefficients → 0

  • Solves multicollinearity: (XᵀX + λI) is invertible for any λ > 0

  • Select λ via cross-validation; RidgeCV does this efficiently

  • Standardize features before ridge: the penalty treats all coefficients equally, so feature scale affects the fit

  • Ridge for multicollinearity; Lasso for feature selection
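Two of the takeaways above, invertibility and the need to standardize, can be checked with a short NumPy sketch. The dimensions, the λ values, and the helper names `ridge_fit_predict` and `standardize` are hypothetical. The fits use no intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) With more predictors than observations, X^T X is rank-deficient,
#     but adding lam * I lifts every eigenvalue to at least lam.
n, p, lam = 10, 30, 0.5
X = rng.standard_normal((n, p))
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)  # at most n, so below p
min_eig = np.linalg.eigvalsh(gram + lam * np.eye(p)).min()
print(rank, min_eig)

# (2) Ridge fits depend on feature scale unless features are standardized.
def ridge_fit_predict(X, y, lam):
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ beta

def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

Z = rng.standard_normal((50, 3))
y = Z @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
Z_rescaled = Z * np.array([1000.0, 1.0, 1.0])  # same feature in a different unit

raw_changed = not np.allclose(ridge_fit_predict(Z, y, 10.0),
                              ridge_fit_predict(Z_rescaled, y, 10.0))
std_same = np.allclose(ridge_fit_predict(standardize(Z), y, 10.0),
                       ridge_fit_predict(standardize(Z_rescaled), y, 10.0))
print(raw_changed, std_same)  # True True
```

Changing the unit of one feature changes the raw ridge fit because the penalty acts on coefficient size. After standardization both versions of the data give the same fitted values.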
