Ridge Regression (L2 Regularization)
Shrinking Coefficients to Prevent Overfitting
Ridge regression adds an L2 penalty to the OLS objective, shrinking coefficients toward zero. This reduces variance at the cost of a small bias, which improves generalization when predictors are multicollinear or outnumber the observations.
- Genomics: Handle thousands of gene predictors with limited samples
- Finance: Stabilize portfolio weight estimates with correlated assets
- Text Mining: Regularize high-dimensional term frequency features
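The tuning parameter lambda controls the bias-variance tradeoff: larger values shrink the coefficients more (higher bias, lower variance), and the coefficients change continuously as lambda varies.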
Ridge regression adds an L2 penalty to the OLS objective to shrink coefficients, reducing variance at the cost of a small bias:
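Ridge Regression Objective
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$
Here,
- $\sum_{i=1}^{n} (y_i - x_i^\top \beta)^2$ = Sum of squared residuals
- $\lambda \sum_{j=1}^{p} \beta_j^2$ = L2 penalty term
- $\lambda \ge 0$ = Regularization parameter (tuning parameter)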
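The step from this objective to a closed-form solution can be sketched as follows. This is a standard derivation under the assumption that the data are centered so no intercept is needed; the matrix symbols $X$ (the $n \times p$ design matrix) and $y$ (the response vector) are introduced here for illustration.

```latex
\begin{aligned}
J(\beta) &= (y - X\beta)^\top (y - X\beta) + \lambda\, \beta^\top \beta \\
\nabla_\beta J(\beta) &= -2 X^\top (y - X\beta) + 2\lambda \beta = 0 \\
\Longrightarrow \quad (X^\top X + \lambda I)\, \beta &= X^\top y
\end{aligned}
```

Solving this linear system for $\beta$ gives the ridge estimator.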
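Ridge Estimator
$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
Here,
- $\hat{\beta}^{\text{ridge}}$ = Ridge estimator
- $\lambda I$ = Regularization term added to the diagonal of $X^\top X$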
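As a quick check of the closed form, the sketch below computes (XᵀX + λI)⁻¹Xᵀy directly with NumPy and compares it with scikit-learn's Ridge. It assumes no intercept (fit_intercept=False) so that the two formulations match; the data, seed, and the value of lam are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.5, size=n)
lam = 3.0  # arbitrary illustrative value of the regularization parameter

# Closed form: solve (X^T X + lam * I) beta = X^T y instead of inverting explicitly
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# scikit-learn's Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2,
# so alpha plays the role of lambda here
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print("closed form :", beta_closed)
print("scikit-learn:", beta_sklearn)
print("match:", np.allclose(beta_closed, beta_sklearn))
```

Using np.linalg.solve rather than forming the inverse is a common numerical choice; it gives the same estimator with better stability.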
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
np.random.seed(42)
n, p = 100, 20 # 20 features; the first three are made strongly correlated below
X = np.random.randn(n, p)
# Introduce correlations
X[:, 1] = X[:, 0] + np.random.randn(n)*0.3
X[:, 2] = X[:, 0] + np.random.randn(n)*0.3
# True model: only first 5 features matter
true_beta = np.array([2,-1.5,1,0.5,-0.8] + [0]*15)
y = X @ true_beta + np.random.randn(n)*2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Ridge path
lambdas = np.logspace(-3, 4, 100)
coef_path = []
for lam in lambdas:
    # Refit the scaled ridge pipeline at each λ and record its coefficients
    ridge = Pipeline([('scaler', StandardScaler()),
                      ('ridge', Ridge(alpha=lam))])
    ridge.fit(X_train, y_train)
    coef_path.append(ridge.named_steps['ridge'].coef_)
coef_path = np.array(coef_path)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for j in range(p):
    axes[0].plot(np.log10(lambdas), coef_path[:, j],
                 alpha=0.7, linewidth=1.5,
                 color='red' if j < 5 else 'lightblue')
axes[0].set_xlabel('log10(λ)')
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('Ridge Coefficient Path\n(Red = true predictors)')
axes[0].axvline(0, color='black', linestyle='--', alpha=0.5)
# Cross-validation to select λ
ridge_cv = Pipeline([('scaler', StandardScaler()),
                     ('ridge', RidgeCV(alphas=lambdas, cv=5, scoring='neg_mean_squared_error'))])
ridge_cv.fit(X_train, y_train)
best_lambda = ridge_cv.named_steps['ridge'].alpha_
cv_scores = []
for lam in lambdas:
    model = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge(alpha=lam))])
    # 'neg_mean_squared_error' is the valid scorer name; negate it to recover the MSE
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()
    cv_scores.append(-score)
best_idx = np.argmin(cv_scores)
axes[1].plot(np.log10(lambdas), cv_scores, 'b-', linewidth=2)
axes[1].axvline(np.log10(best_lambda), color='red', linestyle='--',
                label=f'Best λ={best_lambda:.3f}')
axes[1].set_xlabel('log10(λ)')
axes[1].set_ylabel('CV MSE')
axes[1].set_title('Ridge CV: Selecting Optimal λ')
axes[1].legend()
plt.tight_layout()
plt.savefig('ridge_regression.png', dpi=150)
plt.show()
# Compare OLS vs Ridge
ols = Pipeline([('scaler', StandardScaler()), ('ols', LinearRegression())])
best_ridge = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge(alpha=best_lambda))])
ols.fit(X_train, y_train)
best_ridge.fit(X_train, y_train)
print(f"Best λ: {best_lambda:.4f}")
print(f"OLS — Train MSE: {np.mean((y_train - ols.predict(X_train))**2):.3f}, Test MSE: {np.mean((y_test - ols.predict(X_test))**2):.3f}")
print(f"Ridge — Train MSE: {np.mean((y_train - best_ridge.predict(X_train))**2):.3f}, Test MSE: {np.mean((y_test - best_ridge.predict(X_test))**2):.3f}")
λ = 0 vs λ → ∞
When λ = 0, Ridge reduces to OLS. As λ → ∞, all coefficients shrink toward zero.
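The two limits can be illustrated numerically. The sketch below is a minimal example on synthetic data (the data, seed, and λ grid are assumptions chosen for illustration): with a very small λ the ridge coefficients are practically the OLS coefficients, and the coefficient norm falls toward zero as λ grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.0]) + rng.normal(size=80)

ols_coef = LinearRegression().fit(X, y).coef_

# Track the Euclidean norm of the ridge coefficients over a wide range of lambda
norms = {}
for lam in [1e-8, 1.0, 100.0, 1e8]:
    coef = Ridge(alpha=lam).fit(X, y).coef_
    norms[lam] = np.linalg.norm(coef)
    print(f"lambda={lam:g}  ||beta||={norms[lam]:.6f}")

# With lambda close to zero, ridge is practically indistinguishable from OLS
near_ols = Ridge(alpha=1e-8).fit(X, y).coef_
print("max |ridge - OLS| at lambda=1e-8:", np.abs(near_ols - ols_coef).max())
```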
Key Takeaways
Summary: Ridge Regression
- Ridge adds the L2 penalty λΣβⱼ², which shrinks all coefficients toward zero but rarely to exactly zero
- λ = 0: OLS; λ → ∞: all coefficients → 0
- Solves multicollinearity: (XᵀX + λI) is invertible for any λ > 0
- Select λ via cross-validation; RidgeCV does this efficiently
- Standardize features before fitting ridge: the penalty treats all coefficients equally, so the fit depends on feature scale
- Ridge for multicollinearity; Lasso for feature selection
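The invertibility point can be seen in a small sketch with a deliberately degenerate design. It assumes two exactly duplicated columns (an extreme case of multicollinearity) and an arbitrary λ of 0.5: XᵀX then has a zero eigenvalue, while XᵀX + λI has all eigenvalues at least λ, because XᵀX is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
z = rng.normal(size=n)
X = np.column_stack([x, x, z])  # columns 0 and 1 are identical
y = 3.0 * x + 1.0 * z + rng.normal(scale=0.1, size=n)

gram = X.T @ X
min_eig_ols = np.linalg.eigvalsh(gram).min()
print("smallest eigenvalue of X^T X:", min_eig_ols)  # numerically zero: singular

lam = 0.5  # arbitrary illustrative value
ridge_gram = gram + lam * np.eye(3)
min_eig_ridge = np.linalg.eigvalsh(ridge_gram).min()
print("smallest eigenvalue of X^T X + lam*I:", min_eig_ridge)

# The regularized system has a unique solution even though OLS does not
beta = np.linalg.solve(ridge_gram, X.T @ y)
print("ridge coefficients:", beta)
```

In this example the two duplicated columns receive equal coefficients, since splitting the weight evenly minimizes the L2 penalty for a given fit.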