Lasso Regression (L1 Regularization)
Regression Analysis
Feature Selection Through Coefficient Shrinkage
The Lasso adds an L1 penalty that can drive coefficients to exactly zero, performing automatic feature selection. This produces sparse, interpretable models that identify the most important predictors from high-dimensional data.
- Biomedical Research: Identify key biomarkers from thousands of candidates
- Marketing: Select the most predictive customer attributes for targeting
- Environmental Modeling: Pinpoint critical factors from numerous measurements
While Ridge shrinks all coefficients but keeps them nonzero, Lasso can drop features entirely by setting their coefficients to exactly zero.
Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty:
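Lasso Regression Objective
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\ \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
Here,
- $\sum_{i=1}^{n} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ = Sum of squared residuals
- $\lambda \sum_{j=1}^{p} |\beta_j|$ = L1 penalty term
- $\lambda \ge 0$ = Regularization parameter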
Unlike Ridge, Lasso produces sparse solutions — many coefficients shrink to exactly zero.
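
To make the objective concrete, here is a minimal sketch that evaluates it with NumPy. The helper name `lasso_objective` and the synthetic data are hypothetical, introduced only for illustration. One detail to keep in mind: scikit-learn's `Lasso` minimizes a rescaled form, RSS / (2n) + alpha * ||β||₁, so its `alpha` corresponds to λ / (2n) when the objective is written as the sum of squared residuals plus λ times the L1 norm. The sketch uses scikit-learn's scaling so the numbers are comparable with a fitted model.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_objective(X, y, beta, intercept, alpha):
    """scikit-learn's Lasso objective: RSS / (2n) + alpha * ||beta||_1."""
    residuals = y - (X @ beta + intercept)
    rss = np.sum(residuals ** 2)
    l1_penalty = np.sum(np.abs(beta))
    return rss / (2 * len(y)) + alpha * l1_penalty

# Small synthetic problem (assumed data, not from the main example)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

alpha = 0.1
model = Lasso(alpha=alpha, tol=1e-10, max_iter=100000).fit(X_demo, y_demo)
fitted_value = lasso_objective(X_demo, y_demo, model.coef_, model.intercept_, alpha)

# Moving the coefficients away from the fitted solution should not lower the objective
perturbed = model.coef_ + 0.05 * rng.normal(size=5)
perturbed_value = lasso_objective(X_demo, y_demo, perturbed, model.intercept_, alpha)
print(f"fitted: {fitted_value:.4f}, perturbed: {perturbed_value:.4f}")
```

Because the fitted coefficients minimize this objective, any perturbation of them should give an equal or larger value.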
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
np.random.seed(42)
n, p = 150, 30 # 30 features, only 5 are truly relevant
X = np.random.randn(n, p)
true_beta = np.array([3, -2, 1.5, -1, 0.8] + [0]*25) # only 5 nonzero
y = X @ true_beta + np.random.randn(n)*1.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Lasso path
lambdas = np.logspace(-3, 2, 200)
lasso_coefs = []
ridge_coefs = []
for lam in lambdas:
    # Standardize features so the penalty treats every coefficient on the same scale
    lasso_pipe = Pipeline([('s', StandardScaler()), ('m', Lasso(alpha=lam, max_iter=10000))])
    ridge_pipe = Pipeline([('s', StandardScaler()), ('m', Ridge(alpha=lam))])
    lasso_pipe.fit(X_train, y_train)
    ridge_pipe.fit(X_train, y_train)
    lasso_coefs.append(lasso_pipe.named_steps['m'].coef_)
    ridge_coefs.append(ridge_pipe.named_steps['m'].coef_)
lasso_coefs = np.array(lasso_coefs)
ridge_coefs = np.array(ridge_coefs)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Lasso path
for j in range(p):
    # Red = truly relevant features (first 5), gray = irrelevant features
    axes[0].plot(np.log10(lambdas), lasso_coefs[:, j],
                 linewidth=1.5, color='red' if j < 5 else 'lightgray', alpha=0.7)
axes[0].set_xlabel('log10(λ)')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('Lasso Path — Sparse Solutions\n(coefficients become exactly 0)')
# Sparsity at the cross-validated optimal λ
lasso_cv = Pipeline([('s', StandardScaler()), ('m', LassoCV(cv=5, random_state=42))])
lasso_cv.fit(X_train, y_train)
best_lam = lasso_cv.named_steps['m'].alpha_
best_coefs = lasso_cv.named_steps['m'].coef_
nonzero = (best_coefs != 0).sum()
axes[1].bar(range(p), best_coefs, color=['red' if j<5 else 'steelblue' for j in range(p)])
axes[1].axhline(0, color='black', linewidth=0.5)
axes[1].set_title(f'Lasso Coefficients (λ={best_lam:.4f})\n{nonzero}/{p} nonzero features selected')
axes[1].set_xlabel('Feature Index')
# Lasso vs Ridge: number of nonzero features
lasso_nonzero = [(lasso_coefs[i] != 0).sum() for i in range(len(lambdas))]
ridge_nonzero = [(ridge_coefs[i] != 0).sum() for i in range(len(lambdas))]
axes[2].plot(np.log10(lambdas), lasso_nonzero, 'r-', linewidth=2, label='Lasso')
axes[2].plot(np.log10(lambdas), ridge_nonzero, 'b-', linewidth=2, label='Ridge')
axes[2].set_title('Sparsity: Lasso vs Ridge')
axes[2].set_xlabel('log10(λ)')
axes[2].set_ylabel('# Nonzero Coefficients')
axes[2].legend()
plt.tight_layout()
plt.savefig('lasso_regression.png', dpi=150)
plt.show()
test_mse_lasso = np.mean((y_test - lasso_cv.predict(X_test))**2)
print(f"Lasso: best λ={best_lam:.4f}, {nonzero} features selected, Test MSE={test_mse_lasso:.4f}")
print(f"True nonzero features: {(true_beta!=0).sum()}")
# Counts features whose status matches the truth: relevant ones kept plus irrelevant ones dropped
print(f"Features correctly kept or dropped: {((best_coefs != 0) == (true_beta != 0)).sum()}/{p}")
Ridge vs Lasso
| Aspect | Ridge | Lasso |
|--------|-------|-------|
| Penalty | L2 (squared) | L1 (absolute) |
| Shrinkage | Toward 0, not to 0 | Can be exactly 0 |
| Feature selection | No | Yes (sparse) |
| Multicollinearity | Keeps all correlated features, shrinking them together | Tends to keep one feature from a correlated group, somewhat arbitrarily |
| Solution | Closed form | Iterative (coordinate descent) |
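
The "Iterative (coordinate descent)" entry can be illustrated with a minimal sketch. It assumes no intercept, uses scikit-learn's objective scaling (RSS / (2n) + alpha * ||β||₁), and runs a fixed number of sweeps instead of checking convergence. The helper names `soft_threshold` and `lasso_coordinate_descent` are hypothetical, and scikit-learn's own solver is more elaborate than this.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, threshold):
    """Shrink z toward zero by `threshold`; returns exactly 0 when |z| <= threshold."""
    return np.sign(z) * max(abs(z) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimize ||y - X b||^2 / (2n) + alpha * ||b||_1 one coordinate at a time."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq_norms = (X ** 2).sum(axis=0) / n
    residual = y.copy()  # equals y - X @ beta, since beta starts at zero
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j's own contribution
            rho = X[:, j] @ residual / n + col_sq_norms[j] * beta[j]
            new_beta_j = soft_threshold(rho, alpha) / col_sq_norms[j]
            # Keep the residual in sync with the updated coefficient
            residual += X[:, j] * (beta[j] - new_beta_j)
            beta[j] = new_beta_j
    return beta

# Synthetic data (assumed for illustration): 3 relevant and 3 irrelevant features
rng = np.random.default_rng(1)
X_cd = rng.normal(size=(200, 6))
y_cd = X_cd @ np.array([1.5, 0.0, -2.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=200)

beta_cd = lasso_coordinate_descent(X_cd, y_cd, alpha=0.1)
beta_sk = Lasso(alpha=0.1, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X_cd, y_cd).coef_
print(np.round(beta_cd, 4))
print(np.round(beta_sk, 4))
```

The soft-thresholding step is where exact zeros come from: whenever a feature's correlation with the partial residual falls below `alpha` in absolute value, its coefficient is set to zero and not merely made small. Ridge needs no such loop because its objective is differentiable everywhere and has a closed-form solution.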
L1 vs L2 Geometry
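The penalized solution sits where the elliptical contours of the squared-error loss first touch the penalty region. The L1 ball (a diamond in two dimensions) has corners on the axes, so that first contact often happens at a corner, where one or more coefficients are exactly zero. The L2 ball is smooth with no corners, so the contact point almost never lies on an axis and coefficients are shrunk without reaching exactly zero.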
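
The effect of the corners can be seen in closed form in one special case. The worked derivation below assumes an orthonormal design, $X^\top X = I$ (a simplifying assumption not made elsewhere in this section), and writes the objective as the sum of squared residuals plus $\lambda$ times the penalty. Under that assumption the problem separates per coefficient, with $\hat\beta_j^{\mathrm{OLS}}$ denoting the ordinary least squares estimate.

```latex
% Orthonormal design: X^T X = I, so RSS = const + \sum_j (\beta_j - \hat\beta_j^{OLS})^2
\begin{align*}
\text{Ridge:}\quad & \min_{\beta_j}\ (\beta_j - \hat\beta_j^{\mathrm{OLS}})^2 + \lambda \beta_j^2
  && \Rightarrow\quad \hat\beta_j^{\mathrm{ridge}} = \frac{\hat\beta_j^{\mathrm{OLS}}}{1 + \lambda} \\
\text{Lasso:}\quad & \min_{\beta_j}\ (\beta_j - \hat\beta_j^{\mathrm{OLS}})^2 + \lambda |\beta_j|
  && \Rightarrow\quad \hat\beta_j^{\mathrm{lasso}} = \operatorname{sign}(\hat\beta_j^{\mathrm{OLS}})\,
     \max\!\left(|\hat\beta_j^{\mathrm{OLS}}| - \tfrac{\lambda}{2},\ 0\right)
\end{align*}
```

Ridge rescales every coefficient by the same factor, so a nonzero least squares coefficient never becomes exactly zero. Lasso subtracts a fixed amount and clips at zero, so any coefficient with $|\hat\beta_j^{\mathrm{OLS}}| \le \lambda/2$ is set exactly to zero. For example, with $\lambda = 1$ and $\hat\beta_j^{\mathrm{OLS}} = 0.4$, Ridge gives $0.2$ while Lasso gives $0$.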
Key Takeaways
Summary: Lasso Regression
- Lasso performs feature selection: some coefficients shrink to exactly zero
- The L1 penalty creates sparsity because the L1 ball has corners on the axes
- Use Lasso when you believe few features truly matter (sparse true model)
- Use Ridge when all features contribute (dense true model)
- Elastic Net combines the L1 and L2 penalties, keeping Lasso's sparsity while handling correlated features more stably
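
As a minimal sketch of the Elastic Net point, the following uses scikit-learn's `ElasticNet` on synthetic data in which the first two columns are exact duplicates. This is an artificial setup chosen only for illustration, and the `alpha` and `l1_ratio` values are arbitrary assumptions. With identical columns, the L2 part of the penalty pushes Elastic Net to split the weight equally between them, whereas the Lasso solution is not unique and which column receives the weight depends on the solver.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(2)
signal = rng.normal(size=200)
noise_features = rng.normal(size=(200, 3))
# Columns 0 and 1 are exact duplicates; columns 2-4 are irrelevant noise
X_en = np.column_stack([signal, signal, noise_features])
y_en = 4.0 * signal + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1, tol=1e-10, max_iter=100000).fit(X_en, y_en)
# l1_ratio=0.5 weights the L1 and L2 penalties equally
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100000).fit(X_en, y_en)

print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```

In practice `l1_ratio` is usually tuned by cross-validation together with `alpha`, for example with `ElasticNetCV`.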