Lasso Regression (L1 Regularization)


Feature Selection Through Coefficient Shrinkage

The Lasso adds an L1 penalty that can drive coefficients to exactly zero, performing automatic feature selection. This produces sparse, interpretable models that identify the most important predictors from high-dimensional data.

Biomedical Research — Identify key biomarkers from thousands of candidates
Marketing — Select the most predictive customer attributes for targeting
Environmental Modeling — Pinpoint critical factors from numerous measurements

While Ridge shrinks all coefficients toward zero without eliminating any, Lasso can set the coefficients of uninformative features to exactly zero.

Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty:

Lasso Regression Objective

$$\text{Minimize: } \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\sum_j|\beta_j|$$

Here,

$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ = sum of squared residuals
$\lambda\sum_j|\beta_j|$ = L1 penalty term
$\lambda$ = regularization parameter ($\lambda \ge 0$); larger values shrink more coefficients to exactly zero

Unlike Ridge, Lasso produces sparse solutions — many coefficients shrink to exactly zero.
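To see where the exact zeros come from, here is a minimal sketch under a simplifying assumption: a single coefficient whose unpenalized least-squares estimate is z, so the objective reduces to (z − β)² + λ|β| for Lasso and (z − β)² + λβ² for Ridge. The function names `lasso_1d` and `ridge_1d` are hypothetical, chosen only for this illustration.

```python
def lasso_1d(z, lam):
    """Minimizer of (z - b)**2 + lam * abs(b): soft-thresholding at lam / 2."""
    threshold = lam / 2.0
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0  # |z| <= lam / 2: the coefficient is set to exactly zero


def ridge_1d(z, lam):
    """Minimizer of (z - b)**2 + lam * b**2: proportional shrinkage."""
    return z / (1.0 + lam)


lam = 1.0
for z in (2.0, 0.8, 0.3, -0.3):
    print(f"z={z:+.1f}  lasso={lasso_1d(z, lam):+.3f}  ridge={ridge_1d(z, lam):+.3f}")
```

In this one-dimensional setting, any estimate with |z| ≤ λ/2 is mapped to exactly zero by the L1 penalty, while the L2 penalty only rescales z and leaves it nonzero.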


import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Synthetic data: 30 features, of which only the first 5 are truly relevant
np.random.seed(42)
n, p = 150, 30
X = np.random.randn(n, p)
true_beta = np.array([3, -2, 1.5, -1, 0.8] + [0]*25)  # only 5 nonzero
y = X @ true_beta + np.random.randn(n)*1.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

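# Coefficient paths: fit Lasso and Ridge over a grid of regularization strengths.
# Note: scikit-learn calls the strength `alpha`, and its Lasso objective scales the
# squared error by 1/(2n), so `alpha` is not numerically identical to λ in the formula above.
lambdas = np.logspace(-3, 2, 200)
lasso_coefs = []
ridge_coefs = []
for lam in lambdas:
    lasso_pipe = Pipeline([('s', StandardScaler()), ('m', Lasso(alpha=lam, max_iter=10000))])
    ridge_pipe = Pipeline([('s', StandardScaler()), ('m', Ridge(alpha=lam))])
    lasso_pipe.fit(X_train, y_train)
    ridge_pipe.fit(X_train, y_train)
    lasso_coefs.append(lasso_pipe.named_steps['m'].coef_)
    ridge_coefs.append(ridge_pipe.named_steps['m'].coef_)

lasso_coefs = np.array(lasso_coefs)  # shape: (len(lambdas), p)
ridge_coefs = np.array(ridge_coefs)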

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Panel 1: Lasso path (true predictors in red, noise features in gray)
for j in range(p):
    axes[0].plot(np.log10(lambdas), lasso_coefs[:, j],
                 linewidth=1.5, color='red' if j < 5 else 'lightgray', alpha=0.7)
axes[0].set_xlabel('log10(λ)')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('Lasso Path: Sparse Solutions\n(coefficients become exactly 0)')

# Panel 2: sparsity at the cross-validated λ
lasso_cv = Pipeline([('s', StandardScaler()), ('m', LassoCV(cv=5, random_state=42))])
lasso_cv.fit(X_train, y_train)
best_lam = lasso_cv.named_steps['m'].alpha_
best_coefs = lasso_cv.named_steps['m'].coef_

nonzero = (best_coefs != 0).sum()
axes[1].bar(range(p), best_coefs, color=['red' if j < 5 else 'steelblue' for j in range(p)])
axes[1].axhline(0, color='black', linewidth=0.5)
axes[1].set_title(f'Lasso Coefficients (λ={best_lam:.4f})\n{nonzero}/{p} nonzero features selected')
axes[1].set_xlabel('Feature Index')

# Panel 3: number of nonzero coefficients along the path, Lasso vs Ridge
lasso_nonzero = [(lasso_coefs[i] != 0).sum() for i in range(len(lambdas))]
ridge_nonzero = [(ridge_coefs[i] != 0).sum() for i in range(len(lambdas))]
axes[2].plot(np.log10(lambdas), lasso_nonzero, 'r-', linewidth=2, label='Lasso')
axes[2].plot(np.log10(lambdas), ridge_nonzero, 'b-', linewidth=2, label='Ridge')
axes[2].set_title('Sparsity: Lasso vs Ridge')
axes[2].set_xlabel('log10(λ)')
axes[2].set_ylabel('# Nonzero Coefficients')
axes[2].legend()

plt.tight_layout()
plt.savefig('lasso_regression.png', dpi=150)
plt.show()

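# Held-out error and support recovery for the cross-validated model
test_mse_lasso = np.mean((y_test - lasso_cv.predict(X_test))**2)
print(f"Lasso: best λ={best_lam:.4f}, {nonzero} features selected, Test MSE={test_mse_lasso:.4f}")
print(f"True nonzero features: {(true_beta!=0).sum()}")
# Counts features whose selected/excluded status matches the true model
print(f"Correctly classified (selected or excluded): {sum(1 for j in range(p) if (best_coefs[j]!=0) == (true_beta[j]!=0))}/{p}")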
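The last line of the script counts a feature as correct whether it was rightly kept or rightly dropped, so the 25 true zeros dominate the score. As a complementary sketch, selection quality can be split into precision and recall over the true support. The helper `selection_report` is a hypothetical name introduced here, and the toy vectors are made up for illustration; they are not output of the model above.

```python
import numpy as np


def selection_report(true_beta, est_coefs, tol=0.0):
    """Compare the estimated support (|coef| > tol) with the true support."""
    true_support = np.asarray(true_beta) != 0
    est_support = np.abs(np.asarray(est_coefs)) > tol
    tp = int(np.sum(true_support & est_support))   # relevant and selected
    fp = int(np.sum(~true_support & est_support))  # irrelevant but selected
    fn = int(np.sum(true_support & ~est_support))  # relevant but dropped
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy vectors, made up for illustration
toy_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
toy_est = np.array([2.7, 0.0, 0.1, 0.0, 0.0])
report = selection_report(toy_true, toy_est)
print(report)
```

With the script's own variables, the call would be `selection_report(true_beta, best_coefs)`.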

Ridge vs Lasso

| Aspect | Ridge | Lasso |
|--------|-------|-------|
| Penalty | L2 (squared) | L1 (absolute) |
| Shrinkage | Toward 0, not to 0 | Can be exactly 0 |
| Feature selection | No | Yes (sparse) |
| Multicollinearity | Keeps all | Picks one arbitrarily |
| Solution | Closed form | Iterative (coordinate descent) |
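The "Iterative (coordinate descent)" entry can be made concrete with a small NumPy sketch. It minimizes the objective exactly as written earlier, ‖y − Xβ‖² + λΣ|β_j|, by updating one coefficient at a time with a soft-threshold. This is an illustrative implementation under stated assumptions (centered data, no intercept, hypothetical function names, made-up toy data), not scikit-learn's solver, which also scales the loss differently.

```python
import numpy as np


def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return zero when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_sweeps=500, tol=1e-10):
    """Cyclic coordinate descent for ||y - X b||^2 + lam * sum_j |b_j|.

    Assumes X and y are already centered (no intercept is fitted).
    """
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)        # x_j^T x_j for every column
    residual = np.array(y, dtype=float)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(n_features):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            old = beta[j]
            # x_j^T times the partial residual (feature j's contribution added back)
            rho = X[:, j] @ residual + col_sq[j] * old
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            if new != old:
                residual -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break  # no coefficient moved noticeably in this sweep
    return beta


# Toy data (made up for illustration): 2 informative features, 6 noise features
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 8))
X_demo -= X_demo.mean(axis=0)
beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y_demo = X_demo @ beta_true + 0.5 * rng.standard_normal(100)
y_demo -= y_demo.mean()

beta_hat = lasso_coordinate_descent(X_demo, y_demo, lam=40.0)
print(np.round(beta_hat, 3))
```

Each update has a closed form even though the full problem does not, which is why coordinate descent is a common choice for Lasso. Any λ at or above 2·max|Xᵀy| returns the all-zero vector in this formulation.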

L1 vs L2 Geometry

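The L1 penalty creates sparsity because the L1 ball (a diamond in two dimensions) has corners on the coordinate axes. The elliptical contours of the squared-error loss tend to first touch this constraint region at a corner, where some coefficients are exactly zero. The L2 ball is smooth, so the contact point almost never lies exactly on an axis.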
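The geometric argument can be checked numerically with a brute-force sketch. It assumes an orthonormal two-feature design, so that minimizing the squared error inside a ball amounts to finding the point of the ball closest to the least-squares solution. The values of `beta_ols` and `t`, and the helper `closest`, are made up for illustration.

```python
import numpy as np

# Hypothetical least-squares solution for an orthonormal 2-feature problem
beta_ols = np.array([2.0, 0.5])
t = 1.0  # radius of the constraint region (plays the role of the penalty strength)

# Sample the boundary of each ball by sweeping directions around the origin
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
l2_boundary = t * directions                                                  # circle
l1_boundary = t * directions / np.abs(directions).sum(axis=1, keepdims=True)  # diamond


def closest(points, target):
    """Return the sampled boundary point with the smallest squared distance to target."""
    sq_dist = ((points - target) ** 2).sum(axis=1)
    return points[np.argmin(sq_dist)]


l1_solution = closest(l1_boundary, beta_ols)
l2_solution = closest(l2_boundary, beta_ols)
print("L1-constrained solution:", l1_solution)
print("L2-constrained solution:", l2_solution)
```

Under these assumptions the L1 solution lands on the corner of the diamond, with the second coefficient exactly zero, while the L2 solution keeps both coefficients nonzero.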

Key Takeaways

Summary: Lasso Regression

Lasso performs feature selection: some coefficients shrink to exactly zero
L1 penalty creates sparsity because the L1 ball has corners on the axes
Use Lasso when you believe few features truly matter (sparse true model)
Use Ridge when most features contribute (dense true model)
Elastic Net combines the L1 and L2 penalties, giving sparse models while treating groups of correlated features more stably than Lasso alone
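The Elastic Net point, and the "picks one arbitrarily" behavior of Lasso, can be illustrated with a minimal sketch. It assumes two identical feature columns, so the fit depends only on the sum of their coefficients and only the penalty distinguishes different ways of splitting that sum. The function names and the weights `lam1` and `lam2` are hypothetical.

```python
def l1_penalty(coefs):
    return sum(abs(c) for c in coefs)


def l2_penalty(coefs):
    return sum(c * c for c in coefs)


def elastic_net_penalty(coefs, lam1, lam2):
    """Weighted combination of the L1 and L2 penalties."""
    return lam1 * l1_penalty(coefs) + lam2 * l2_penalty(coefs)


# Three ways to split a total effect of 2.0 across two identical features
splits = [(2.0, 0.0), (1.5, 0.5), (1.0, 1.0)]
for coefs in splits:
    print(coefs,
          "L1:", l1_penalty(coefs),
          "L2:", l2_penalty(coefs),
          "ElasticNet:", elastic_net_penalty(coefs, lam1=1.0, lam2=1.0))
```

The L1 penalty is the same for every split, so Lasso has no reason to prefer one duplicate over the other. The L2 term is smallest for the even split, which is why Ridge and Elastic Net tend to share weight across correlated features.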
