Lasso Regression (L1 Regularization) — Feature Selection

Feature Selection Through Coefficient Shrinkage

The Lasso adds an L1 penalty that can drive coefficients to exactly zero, performing automatic feature selection. This produces sparse, interpretable models that identify the most important predictors from high-dimensional data.

  • Biomedical Research — Identify key biomarkers from thousands of candidates

  • Marketing — Select the most predictive customer attributes for targeting

  • Environmental Modeling — Pinpoint critical factors from numerous measurements

While Ridge shrinks all coefficients, Lasso selects only the essential ones.


Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty:

Lasso Regression Objective

$$\text{Minimize: } \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\sum_j|\beta_j|$$

Here,

  • $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ = Sum of squared residuals
  • $\lambda\sum_j|\beta_j|$ = L1 penalty term
  • $\lambda$ = Regularization parameter

Unlike Ridge, Lasso produces sparse solutions — many coefficients shrink to exactly zero.
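
To see why the L1 penalty yields exact zeros while a squared penalty does not, it can help to look at a single coefficient in isolation. The sketch below assumes a simplified one-dimensional version of the objective, (z - b)² + λ|b|, where z stands for the unpenalized estimate. The function names `soft_threshold` and `ridge_shrink` and the sample values are illustrative assumptions, not part of the lesson's code.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer over b of (z - b)**2 + lam * |b| (one coefficient at a time)."""
    # Values with |z| <= lam / 2 are set to exactly zero; others move toward zero by lam / 2.
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

def ridge_shrink(z, lam):
    """Minimizer over b of (z - b)**2 + lam * b**2 (one coefficient at a time)."""
    # Every value is scaled by the same factor, so nonzero inputs stay nonzero.
    return z / (1.0 + lam)

z = np.array([3.0, -2.0, 0.4, -0.1])   # hypothetical unpenalized estimates
lam = 1.0

lasso_b = soft_threshold(z, lam)
ridge_b = ridge_shrink(z, lam)

print("L1 nonzero count:", np.count_nonzero(lasso_b))
print("L2 nonzero count:", np.count_nonzero(ridge_b))
```

In this simplified setting the L1 update subtracts a fixed amount and clips at zero, while the L2 update only rescales, which is one way to read the claim that Ridge shrinks toward zero but not to zero.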


import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Synthetic data: 30 features, only the first 5 are truly relevant
np.random.seed(42)
n, p = 150, 30
X = np.random.randn(n, p)
true_beta = np.array([3, -2, 1.5, -1, 0.8] + [0]*25)  # only 5 nonzero
y = X @ true_beta + np.random.randn(n)*1.5  # Gaussian noise with standard deviation 1.5

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)



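# Lasso and Ridge coefficient paths over a log-spaced grid of penalties.
# scikit-learn names the regularization strength `alpha`; it plays the role of λ.
# Lasso and Ridge scale their objectives differently in scikit-learn, so equal
# alpha values are not directly comparable between the two models.
lambdas = np.logspace(-3, 2, 200)
lasso_coefs = []
ridge_coefs = []
for lam in lambdas:
    lasso_pipe = Pipeline([('s', StandardScaler()), ('m', Lasso(alpha=lam, max_iter=10000))])
    ridge_pipe = Pipeline([('s', StandardScaler()), ('m', Ridge(alpha=lam))])
    lasso_pipe.fit(X_train, y_train)
    ridge_pipe.fit(X_train, y_train)
    lasso_coefs.append(lasso_pipe.named_steps['m'].coef_)
    ridge_coefs.append(ridge_pipe.named_steps['m'].coef_)

# Shape (n_lambdas, p): one row of coefficients per penalty value
lasso_coefs = np.array(lasso_coefs)
ridge_coefs = np.array(ridge_coefs)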



fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Panel 1: Lasso path (truly relevant features in red, irrelevant ones in gray)
for j in range(p):
    axes[0].plot(np.log10(lambdas), lasso_coefs[:, j],
                 linewidth=1.5, color='red' if j < 5 else 'lightgray', alpha=0.7)
axes[0].set_xlabel('log10(λ)')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('Lasso Path: Sparse Solutions\n(coefficients become exactly 0)')

# Panel 2: coefficients at the λ chosen by 5-fold cross-validation
lasso_cv = Pipeline([('s', StandardScaler()), ('m', LassoCV(cv=5, random_state=42))])
lasso_cv.fit(X_train, y_train)
best_lam = lasso_cv.named_steps['m'].alpha_
best_coefs = lasso_cv.named_steps['m'].coef_

nonzero = (best_coefs != 0).sum()
axes[1].bar(range(p), best_coefs, color=['red' if j<5 else 'steelblue' for j in range(p)])
axes[1].axhline(0, color='black', linewidth=0.5)
axes[1].set_title(f'Lasso Coefficients (λ={best_lam:.4f})\n{nonzero}/{p} nonzero features selected')
axes[1].set_xlabel('Feature Index')

# Panel 3: Lasso vs Ridge, number of nonzero coefficients at each λ
lasso_nonzero = [(lasso_coefs[i] != 0).sum() for i in range(len(lambdas))]
ridge_nonzero = [(ridge_coefs[i] != 0).sum() for i in range(len(lambdas))]
axes[2].plot(np.log10(lambdas), lasso_nonzero, 'r-', linewidth=2, label='Lasso')
axes[2].plot(np.log10(lambdas), ridge_nonzero, 'b-', linewidth=2, label='Ridge')
axes[2].set_title('Sparsity: Lasso vs Ridge')
axes[2].set_xlabel('log10(λ)')
axes[2].set_ylabel('# Nonzero Coefficients')
axes[2].legend()

plt.tight_layout()
plt.savefig('lasso_regression.png', dpi=150)
plt.show()



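# Evaluate the cross-validated Lasso on the held-out test set
test_mse_lasso = np.mean((y_test - lasso_cv.predict(X_test))**2)
print(f"Lasso: best λ={best_lam:.4f}, {nonzero} features selected, Test MSE={test_mse_lasso:.4f}")
print(f"True nonzero features: {(true_beta!=0).sum()}")
# Count features whose selected/not-selected status matches the true model
n_agree = sum(1 for j in range(p) if (best_coefs[j]!=0) == (true_beta[j]!=0))
print(f"Features correctly classified as relevant or irrelevant: {n_agree}/{p}")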
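
The agreement count printed above treats correctly excluded features the same as correctly selected ones, so it can look high even when relevant features are missed. A possible complement is to report precision and recall of the selected set. The sketch below is an assumption-laden illustration: the helper `support_recovery` is hypothetical, and `est_coef` holds made-up coefficients rather than results from the lesson's run.

```python
import numpy as np

def support_recovery(coef, true_coef):
    """Compare the set of nonzero estimated coefficients with the true nonzero set."""
    selected = coef != 0
    relevant = true_coef != 0
    tp = int(np.sum(selected & relevant))    # relevant features that were selected
    fp = int(np.sum(selected & ~relevant))   # irrelevant features that were selected
    fn = int(np.sum(~selected & relevant))   # relevant features that were missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}

# Same true model as in the lesson: 5 relevant features out of 30
true_beta = np.array([3, -2, 1.5, -1, 0.8] + [0] * 25)

# Hypothetical fitted coefficients (illustrative values only)
est_coef = np.zeros(30)
est_coef[[0, 1, 2, 3]] = [2.8, -1.7, 1.2, -0.6]   # four relevant features kept
est_coef[[10, 17]] = [0.05, -0.03]                # two irrelevant features kept

report = support_recovery(est_coef, true_beta)
print(report)
```

With the lesson's variables, the same helper could be called as `support_recovery(best_coefs, true_beta)`.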

Ridge vs Lasso

| Aspect | Ridge | Lasso |
|--------|-------|-------|
| Penalty | L2 (squared) | L1 (absolute) |
| Shrinkage | Toward 0, never exactly 0 | Can be exactly 0 |
| Feature selection | No | Yes (sparse) |
| Multicollinearity | Keeps all correlated features | Tends to keep one of a correlated group, somewhat arbitrarily |
| Solution | Closed form | Iterative (coordinate descent) |
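
The last row of the table can be made concrete with a small NumPy sketch. It assumes the objective exactly as written in this lesson (squared residuals plus λ times the penalty, no intercept), which is scaled differently from scikit-learn's `alpha`. The function names, the synthetic data, and the choice of λ are illustrative assumptions, and the coordinate descent loop is a bare-bones version without convergence checks, not scikit-learn's implementation.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||_2^2 by solving one linear system."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize ||y - X b||^2 + lam * sum_j |b_j| by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)          # X_j^T X_j for each column
    residual = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that leaves feature j out
            rho = X[:, j] @ residual + col_sq[j] * beta[j]
            # One-dimensional Lasso solution: soft-threshold at lam / 2, then rescale
            new_bj = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            # Keep the residual in sync with the updated coefficient
            residual += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

# Small synthetic problem (illustrative): 3 relevant features out of 8
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ true_beta + 0.5 * rng.standard_normal(100)

lam = 100.0
beta_ridge = ridge_closed_form(X, y, lam)
beta_lasso = lasso_coordinate_descent(X, y, lam)

print("Ridge nonzero coefficients:", np.count_nonzero(beta_ridge))
print("Lasso nonzero coefficients:", np.count_nonzero(beta_lasso))
```

Ridge needs a single linear solve because its penalty is differentiable everywhere. The absolute value in the Lasso penalty is not differentiable at zero, so the sketch updates one coefficient at a time, where each one-dimensional subproblem has an explicit soft-thresholding solution.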

L1 vs L2 Geometry

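The L1 penalty creates sparsity because the L1 ball has corners on the coordinate axes, where one or more coefficients are exactly zero. The contours of the squared-error loss often first touch the ball at one of these corners. The L2 ball is smooth, with no corners, so the point of contact almost never lies exactly on an axis: coefficients are shrunk but not set to zero.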
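
A rough numerical illustration of this geometry in two dimensions: the sketch below searches the boundary of the unit L1 ball and the unit L2 ball for the point that minimizes a toy loss with circular contours. The loss, its center `target`, and the grid resolution are assumptions chosen for illustration. Because the center lies outside both balls, the constrained minimum sits on the boundary in each case.

```python
import numpy as np

# Hypothetical unpenalized minimizer of a toy loss (lies outside both unit balls)
target = np.array([2.0, 0.5])

def loss(b):
    # Toy squared-error loss with circular contours centered at `target`
    return np.sum((b - target) ** 2, axis=-1)

# Sample directions around the origin and rescale each onto the ball boundaries
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
l1_boundary = directions / np.sum(np.abs(directions), axis=1, keepdims=True)
l2_boundary = directions / np.linalg.norm(directions, axis=1, keepdims=True)

# Best point on each boundary
l1_best = l1_boundary[np.argmin(loss(l1_boundary))]
l2_best = l2_boundary[np.argmin(loss(l2_boundary))]

print("Best point on L1 ball:", l1_best)
print("Best point on L2 ball:", l2_best)
```

In this toy setup the L1 solution lands on the corner of the diamond on the first axis, so the second coefficient is exactly zero, while the L2 solution keeps both coefficients nonzero.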


Key Takeaways

Summary: Lasso Regression

  • Lasso performs feature selection: with a large enough λ, some coefficients shrink to exactly zero

  • L1 penalty creates sparsity because the L1 ball has corners at axes

  • Use Lasso when you believe few features truly matter (sparse true model)

  • Use Ridge when all features contribute (dense true model)

  • Elastic Net combines the L1 and L2 penalties, keeping Lasso's sparsity while handling correlated features more like Ridge
