Linear Regression: Bias-Variance Tradeoff & Regularization (L1/L2)

Understanding the foundational algorithm behind predictive modeling

Interview Question

"Explain the bias-variance tradeoff in the context of linear regression. How does regularization (L1 vs L2) address overfitting, and when would you choose one over the other?"

Difficulty: Medium-Hard | Frequently asked at Google, Amazon, Meta


Theoretical Foundation

What is Linear Regression?

Linear regression models the relationship between a dependent variable $y$ and one or more independent variables $X$ by fitting a linear equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

where $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_p$ are coefficients, and $\epsilon$ is the error term assumed to follow $\mathcal{N}(0, \sigma^2)$.

The ordinary least squares (OLS) estimator minimizes the residual sum of squares:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \arg\min_{\beta} \|y - X\beta\|_2^2$$

The closed-form solution is:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
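As a minimal sketch of the closed-form solution, the normal equations can be solved directly with NumPy. The helper name `ols_fit`, the synthetic data, and the choice to prepend a column of ones for the intercept are assumptions made for illustration.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients (intercept first) via the normal equations."""
    # Prepend a column of ones so the first coefficient acts as the intercept.
    Xb = np.column_stack([np.ones(len(X)), X])
    # Solve (X^T X) beta = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Synthetic data with known coefficients (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.5, 2.0, -1.0, 0.5])  # intercept, then three slopes
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=200)

beta_hat = ols_fit(X, y)

# Cross-check against NumPy's least-squares solver.
Xb = np.column_stack([np.ones(len(X)), X])
beta_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print("normal equations:", np.round(beta_hat, 3))
print("lstsq:           ", np.round(beta_lstsq, 3))
```

In practice `np.linalg.lstsq` is usually the safer choice, because it stays stable when $X^T X$ is close to singular.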

ℹ️

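Key Insight: The OLS estimator is the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov assumptions. However, "best" here only means lowest variance among linear unbiased estimators. A biased estimator can still achieve lower prediction error, particularly when $p$ is large relative to $n$, and the guarantee no longer applies when the model is misspecified.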

The Bias-Variance Tradeoff

Under squared-error loss, the expected prediction error at a point $x$ decomposes into three components:

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

Bias measures how far off the model's predictions are from the true values on average. High bias indicates the model is too simple (underfitting). For an estimate $\hat{f}$ of the true function $f$:

$$\text{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$$

Variance measures how much the model's predictions change when trained on different training sets drawn from the same distribution. High variance indicates the model is too complex (overfitting):

$$\text{Variance}(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$

Irreducible error $\sigma^2$ represents noise that no model can eliminate.
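To see where the three terms come from, here is a short derivation sketch. It assumes $y = f(x) + \epsilon$ with $E[\epsilon] = 0$, $\text{Var}(\epsilon) = \sigma^2$, and a new observation whose noise is independent of the fitted model $\hat{f}$; expectations are taken over training sets and noise.

```latex
\begin{aligned}
E\big[(y - \hat{f}(x))^2\big]
  &= E\big[(f(x) - \hat{f}(x) + \epsilon)^2\big] \\
  &= E\big[(f(x) - \hat{f}(x))^2\big]
     + 2\,E[\epsilon]\,E\big[f(x) - \hat{f}(x)\big]
     + E[\epsilon^2] \\
  &= E\big[(f(x) - \hat{f}(x))^2\big] + \sigma^2 \\
  &= \big(E[\hat{f}(x)] - f(x)\big)^2
     + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]
     + \sigma^2 \\
  &= \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
\end{aligned}
```

The fourth line adds and subtracts $E[\hat{f}(x)]$ inside the square; the cross term vanishes because $E\big[\hat{f}(x) - E[\hat{f}(x)]\big] = 0$.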

⚠️

Common Interview Trap: Many candidates explain bias and variance separately but fail to articulate the tradeoff. As model complexity increases, bias decreases but variance increases. The optimal model minimizes the total error, not just one component.

Visual Intuition

Think of a dartboard analogy:

  • Low Bias, Low Variance = Darts clustered at the bullseye (ideal)
  • Low Bias, High Variance = Darts scattered but centered on bullseye
  • High Bias, Low Variance = Darts clustered but far from bullseye
  • High Bias, High Variance = Darts scattered and far from bullseye (worst)

How Regularization Addresses Overfitting

When we have many features or correlated features, OLS can produce unstable estimates with high variance. Regularization adds a penalty term to the loss function to constrain model complexity.

Ridge Regression (L2 Regularization)

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\}$$

The closed-form solution becomes:

$$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$

Key properties:

  • Shrinks coefficients toward zero but never exactly to zero
  • Handles multicollinearity by distributing weight among correlated features
  • The tuning parameter $\lambda \geq 0$ controls the strength of regularization
  • When $\lambda \to \infty$, all coefficients $\to 0$
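The properties above can be checked numerically with a small sketch of the ridge closed form. The helper `ridge_fit`, the two nearly identical features, and the omission of an intercept (the data is generated around zero) are all assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients from the closed form (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two almost perfectly correlated features; only their shared signal matters.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

lambdas = [0.0, 1.0, 100.0, 1e6]
norms = []
for lam in lambdas:
    beta = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(f"lambda={lam:>9}: beta={np.round(beta, 4)}, norm={norms[-1]:.4f}")
```

With $\lambda = 0$ the two coefficients can be large and unstable because the features are nearly collinear. As $\lambda$ grows, the weight is shared between the two features and the coefficient norm shrinks toward zero, although no coefficient becomes exactly zero.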

Lasso Regression (L1 Regularization)

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}$$

Key properties:

  • Can shrink coefficients exactly to zero (sparse solutions)
  • Performs automatic feature selection
  • The $L_1$ penalty creates a diamond-shaped constraint region whose corners lie on the axes, so the solution often lands where some coefficients are exactly zero
  • No closed-form solution in general; typically solved with coordinate descent
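To illustrate how coordinate descent produces exact zeros, here is a minimal sketch for the objective $\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$ as written above. The function names, the fixed number of sweeps, and the synthetic data are assumptions; production solvers add convergence checks and other refinements.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)   # x_j^T x_j for each column
    resid = y - X @ beta            # current residual
    for _ in range(n_sweeps):
        for j in range(p):
            # Add feature j's contribution back to get the partial residual.
            resid += X[:, j] * beta[j]
            rho = X[:, j] @ resid
            # One-dimensional minimizer: soft-threshold at lam / 2, then rescale.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

# Ten features, of which only the first three carry signal (illustrative).
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -3.0, 1.5]
y = X @ beta_true + rng.normal(scale=0.1, size=n)

lam = 40.0
beta_hat = lasso_cd(X, y, lam)
print(np.round(beta_hat, 3))
```

The threshold is $\lambda / 2$ rather than $\lambda$ because the squared-error term here carries no factor of $1/2$. Any coordinate whose correlation with the partial residual falls inside the threshold is set to exactly zero, which is the source of sparsity.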

Elastic Net (L1 + L2 Combined)

$$\hat{\beta}_{\text{elastic}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \right\}$$

💡

When to use Elastic Net: When you have many correlated features and want both feature selection (from L1) and the grouping effect of L2, where correlated features tend to be kept or dropped together. Netflix famously uses Elastic Net for their recommendation system feature selection.

L1 vs L2: When to Choose Which?

| Criterion | L1 (Lasso) | L2 (Ridge) |
| --- | --- | --- |
| Feature Selection | Yes (sparse solutions) | No |
| Multicollinearity | Tends to select one feature from a correlated group | Distributes weight roughly evenly across the group |
| Interpretability | Higher (fewer features) | Lower |
| Computation | Slower (no closed-form) | Faster (closed-form) |
| When $p > n$ | Selects at most $n$ features | Can include all features |
| Solution Uniqueness | May not be unique if $p > n$ | Always unique for $\lambda > 0$ |
| Geometric Interpretation | Diamond constraint (corners on axes) | Circular constraint (no corners) |

Code Implementation

Explanation of Code

The code above demonstrates:

  1. Data Generation: Creates a synthetic dataset with 50 features but only 10 truly informative, simulating real-world scenarios where many features are irrelevant.

  2. Model Comparison: Shows how OLS, Ridge, Lasso, and Elastic Net perform differently on the same data. Notice Lasso produces sparse solutions while Ridge keeps all features.

  3. Cross-Validation for λ Selection: Demonstrates how to find the optimal regularization strength by balancing training and validation performance.

  4. Coefficient Shrinkage Paths: Visualizes how L1 and L2 regularization differently shrink coefficients as λ increases.

  5. Bias-Variance Decomposition: Uses bootstrap to directly estimate bias² and variance at different regularization strengths.
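A minimal sketch of the data generation and model comparison steps might look like the following, assuming scikit-learn is available. The sample size, noise level, and `alpha` values are illustrative assumptions, not tuned settings. Note that scikit-learn's `alpha` plays the role of $\lambda$, but its Lasso and Elastic Net objectives scale the squared-error term by $1/(2n)$, so the numbers are not directly comparable to the formulas above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 50 features, only 10 of which influence the target.
X, y = make_regression(
    n_samples=200, n_features=50, n_informative=10, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
        "n_nonzero": int(np.count_nonzero(model.coef_)),
    }
    print(f"{name:>10}: test MSE={results[name]['test_mse']:.1f}, "
          f"non-zero coefficients={results[name]['n_nonzero']}")
```

Counting non-zero coefficients makes the contrast concrete: OLS and Ridge keep all 50 features, while Lasso drops some of them entirely.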


Real-World Applications

Google: Ad Click Prediction

Google uses regularized linear models (specifically FTRL-Proximal) for online ad click prediction. The sparsity induced by L1 regularization is crucial for serving billions of predictions per second with limited memory.

Amazon: Dynamic Pricing

Amazon's pricing algorithms use Ridge regression to handle the multicollinearity between features like competitor price, demand, and inventory levels.

Finance: Risk Models

Banks use Elastic Net for credit scoring where:

  • L1 selects the most predictive financial ratios
  • L2 handles the natural correlation between financial metrics
  • The resulting model must be interpretable for regulatory compliance

💡

Production Tip: In production systems, always standardize your features before applying regularization. The penalty is applied to coefficient magnitudes, so features on different scales will be penalized unequally.
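The effect of feature scale on the penalty can be seen in a small sketch, assuming scikit-learn. The two features below have the same effect per standard deviation but live on very different scales; the data, the `alpha` value, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
x_small = rng.normal(size=n)            # unit scale
x_large = rng.normal(size=n) * 1000.0   # same kind of signal, larger units
X = np.column_stack([x_small, x_large])
# Each feature contributes about 2.0 per standard deviation to y.
y = 2.0 * x_small + 0.002 * x_large + rng.normal(scale=0.1, size=n)

# Without scaling, the small-scale feature needs a larger coefficient,
# so the penalty hits it much harder.
raw = Ridge(alpha=1000.0).fit(X, y)
raw_effect = raw.coef_ * X.std(axis=0)  # effect per standard deviation

# With scaling inside a pipeline, both features are penalized alike.
scaled = make_pipeline(StandardScaler(), Ridge(alpha=1000.0)).fit(X, y)
scaled_coef = scaled.named_steps["ridge"].coef_

print("unscaled, effect per std:", np.round(raw_effect, 3))
print("scaled coefficients:     ", np.round(scaled_coef, 3))
```

Putting the scaler inside the pipeline also means its mean and standard deviation are learned from the training data only, which avoids leaking information during cross-validation.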


Common Follow-Up Questions

Q1: What happens to the bias-variance tradeoff as you increase λ?

As λ increases:

  • Bias increases: The model becomes more constrained, moving away from the true function
  • Variance decreases: The model becomes more stable across different training sets
  • Optimal λ minimizes the total prediction error
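The trend can be checked with a small Monte Carlo sketch using the ridge closed form. It assumes a fixed training design with only the noise redrawn on each repetition, a known linear truth, and illustrative sizes; `ridge_fit` and `bias_variance` are hypothetical helper names.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients from the closed form (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def bias_variance(lam, n_train=30, p=20, n_reps=300, sigma=1.0, seed=0):
    """Estimate average squared bias and variance of ridge predictions."""
    rng = np.random.default_rng(seed)
    beta_true = rng.normal(size=p)
    X_train = rng.normal(size=(n_train, p))  # design held fixed across repetitions
    X_test = rng.normal(size=(200, p))
    f_test = X_test @ beta_true              # noiseless targets at the test points
    preds = np.empty((n_reps, X_test.shape[0]))
    for r in range(n_reps):
        # Each repetition is a new training set: same inputs, fresh noise.
        y = X_train @ beta_true + rng.normal(scale=sigma, size=n_train)
        preds[r] = X_test @ ridge_fit(X_train, y, lam)
    bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

lambdas = [0.0, 10.0, 100.0]
estimates = [bias_variance(lam) for lam in lambdas]
for lam, (b2, var) in zip(lambdas, estimates):
    print(f"lambda={lam:>6}: bias^2={b2:.3f}, variance={var:.3f}, sum={b2 + var:.3f}")
```

Because the same seed is reused for every $\lambda$, each setting sees identical data, so differences in the printed values come from the penalty alone.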

Q2: Can Lasso select more features than the number of samples?

No. When $p > n$, Lasso selects at most $n$ non-zero coefficients, where $n$ is the number of samples. This is a fundamental limitation that Elastic Net addresses.
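A quick empirical check of this limit, assuming scikit-learn's `Lasso`, is sketched below. The problem sizes, the `alpha` grid, and the tight solver tolerance are illustrative assumptions; scikit-learn's `alpha` corresponds to $\lambda$ up to a scaling of the objective.

```python
import numpy as np
from sklearn.linear_model import Lasso

# More features than samples: p = 100, n = 20.
rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ beta_true + rng.normal(scale=0.1, size=n)

counts = {}
for alpha in [1.0, 0.1, 0.01]:
    # Tight tolerance so that inactive coefficients settle at exactly zero.
    model = Lasso(alpha=alpha, max_iter=100_000, tol=1e-8).fit(X, y)
    counts[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha}: {counts[alpha]} non-zero coefficients (n={n})")
```

Weakening the penalty lets more features enter the model, but the count of non-zero coefficients should not exceed the number of samples.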

Q3: How do you choose between Ridge and Lasso in practice?

  1. If you believe most features are relevant → Ridge
  2. If you believe few features are relevant → Lasso
  3. If you're unsure → Elastic Net with cross-validation
  4. Always check the correlation structure of your features

Q4: What is the connection between regularization and Bayesian inference?

  • Ridge = Gaussian prior on coefficients: $\beta \sim \mathcal{N}(0, \tau^2 I)$
  • Lasso = Laplace prior on coefficients: $\beta \sim \text{Laplace}(0, b)$
  • The MAP estimate with these priors yields the regularized solutions
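The correspondence can be made explicit with a short derivation sketch. It assumes a Gaussian likelihood $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and independent priors on each coefficient; constants that do not depend on $\beta$ are dropped.

```latex
\begin{aligned}
\hat{\beta}_{\text{MAP}}
  &= \arg\max_{\beta} \; \log p(y \mid X, \beta) + \log p(\beta) \\[4pt]
\text{Gaussian prior:}\quad
\hat{\beta}_{\text{MAP}}
  &= \arg\min_{\beta} \; \frac{1}{2\sigma^2}\|y - X\beta\|_2^2
     + \frac{1}{2\tau^2}\|\beta\|_2^2 \\
  &= \arg\min_{\beta} \; \|y - X\beta\|_2^2
     + \frac{\sigma^2}{\tau^2}\|\beta\|_2^2
  \qquad\Rightarrow\quad \lambda = \frac{\sigma^2}{\tau^2} \\[4pt]
\text{Laplace prior:}\quad
\hat{\beta}_{\text{MAP}}
  &= \arg\min_{\beta} \; \frac{1}{2\sigma^2}\|y - X\beta\|_2^2
     + \frac{1}{b}\|\beta\|_1 \\
  &= \arg\min_{\beta} \; \|y - X\beta\|_2^2
     + \frac{2\sigma^2}{b}\|\beta\|_1
  \qquad\Rightarrow\quad \lambda = \frac{2\sigma^2}{b}
\end{aligned}
```

In both cases a tighter prior (smaller $\tau$ or $b$) corresponds to a larger $\lambda$, that is, stronger regularization.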

Company-Specific Tips

Google Interview Tips

  • Emphasize understanding of the mathematical derivation of the closed-form solution
  • Be prepared to discuss computational complexity when $p$ is very large
  • Mention SGD variants for online learning scenarios
  • Discuss FTRL-Proximal for sparse online learning

Amazon Interview Tips

  • Focus on business impact: How does regularization improve predictions on unseen data?
  • Discuss A/B testing regularized models in production
  • Be ready to explain why regularization reduces overfitting using the variance formula
  • Mention interpretability as a business requirement
