Google & Amazon Interview

Linear Regression: Bias-Variance Tradeoff & Regularization

Understanding the foundational algorithm behind predictive modeling

Interview Question

"Explain the bias-variance tradeoff in the context of linear regression. How does regularization (L1 vs L2) address overfitting, and when would you choose one over the other?"

Difficulty: Medium-Hard | Frequently asked at Google, Amazon, Meta

Theoretical Foundation

What is Linear Regression?

Linear regression models the relationship between a dependent variable $y$ and one or more independent variables $X$ by fitting a linear equation:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon

where $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_p$ are coefficients, and $\epsilon$ is the error term assumed to follow $\mathcal{N}(0, \sigma^2)$ .

The ordinary least squares (OLS) estimator minimizes the residual sum of squares:

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \arg\min_{\beta} \|y - X\beta\|_2^2

When $X^T X$ is invertible (and $X$ includes a column of ones for the intercept), the closed-form solution is:

\hat{\beta} = (X^T X)^{-1} X^T y
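
As a minimal sketch of the closed-form solution, the snippet below fits OLS on synthetic data with NumPy. The sample size, coefficients, and noise level are illustrative assumptions, not values from any real dataset. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, and cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X_raw = rng.normal(size=(n, p))
beta_true = np.array([2.0, 1.5, -0.5, 3.0])   # hypothetical values, intercept first
X = np.column_stack([np.ones(n), X_raw])      # column of ones models beta_0
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T y without an explicit inverse
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares, more robust when X^T X is ill-conditioned
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", np.round(beta_normal, 3))
print("lstsq:           ", np.round(beta_lstsq, 3))
```

In practice the `lstsq` route is usually the safer default, since the normal equations square the condition number of $X$.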

ℹ️

Key Insight: The OLS estimator is the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov assumptions. However, "best" doesn't mean optimal when the model is misspecified or when $p$ is large relative to $n$ .

The Bias-Variance Tradeoff

Under squared-error loss, the expected prediction error at a point $x$ can be decomposed into three components:

\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}

Bias measures how far the model's average prediction, taken over possible training sets, is from the true function. High bias indicates the model is too simple (underfitting). At a point $x$:

\text{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)

Variance measures how much the model's predictions change when it is trained on different training sets drawn from the same distribution. High variance indicates the model is too complex (overfitting):

\text{Variance}(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]

Irreducible error $\sigma^2$ represents noise that no model can eliminate.

⚠️

Common Interview Trap: Many candidates explain bias and variance separately but fail to articulate the tradeoff. As model complexity increases, bias decreases but variance increases. The optimal model minimizes the total error, not just one component.
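
The tradeoff can be checked numerically. The following is a rough simulation sketch, assuming a fixed design matrix, a known linear true function, and Gaussian noise (all hypothetical choices). It refits a ridge-penalized linear model on many simulated training sets and estimates bias² and variance at the design points for several penalty strengths. Here a larger `lam` means a less flexible model.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, sigma, n_trials = 50, 20, 1.0, 500
X = rng.normal(size=(n, p))        # fixed design, reused for every training set
beta_true = rng.normal(size=p)     # hypothetical "true" coefficients
f_true = X @ beta_true             # noiseless f(x) at the design points

# One row per simulated training set: same X, fresh noise
Y = f_true + rng.normal(scale=sigma, size=(n_trials, n))

lambdas = [0.01, 1.0, 10.0, 100.0, 1000.0]
results = {}
for lam in lambdas:
    # Fit all training sets at once: columns of `betas` are the estimates
    betas = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y.T)
    preds = (X @ betas).T                      # shape (n_trials, n)
    mean_pred = preds.mean(axis=0)             # estimate of E[f_hat(x)]
    bias_sq = np.mean((mean_pred - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    results[lam] = (bias_sq, variance)

for lam, (b2, var) in results.items():
    print(f"lambda={lam:>7}: bias^2={b2:.4f}  variance={var:.4f}  sum={b2 + var:.4f}")
```

The printed sum excludes the irreducible error $\sigma^2$, which is the same for every model.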

Visual Intuition

Think of a dartboard analogy:

Low Bias, Low Variance = Darts clustered at the bullseye (ideal)
Low Bias, High Variance = Darts scattered but centered on bullseye
High Bias, Low Variance = Darts clustered but far from bullseye
High Bias, High Variance = Darts scattered and far from bullseye (worst)

How Regularization Addresses Overfitting

When we have many features or correlated features, OLS can produce unstable estimates with high variance. Regularization adds a penalty term to the loss function to constrain model complexity.

Ridge Regression (L2 Regularization)

\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\}

The closed-form solution becomes:

\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y

Key properties:

Shrinks coefficients toward zero but, for finite $\lambda$, generally not exactly to zero
Handles multicollinearity by distributing weight among correlated features
The tuning parameter $\lambda \geq 0$ controls the strength of regularization
When $\lambda \to \infty$ , all coefficients $\to 0$
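
A small sketch of the ridge closed form, assuming centered synthetic data with no intercept and two nearly collinear columns (hypothetical setup). It shows the coefficient norm shrinking as $\lambda$ grows while individual coefficients stay non-zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)      # columns 0 and 1 nearly collinear
beta_true = np.array([1.0, 1.0, 0.5, 0.0, -2.0])   # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0, 1e4, 1e8]
norms = []
for lam in lambdas:
    beta = ridge_closed_form(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(f"lambda={lam:>12.1f}  ||beta||_2={norms[-1]:.6f}  beta={np.round(beta, 3)}")
```

With `lam=0.0` this reduces to OLS, where the two collinear coefficients are typically unstable; a modest penalty pulls them toward similar values.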

Lasso Regression (L1 Regularization)

\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}

Key properties:

Can shrink coefficients exactly to zero (sparse solutions)
Performs automatic feature selection
The $L_1$ penalty creates a diamond-shaped constraint region that tends to hit axes
No closed-form solution in general; typically solved with coordinate descent or proximal gradient methods
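
Coordinate descent for this objective can be sketched in a few lines. For the penalty $\lambda \|\beta\|_1$ with an unscaled squared loss, each coordinate update is a soft-thresholding step with threshold $\lambda / 2$. The function name `lasso_cd`, the fixed sweep count, and the synthetic data are assumptions for illustration; a production solver would add a convergence check and an intercept.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Minimize ||y - X beta||_2^2 + lam * ||beta||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # x_j^T x_j for each column
    residual = y - X @ beta
    for _ in range(n_iters):
        for j in range(p):
            residual += X[:, j] * beta[j]  # remove feature j's current contribution
            rho = X[:, j] @ residual       # correlation of x_j with partial residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual -= X[:, j] * beta[j]  # add the updated contribution back
    return beta

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]           # only three informative features (hypothetical)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 50.0
beta_hat = lasso_cd(X, y, lam)
print("estimated coefficients:", np.round(beta_hat, 3))
print("non-zero count:", int(np.count_nonzero(beta_hat)))
```

The exact zeros come from the `max(..., 0.0)` in the soft-threshold step, which is the algebraic counterpart of the diamond-shaped constraint touching an axis.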

Elastic Net (L1 + L2 Combined)

\hat{\beta}_{\text{elastic}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \right\}

💡

When to use Elastic Net: When you have many correlated features and want both feature selection (from the L1 term) and the grouping effect (from the L2 term), where strongly correlated features tend to enter or leave the model together.
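
The grouping effect is easiest to see with two identical columns. The sketch below extends coordinate descent to the Elastic Net objective; the function name `elastic_net_cd`, the penalty values, and the data are illustrative assumptions. With `lam2=0` (pure Lasso) the solution is not unique and this particular solver leaves all the weight on the first copy, whereas a positive `lam2` splits the weight equally.

```python
import numpy as np

def elastic_net_cd(X, y, lam1, lam2, n_iters=500):
    """Minimize ||y - X b||_2^2 + lam1 * ||b||_1 + lam2 * ||b||_2^2 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    residual = y.copy()
    for _ in range(n_iters):
        for j in range(p):
            residual += X[:, j] * beta[j]          # partial residual without feature j
            rho = X[:, j] @ residual
            shrunk = np.sign(rho) * max(abs(rho) - lam1 / 2.0, 0.0)
            beta[j] = shrunk / (col_sq[j] + lam2)  # L2 term enlarges the denominator
            residual -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
w = rng.normal(size=n)
X = np.column_stack([x, x, w])                     # columns 0 and 1 are identical
y = 3.0 * x + 1.0 * w + rng.normal(scale=0.5, size=n)

beta_lasso = elastic_net_cd(X, y, lam1=10.0, lam2=0.0)
beta_enet = elastic_net_cd(X, y, lam1=10.0, lam2=10.0)
print("lasso:      ", np.round(beta_lasso, 3))
print("elastic net:", np.round(beta_enet, 3))
```

Which copy Lasso keeps depends on the update order and starting point, which is one reason its selections among correlated features can look arbitrary.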

L1 vs L2: When to Choose Which?

Criterion	L1 (Lasso)	L2 (Ridge)
Feature Selection	Yes (sparse solutions)	No
Multicollinearity	Tends to keep one feature from a correlated group, somewhat arbitrarily	Spreads weight across correlated features
Interpretability	Higher (fewer features)	Lower
Computation	Slower (no closed-form)	Faster (closed-form)
When $p > n$	Selects at most $n$ features	Can include all features
Solution Uniqueness	May not be unique when $p > n$ or features are collinear	Unique for any $\lambda > 0$
Geometric Interpretation	Diamond constraint (corners on axes)	Circular constraint (no corners)

Code Implementation
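
A minimal sketch of the data-generation and model-comparison steps, assuming scikit-learn is available. The dataset sizes and `alpha` values are illustrative, not tuned. Note that scikit-learn's `alpha` is scaled differently from the $\lambda$ in the formulas above (its Lasso and Elastic Net objectives divide the squared loss by $2n$), so the numbers are not directly interchangeable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 features, only 10 of which actually influence y
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=5.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

summary = {}
for name, model in models.items():
    # Standardize inside the pipeline so the penalty treats features equally
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    coef = pipe.steps[-1][1].coef_
    summary[name] = {
        "test_mse": mean_squared_error(y_test, pipe.predict(X_test)),
        "n_nonzero": int(np.sum(coef != 0)),
    }

for name, stats in summary.items():
    print(f"{name:<11} test MSE={stats['test_mse']:.1f}  non-zero coefs={stats['n_nonzero']}")
```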

Explanation of Code

The code above demonstrates:

Data Generation: Creates a synthetic dataset with 50 features but only 10 truly informative, simulating real-world scenarios where many features are irrelevant.
Model Comparison: Shows how OLS, Ridge, Lasso, and Elastic Net perform differently on the same data. Notice Lasso produces sparse solutions while Ridge keeps all features.
Cross-Validation for λ Selection: Demonstrates how to find the optimal regularization strength by choosing the value with the lowest cross-validated error rather than the lowest training error.
Coefficient Shrinkage Paths: Visualizes how L1 and L2 regularization differently shrink coefficients as λ increases.
Bias-Variance Decomposition: Uses bootstrap to directly estimate bias² and variance at different regularization strengths.
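
The cross-validation step can be sketched in plain NumPy as follows. This is an illustrative version, assuming ridge regression without an intercept, 5 folds, and a logarithmic grid of candidate values; the helper names `ridge_fit` and `kfold_cv_mse` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv_mse(X, y, lam, k=5, seed=0):
    """Average validation MSE of ridge over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return float(np.mean(errors))

# 50 features, 10 informative, mirroring the setup described above
rng = np.random.default_rng(7)
n, p = 80, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:10] = rng.normal(scale=2.0, size=10)
y = X @ beta_true + rng.normal(scale=2.0, size=n)

lambdas = np.logspace(-3, 4, 15)
cv_errors = [kfold_cv_mse(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv_errors))]

for lam, err in zip(lambdas, cv_errors):
    print(f"lambda={lam:>10.3f}  CV MSE={err:.3f}")
print("selected lambda:", best_lam)
```

Using the same fold split for every candidate (fixed `seed`) keeps the comparison between values of λ fair.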

Real-World Applications

Google: Ad Click Prediction

Google has described using regularized logistic regression trained online with the FTRL-Proximal algorithm for ad click prediction. The sparsity induced by L1 regularization keeps the served model small, which is crucial when predictions must be made at very large scale with limited memory.
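
For intuition, here is a compact sketch of the per-coordinate FTRL-Proximal update for logistic regression, written with dense NumPy vectors. The class name, hyperparameter defaults, and toy data stream are assumptions for illustration; a real system would use sparse feature vectors and hashed feature indices.

```python
import numpy as np

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic loss with L1 and L2 penalties."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(dim)   # accumulated (adjusted) gradients
        self.n = np.zeros(dim)   # accumulated squared gradients

    def weights(self):
        """Closed-form weights; coordinates with |z_i| <= l1 are exactly zero."""
        w = np.zeros_like(self.z)
        active = np.abs(self.z) > self.l1
        denom = (self.beta + np.sqrt(self.n[active])) / self.alpha + self.l2
        w[active] = -(self.z[active] - np.sign(self.z[active]) * self.l1) / denom
        return w

    def update(self, x, y):
        """Process one example (x, y in {0, 1}); returns the pre-update prediction."""
        w = self.weights()
        p = 1.0 / (1.0 + np.exp(-np.clip(x @ w, -35.0, 35.0)))
        g = (p - y) * x                                  # gradient of the log loss
        sigma = (np.sqrt(self.n + g ** 2) - np.sqrt(self.n)) / self.alpha
        self.z += g - sigma * w
        self.n += g ** 2
        return p

# Toy stream: only feature 0 determines the label (hypothetical data)
rng = np.random.default_rng(4)
dim = 5
model = FTRLProximal(dim)
frozen = FTRLProximal(dim, l1=1e9)   # penalty so large that nothing activates
for _ in range(2000):
    x = rng.normal(size=dim)
    y = float(x[0] > 0)
    model.update(x, y)
    frozen.update(x, y)

w_learned = model.weights()
print("learned weights:", np.round(w_learned, 3))
```

The memory saving comes from the `weights` step: any coordinate whose accumulated gradient never exceeds `l1` in magnitude has a weight of exactly zero and need not be stored in the served model.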

Amazon: Dynamic Pricing

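Pricing models of the kind associated with Amazon combine strongly correlated features such as competitor price, demand, and inventory levels. Ridge regression is a natural fit here because it stabilizes the coefficients under this multicollinearity.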

Finance: Risk Models

Banks use Elastic Net for credit scoring where:

L1 selects the most predictive financial ratios
L2 handles the natural correlation between financial metrics
The resulting model must be interpretable for regulatory compliance

💡

Production Tip: In production systems, always standardize your features before applying regularization. The penalty is applied to coefficient magnitudes, so features on different scales will be penalized unequally.
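
A quick way to see why scaling matters: in the sketch below (synthetic data, hypothetical helper names), the same feature is expressed in two different units. Ridge predictions change with the unit choice on raw features but agree once the features are standardized.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^{-1} X^T y, no intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def standardize(X):
    """Zero mean, unit variance per column (use training statistics in practice)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(5)
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.1, size=n)
y = y - y.mean()                              # center the target

# Same information, but feature 2 is expressed in units 1000x larger (e.g. m -> km)
X_rescaled = X * np.array([1.0, 0.001])

lam = 10.0
pred_raw = X @ ridge_fit(X, y, lam)
pred_rescaled = X_rescaled @ ridge_fit(X_rescaled, y, lam)

Xs_a, Xs_b = standardize(X), standardize(X_rescaled)
pred_std_a = Xs_a @ ridge_fit(Xs_a, y, lam)
pred_std_b = Xs_b @ ridge_fit(Xs_b, y, lam)

print("raw features, max prediction gap:         ", np.max(np.abs(pred_raw - pred_rescaled)))
print("standardized features, max prediction gap:", np.max(np.abs(pred_std_a - pred_std_b)))
```

The rescaled feature needs a coefficient 1000 times larger to have the same effect, so the penalty suppresses it almost entirely unless the features are standardized first.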

Common Follow-Up Questions

Q1: What happens to the bias-variance tradeoff as you increase λ?

As λ increases:

Bias increases: The model becomes more constrained, moving away from the true function
Variance decreases: The model becomes more stable across different training sets
Optimal λ minimizes the total prediction error

Q2: Can Lasso select more features than the number of samples?

No. When $p > n$ , Lasso yields at most $n$ non-zero coefficients (where $n$ is the number of samples) before it saturates. This is a fundamental limitation that Elastic Net addresses: its L2 term makes the objective strictly convex, so more than $n$ features can enter the model.

Q3: How do you choose between Ridge and Lasso in practice?

If you believe most features are relevant → Ridge
If you believe few features are relevant → Lasso
If you're unsure → Elastic Net with cross-validation
Always check the correlation structure of your features

Q4: What is the connection between regularization and Bayesian inference?

Ridge = Gaussian prior on coefficients: $\beta \sim \mathcal{N}(0, \tau^2 I)$
Lasso = Laplace prior on coefficients: $\beta \sim \text{Laplace}(0, b)$
The MAP estimate with these priors yields the regularized solutions
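
A short derivation makes the link to $\lambda$ explicit. It assumes the Gaussian likelihood $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ from the model definition and independent priors on the coefficients; the constants below follow from matching the negative log-posterior to the Ridge and Lasso objectives as written earlier (without a factor of $1/2$ on the squared loss).

```latex
% Ridge: Gaussian prior beta ~ N(0, tau^2 I)
\begin{aligned}
-\log p(\beta \mid y, X)
  &= \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{2\tau^2}\|\beta\|_2^2 + \text{const} \\
\hat{\beta}_{\text{MAP}}
  &= \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \frac{\sigma^2}{\tau^2}\|\beta\|_2^2 \right\}
  \quad\Rightarrow\quad \lambda = \frac{\sigma^2}{\tau^2}
\end{aligned}

% Lasso: Laplace prior p(beta_j) = (1 / 2b) exp(-|beta_j| / b)
\begin{aligned}
-\log p(\beta \mid y, X)
  &= \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{b}\|\beta\|_1 + \text{const} \\
\hat{\beta}_{\text{MAP}}
  &= \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \frac{2\sigma^2}{b}\|\beta\|_1 \right\}
  \quad\Rightarrow\quad \lambda = \frac{2\sigma^2}{b}
\end{aligned}
```

In both cases a tighter prior (smaller $\tau^2$ or $b$) or noisier data (larger $\sigma^2$) corresponds to stronger regularization.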

Company-Specific Tips

Google Interview Tips

Emphasize understanding of the mathematical derivation of the closed-form solution
Be prepared to discuss computational complexity when $p$ is very large
Mention SGD variants for online learning scenarios
Discuss FTRL-Proximal for sparse online learning

Amazon Interview Tips

Focus on business impact: How does regularization improve predictions on unseen data?
Discuss A/B testing regularized models in production
Be ready to explain why regularization reduces overfitting using the variance formula
Mention interpretability as a business requirement
