Regularization: L1, L2, Elastic Net & Dropout
Techniques to prevent overfitting and improve generalization
Interview Question
"Explain the difference between L1 and L2 regularization from both geometric and Bayesian perspectives. What is Elastic Net and when would you use it? How does dropout work in neural networks?"
Difficulty: Hard | Frequently asked at OpenAI, Anthropic, Google
Theoretical Foundation
Why Regularization?
Without regularization, models can overfit by learning noise, assigning large weights to irrelevant features, and creating overly complex decision boundaries. Regularization adds constraints to prevent this.
L2 Regularization (Ridge)
Mathematical Formulation
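For linear regression, the ridge objective is commonly written as below. The notation is an assumption made here for illustration: $X \in \mathbb{R}^{n \times p}$ is the design matrix, $y$ the targets, $w$ the coefficients, and $\lambda \ge 0$ the penalty strength.

$$
\hat{w}_{\text{ridge}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2,
\qquad \|w\|_2^2 = \sum_{j=1}^{p} w_j^2
$$

An equivalent constrained form minimizes $\|y - Xw\|_2^2$ subject to $\|w\|_2^2 \le t$, where each $\lambda$ corresponds to some budget $t$. The constrained form is the one the geometric picture refers to.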
Geometric Interpretation
The L2 constraint region is a ball bounded by a hypersphere. The solution is the point where the loss contours first touch this ball. Because the boundary is smooth and has no corners, the contact point typically has all coordinates nonzero, so it rarely lands on an axis.
Bayesian Interpretation
L2 regularization corresponds to maximum a posteriori (MAP) estimation under a zero-mean Gaussian prior on the coefficients.
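A sketch of the standard argument, assuming a Gaussian noise model with variance $\sigma^2$ and an independent prior variance $\tau^2$ on each coefficient (both symbols are introduced here for illustration):

$$
y \mid X, w \sim \mathcal{N}(Xw, \sigma^2 I), \qquad w_j \sim \mathcal{N}(0, \tau^2)
$$

$$
-\log p(w \mid X, y) = \frac{1}{2\sigma^2}\|y - Xw\|_2^2 + \frac{1}{2\tau^2}\|w\|_2^2 + \text{const}
$$

Multiplying by $2\sigma^2$ shows that the MAP estimate minimizes $\|y - Xw\|_2^2 + \lambda\|w\|_2^2$ with $\lambda = \sigma^2 / \tau^2$. A tighter prior (smaller $\tau^2$) means stronger regularization.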
Properties
- Shrinks coefficients toward zero but, in general, not exactly to zero
- Handles multicollinearity by spreading weight across correlated features instead of picking one
- Differentiable everywhere
- Closed-form solution exists for linear regression
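The closed-form and multicollinearity properties above can be sketched with NumPy. This is a minimal illustration, assuming the objective $\|y - Xw\|_2^2 + \lambda\|w\|_2^2$; the function name `ridge_closed_form` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve min_w ||y - Xw||^2 + lam * ||w||^2 via the normal equations."""
    n_features = X.shape[1]
    # (X^T X + lam * I) w = X^T y; solve() avoids forming an explicit inverse.
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# Make column 1 an exact copy of column 0 to simulate perfect collinearity.
X[:, 1] = X[:, 0]
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=50)

w = ridge_closed_form(X, y, lam=1.0)
print(w)  # the two duplicated columns receive (numerically) equal weights
```

With duplicated columns, $X^\top X$ alone is singular, so ordinary least squares has no unique solution. Adding $\lambda I$ makes the system invertible, and by symmetry the ridge solution splits the shared effect equally between the two copies.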
L1 Regularization (Lasso)
Mathematical Formulation
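For linear regression, the lasso objective is commonly written as below. The notation is assumed here for illustration: $X$ is the design matrix, $y$ the targets, $w$ the coefficients, and $\lambda \ge 0$ the penalty strength.

$$
\hat{w}_{\text{lasso}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_1,
\qquad \|w\|_1 = \sum_{j=1}^{p} |w_j|
$$

The equivalent constrained form minimizes $\|y - Xw\|_2^2$ subject to $\|w\|_1 \le t$.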
Geometric Interpretation
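The L1 constraint region is a cross-polytope: a diamond in 2D and an octahedron in 3D. (A hypercube is the L∞ ball, not the L1 ball.) The loss contours often first touch this region at a corner or edge. Corners lie on the coordinate axes, where all but one coefficient is zero, and this is what promotes sparsity.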
Bayesian Interpretation
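L1 regularization corresponds to maximum a posteriori (MAP) estimation under a zero-mean Laplace prior.
The Laplace density has a sharp, non-differentiable peak at zero (a Gaussian also peaks at zero, but smoothly), so the MAP estimate can be exactly zero, which encourages sparsity.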
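A sketch of the standard argument, assuming Gaussian noise with variance $\sigma^2$ and an independent Laplace prior with scale $b$ on each coefficient (both symbols are introduced here for illustration):

$$
y \mid X, w \sim \mathcal{N}(Xw, \sigma^2 I), \qquad p(w_j) = \frac{1}{2b} \exp\!\left(-\frac{|w_j|}{b}\right)
$$

$$
-\log p(w \mid X, y) = \frac{1}{2\sigma^2}\|y - Xw\|_2^2 + \frac{1}{b}\|w\|_1 + \text{const}
$$

Multiplying by $2\sigma^2$ shows that the MAP estimate minimizes $\|y - Xw\|_2^2 + \lambda\|w\|_1$ with $\lambda = 2\sigma^2 / b$. Note that the sparsity is a property of the MAP point estimate; samples from the full posterior are not sparse.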
Properties
- Can shrink coefficients exactly to zero (sparse solutions)
- Performs automatic feature selection
- Not differentiable at zero
- No closed-form solution in general (typically solved with coordinate descent)
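Each coordinate descent update for Lasso reduces to a one-dimensional problem whose minimizer is the soft-thresholding operator. The sketch below contrasts it with the corresponding one-dimensional ridge update; the function names and the input vector `z` are hypothetical, chosen only to show the difference in behavior.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise minimizer of 0.5 * (w - z)**2 + t * |w| over w."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, t):
    """Elementwise minimizer of 0.5 * (w - z)**2 + 0.5 * t * w**2 over w."""
    return z / (1.0 + t)

# Hypothetical unregularized coefficients: two large, two small.
z = np.array([3.0, 0.4, -0.2, -1.5])

w_l1 = soft_threshold(z, t=0.5)  # small entries are set exactly to zero
w_l2 = ridge_shrink(z, t=0.5)    # every entry is scaled down, none reaches zero
print(w_l1)
print(w_l2)
```

The L1 update subtracts a fixed amount and clips at zero, so any coefficient with magnitude below the threshold disappears. The L2 update rescales proportionally, so a nonzero input always yields a nonzero output.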
L1 vs L2 Comparison
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Sparsity | Yes (feature selection) | No |
| Geometry | Diamond (corners on axes) | Sphere (no corners) |
| Prior | Laplace | Gaussian |
| Differentiability | Not at zero | Everywhere |
| Closed-form | No | Yes |
| Multicollinearity | Selects one feature | Distributes weight |
⚠️
Interview Trap: Don't just say "L1 does feature selection." Explain why geometrically (the diamond's corners lie on the axes) and probabilistically (the Laplace prior has a sharp, non-differentiable peak at zero, so the MAP estimate can be exactly zero).
Elastic Net
Elastic Net adds both an L1 penalty and an L2 penalty to the loss.
Use Elastic Net when you have many correlated features and want both feature selection (L1) and stability (L2).
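One common way to write the Elastic Net objective for linear regression uses two separate penalty strengths. The symbols $\lambda_1$ and $\lambda_2$ are assumed here for illustration:

$$
\hat{w}_{\text{enet}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2
$$

Setting $\lambda_2 = 0$ recovers Lasso and setting $\lambda_1 = 0$ recovers Ridge. Libraries often reparameterize this with one overall strength and one mixing ratio; scikit-learn's `ElasticNet`, for example, exposes `alpha` and `l1_ratio`.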
Dropout (Neural Networks)
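During training, randomly zero out each neuron's activation with probability p and scale the remaining activations by 1/(1 - p), so the expected activation is unchanged (this variant is known as inverted dropout). At inference, use all neurons with no extra scaling.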
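A minimal NumPy sketch of inverted dropout, assuming a drop probability `p` and a keep-scale of `1 / (1 - p)`. The function name `dropout` and its signature are hypothetical, not a framework API.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: drop each unit with probability p during training."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Inference: all units are used and no scaling is needed.
        return x
    # Each unit is kept independently with probability 1 - p.
    keep_mask = rng.random(x.shape) >= p
    # Rescale survivors so the expected activation matches inference.
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))

out_train = dropout(x, p=0.5, training=True, rng=rng)
out_eval = dropout(x, p=0.5, training=False, rng=rng)
print(out_train.mean())  # close to 1.0 on average, not exactly 1.0
```

Scaling at training time, rather than at inference time, keeps the inference path free of any dropout-specific logic.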
Dropout works by:
- Ensemble Effect: Approximates training an ensemble of sub-networks that share weights
- Reduced Co-adaptation: Forces neurons to learn robust features
- Noise Injection: Acts as implicit data augmentation
💡
OpenAI Interview Tip: Dropout can be interpreted as approximate Bayesian inference in deep Gaussian processes.
Code Implementation
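The sketch below implements cyclic coordinate descent for an Elastic Net objective in NumPy, which covers Ridge and Lasso as special cases. It assumes the objective $\|y - Xw\|_2^2 + \lambda_1\|w\|_1 + \lambda_2\|w\|_2^2$ and a design matrix with no all-zero columns. The names `elastic_net_cd`, `lam1`, and `lam2`, and the synthetic data, are hypothetical, and the fixed iteration count stands in for a proper convergence check.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise minimizer of 0.5 * (w - z)**2 + t * |w| over w."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_iters=200):
    """Cyclic coordinate descent for
    min_w ||y - Xw||^2 + lam1 * ||w||_1 + lam2 * ||w||_2^2.
    """
    n_features = X.shape[1]
    w = np.zeros(n_features)
    col_sq_norms = (X ** 2).sum(axis=0)
    residual = y - X @ w
    for _ in range(n_iters):
        for j in range(n_features):
            # Add back feature j's contribution to get the partial residual.
            residual += X[:, j] * w[j]
            rho = X[:, j] @ residual
            # One-dimensional minimizer: soft-threshold (L1), then rescale (L2).
            w[j] = soft_threshold(rho, lam1 / 2.0) / (col_sq_norms[j] + lam2)
            residual -= X[:, j] * w[j]
    return w

# Synthetic data: only the first two of six features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

w_ridge = elastic_net_cd(X, y, lam1=0.0, lam2=10.0)   # pure L2
w_lasso = elastic_net_cd(X, y, lam1=40.0, lam2=0.0)   # pure L1
w_enet = elastic_net_cd(X, y, lam1=40.0, lam2=10.0)   # both penalties

print("ridge:", np.round(w_ridge, 3))
print("lasso:", np.round(w_lasso, 3))
print("enet: ", np.round(w_enet, 3))
```

With `lam1=0.0` the update reduces to a plain rescaling and the result should agree with the ridge closed form. With `lam2=0.0` it is the Lasso update, which can set irrelevant coefficients exactly to zero.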
Real-World Applications
OpenAI: Large Language Models
- Weight Decay: L2 regularization in transformer training
- Dropout: Preventing overfitting in attention layers
- Early Stopping: Stopping training when validation loss stops improving
Anthropic: AI Safety
- Robustness: Regularizing against adversarial examples
- Interpretability: Sparse models are more interpretable
- Generalization: Ensuring models work on unseen distributions
Common Follow-Up Questions
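Q1: Why does L1 produce sparse solutions? Geometrically, the L1 constraint region has corners on the axes, and the loss contours often first touch the region at a corner, where some coefficients are exactly zero. Probabilistically, the Laplace prior has a sharp, non-differentiable peak at zero, so the MAP estimate can be exactly zero.
Q2: When should you use Elastic Net? When you have many correlated features, want both feature selection and stability, or when the number of features exceeds the number of samples (p > n), where Lasso alone selects at most n features.
Q3: How does dropout relate to bagging? Dropout samples a different sub-network at each training step, which approximates training an ensemble. Unlike bagging, the sub-networks share weights and none is trained to convergence on its own; using the full network at inference approximates averaging their predictions.
Q4: What is the relationship between regularization and model complexity? Regularization reduces effective model complexity by constraining the parameter space: it trades a small increase in bias for a reduction in variance.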