🎯 The Interview Question
"Compare SGD with momentum, Adam, and AdamW optimizers. What are the mathematical formulations of each? Why is AdamW preferred over Adam for training Transformers? Explain learning rate schedules and their importance. When would you choose SGD over Adam?"
This question tests a fundamental understanding of how deep learning models are trained, which is essential for interviews at OpenAI and DeepMind.
📚 Detailed Answer
SGD with Momentum
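Standard SGD:
$$\theta_{t+1} = \theta_t - \eta \, g_t, \qquad g_t = \nabla_\theta L(\theta_t)$$
With Momentum:
$$v_t = \beta v_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \eta \, v_t$$
where $\eta$ is the learning rate, $v_t$ is the velocity (with $v_0 = 0$), and the momentum coefficient $\beta$ is typically 0.9.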
Effect: Momentum accumulates past gradients, providing:
- Faster convergence in consistent gradient directions
- Dampening of oscillations
- Ability to escape shallow local minima
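A minimal sketch of the momentum update, in plain Python on lists of floats. The function name `sgd_momentum_step`, the toy objective f(x) = x^2, and the chosen learning rate are illustrative assumptions, not part of the original answer.

```python
def sgd_momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum update (hypothetical helper for illustration)."""
    # Accumulate past gradients into the velocity.
    new_velocity = [beta * v + g for v, g in zip(velocity, grad)]
    # Move the parameters against the accumulated direction.
    new_theta = [t - lr * v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


# Toy problem: minimize f(x) = x^2, whose gradient is 2x, starting from x = 5.
theta, velocity = [5.0], [0.0]
for _ in range(200):
    grad = [2.0 * theta[0]]
    theta, velocity = sgd_momentum_step(theta, velocity, grad, lr=0.05, beta=0.9)

print(theta[0])  # close to the minimum at 0
```

Some libraries use a damped variant that scales the gradient by (1 - beta); the two forms differ only by a rescaling of the effective learning rate.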
💡 SGD with momentum is often preferred for computer vision tasks (for example, training ResNets) because it tends to find flatter minima that generalize better. The learning rate schedule is critical: use cosine annealing or step decay.
Adam (Adaptive Moment Estimation)
Adam combines momentum with adaptive learning rates:
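First moment (mean of the gradients $g_t$):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
Second moment (uncentered variance):
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
Bias correction:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Update:
$$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$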
Advantages:
- Adaptive learning rates per parameter
- Fast convergence on sparse gradients
- Works well with default hyperparameters
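The Adam update can be sketched in a few lines of plain Python. The helper name `adam_step` and the example gradients are assumptions made for illustration; the defaults follow the commonly cited values of 0.9, 0.999, and 1e-8.

```python
import math


def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Per-parameter step: each coordinate is scaled by its own sqrt(v_hat).
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


# Two parameters with gradients of very different magnitude.
theta, m, v = adam_step([1.0, -2.0], [0.0, 0.0], [0.0, 0.0], grad=[0.5, -40.0], t=1)
print(theta)
```

On the first step the bias-corrected update is close to `lr * sign(grad)`, so both parameters move by roughly `lr` even though their gradients differ by almost two orders of magnitude. This is the per-parameter adaptivity listed above.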
AdamW: Decoupled Weight Decay
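Adam conventionally implements weight decay as L2 regularization, adding the decay term to the gradient before the moment estimates are computed:
$$g_t = \nabla_\theta L(\theta_t) + \lambda \theta_t$$
This couples weight decay with the adaptive learning rate: the decay term is divided by $\sqrt{\hat{v}_t} + \epsilon$ along with the rest of the gradient, so the effective decay differs from parameter to parameter and no longer matches the intended coefficient $\lambda$.
AdamW fixes this:
$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$
where $\hat{m}_t$ and $\hat{v}_t$ are Adam's bias-corrected moment estimates computed from the loss gradient alone. The weight decay term $\lambda \theta_t$ is applied directly to the weights, not through the gradient.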
Why AdamW is better:
- Proper decoupling of weight decay
- Better generalization
- Standard for Transformer training
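To make the difference concrete, the sketch below runs a single scalar update both ways. The function `adam_like_step` and its `decoupled` flag are hypothetical names for this illustration, and the loss gradient is set to zero so that only the weight decay acts.

```python
import math


def adam_like_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01, decoupled=True):
    """Scalar Adam step with either L2-style or decoupled (AdamW) decay."""
    if not decoupled:
        # L2 regularization: the decay term passes through the moment estimates.
        grad = grad + weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    step = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: the decay term bypasses the adaptive scaling.
        step = step + weight_decay * theta
    return theta - lr * step, m, v


# Zero loss gradient, so the only force on the weight is the decay.
l2_theta, _, _ = adam_like_step(10.0, 0.0, 0.0, grad=0.0, t=1, decoupled=False)
adamw_theta, _, _ = adam_like_step(10.0, 0.0, 0.0, grad=0.0, t=1, decoupled=True)
print(l2_theta, adamw_theta)
```

With L2-style decay the weight shrinks by about `lr` regardless of `weight_decay`, because the adaptive normalization cancels the coefficient. With decoupled decay it shrinks by `lr * weight_decay * theta`, as intended.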
Comparison Table
| Optimizer | Pros | Cons | Best For |
|---|---|---|---|
| SGD | Simple, good generalization | Slow convergence, sensitive to LR | Computer Vision |
| Adam | Fast convergence, adaptive LR | Can generalize poorly | Sparse gradients, NLP |
| AdamW | Proper weight decay, good generalization | Slightly more compute | Transformers |
| LAMB | Large batch training | Complex | Distributed training |
Learning Rate Schedules
Step Decay
Decays the LR by a factor $\gamma$ every $s$ steps: $\eta_t = \eta_0 \, \gamma^{\lfloor t / s \rfloor}$. Common: $\gamma = 0.1$, $s = 30$ epochs.
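As a small sketch, step decay is a one-line function of the step (or epoch) index. The name `step_decay_lr` and the default values are assumptions chosen for illustration.

```python
def step_decay_lr(step, base_lr=0.1, gamma=0.1, step_size=30):
    """Multiply the base LR by gamma once for every completed block of step_size steps."""
    return base_lr * gamma ** (step // step_size)


# Learning rate at a few epochs of a hypothetical 90-epoch run.
for epoch in (0, 29, 30, 60):
    print(epoch, step_decay_lr(epoch))
```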
Cosine Annealing
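Smooth decay from $\eta_{\max}$ to $\eta_{\min}$ over $T$ steps:
$$\eta_t = \eta_{\min} + \tfrac{1}{2} (\eta_{\max} - \eta_{\min}) \left( 1 + \cos\frac{\pi t}{T} \right)$$
A strong default for many tasks.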
Warmup + Cosine Annealing
Linear warmup for the first $T_w$ steps, then cosine decay over the remaining steps. Essential for Transformers.
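The combined schedule can be sketched as a pure function of the step index. The name `warmup_cosine_lr`, its parameters, and the choice to start warmup at `max_lr / warmup_steps` rather than at zero are assumptions; real training code often differs in these details.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr (0-based step)."""
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical run: 1000 steps with 100 warmup steps and a peak LR of 1e-3.
for step in (0, 99, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, total_steps=1000, warmup_steps=100, max_lr=1e-3))
```

Setting `warmup_steps=0` reduces this to plain cosine annealing.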
Cyclical Learning Rates
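Oscillate between bounds $\eta_{\min}$ and $\eta_{\max}$, for example with a triangular wave of half-cycle length $s$:
$$\eta_t = \eta_{\min} + (\eta_{\max} - \eta_{\min}) \max\left( 0,\; 1 - \left| \frac{t}{s} - 2c + 1 \right| \right), \qquad c = \left\lfloor 1 + \frac{t}{2s} \right\rfloor$$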
Can help escape local minima.
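One common form of cyclical schedule is the triangular policy, sketched below. The function name `triangular_clr` and the bounds used in the example are illustrative assumptions; other cyclical shapes exist.

```python
import math


def triangular_clr(step, min_lr, max_lr, half_cycle):
    """Triangular cyclical LR: rise for half_cycle steps, then fall for half_cycle steps."""
    # Index of the current cycle, starting at 1.
    cycle = math.floor(1 + step / (2 * half_cycle))
    # Distance from the peak of the current cycle, from 0 (peak) to 1 (trough).
    x = abs(step / half_cycle - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1 - x)


# One full cycle with a hypothetical half-cycle of 100 steps.
for step in (0, 50, 100, 150, 200):
    print(step, triangular_clr(step, min_lr=1e-4, max_lr=1e-3, half_cycle=100))
```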
Advanced Optimizers
LAMB (Layer-wise Adaptive Moments)
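For large batch training, LAMB rescales the Adam update of each layer by a trust ratio:
$$r_t = \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t, \qquad \theta_{t+1} = \theta_t - \eta \, \frac{\lVert \theta_t \rVert}{\lVert r_t \rVert} \, r_t$$
where the norms are computed separately over the parameters of each layer.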
Enables batch sizes up to 32K for BERT training.
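The layer-wise scaling can be sketched for a single layer as follows. The helper `lamb_layer_update` is hypothetical, it takes an already computed Adam direction as input, and it omits the norm-clipping function that the LAMB paper allows, so it should be read as a simplified illustration.

```python
import math


def lamb_layer_update(weights, adam_direction, lr=1e-3, weight_decay=0.01):
    """Apply a LAMB-style trust-ratio update to one layer (lists of floats)."""
    # Adam direction plus decoupled weight decay.
    r = [d + weight_decay * w for d, w in zip(adam_direction, weights)]
    w_norm = math.sqrt(sum(w * w for w in weights))
    r_norm = math.sqrt(sum(x * x for x in r))
    # Fall back to a ratio of 1 when either norm is zero.
    trust_ratio = w_norm / r_norm if w_norm > 0 and r_norm > 0 else 1.0
    new_weights = [w - lr * trust_ratio * x for w, x in zip(weights, r)]
    return new_weights, trust_ratio


new_w, ratio = lamb_layer_update([3.0, 4.0], [1.0, 0.0], lr=1e-3, weight_decay=0.0)
print(new_w, ratio)
```

The trust ratio makes the step size proportional to the norm of the layer's weights, which is one way to keep large-batch updates from overshooting in layers with small weights.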
Lion (Google Brain)
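Uses only the sign of the update direction, which is an interpolation between the momentum and the current gradient:
$$c_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad \theta_{t+1} = \theta_t - \eta \left( \operatorname{sign}(c_t) + \lambda \theta_t \right), \qquad m_t = \beta_2 m_{t-1} + (1 - \beta_2) g_t$$
Memory efficient (it stores one moment estimate per parameter instead of Adam's two), good for large models.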
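A minimal sketch of a Lion-style step in plain Python. The names `sign` and `lion_step` are illustrative, and the defaults of 0.9 and 0.99 for the two interpolation coefficients are the values commonly quoted for Lion; treat the learning rate in the example as arbitrary.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion-style update on lists of floats; only the momentum m is stored."""
    # The update direction is the sign of an interpolation of momentum and gradient.
    update = [sign(beta1 * mi + (1 - beta1) * g) for mi, g in zip(m, grad)]
    # Every coordinate moves by exactly lr, plus decoupled weight decay.
    new_theta = [p - lr * (u + weight_decay * p) for p, u in zip(theta, update)]
    # The momentum is refreshed with a second, slower interpolation coefficient.
    new_m = [beta2 * mi + (1 - beta2) * g for mi, g in zip(m, grad)]
    return new_theta, new_m


theta, m = lion_step([1.0, -1.0], [0.0, 0.0], grad=[0.3, -20.0], lr=0.01)
print(theta, m)
```

Because every coordinate moves by the same magnitude, Lion is usually run with a smaller learning rate than Adam-family optimizers.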
Practical Guidelines
Follow-Up Questions
Q: Why does Adam sometimes generalize worse than SGD? A: Adaptive methods can converge to sharp minima that have low training loss but poor generalization. SGD with momentum tends to find flatter minima.
Q: What is the relationship between learning rate and batch size? A: Linear scaling rule: when the batch size increases by a factor of $k$, increase the LR by a factor of $k$. Works for SGD; Adam is more robust to batch size changes.
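As a quick worked example of the linear scaling rule, the hypothetical helper below rescales a reference learning rate for a new batch size. The reference values (LR 0.1 at batch size 256) are assumptions for illustration only.

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: the LR grows in proportion to the batch size."""
    return base_lr * new_batch_size / base_batch_size


# Going from batch size 256 to 1024 is k = 4, so the LR is multiplied by 4.
print(scale_lr(0.1, 256, 1024))
```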
Q: How do you choose the number of warmup steps relative to total training steps? A: Warmup is typically 5-10% of total steps. Larger models and larger batch sizes need more warmup. Start with 2000-4000 steps for most tasks.