Optimizers: SGD, Adam, AdamW, Learning Rate Schedules — Asked at OpenAI & DeepMind

🎯 The Interview Question

"Compare SGD with momentum, Adam, and AdamW optimizers. What are the mathematical formulations of each? Why is AdamW preferred over Adam for training Transformers? Explain learning rate schedules and their importance. When would you choose SGD over Adam?"

This question is fundamental for understanding how deep learning models are trained — essential for OpenAI and DeepMind.

📚 Detailed Answer

SGD with Momentum

Standard SGD:

\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla \mathcal{L}(\mathbf{w}_t)

With Momentum:

\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla \mathcal{L}(\mathbf{w}_t)

\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{v}_t

where $\beta$ is typically 0.9.

Effect: Momentum accumulates past gradients, providing:

Faster convergence in consistent gradient directions
Dampening of oscillations
Ability to escape shallow local minima
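The two momentum equations above can be sketched in a few lines of plain Python. This is a minimal illustration on lists of floats, not a library implementation; the function name `sgd_momentum_step`, the toy quadratic loss, and the chosen `lr` are assumptions made for the example.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum step on lists of floats; returns (new_w, new_v)."""
    # v_t = beta * v_{t-1} + grad
    new_v = [beta * vi + gi for vi, gi in zip(v, grad)]
    # w_{t+1} = w_t - lr * v_t
    new_w = [wi - lr * vi for wi, vi in zip(w, new_v)]
    return new_w, new_v


# Hypothetical toy problem: L(w) = 0.5 * (w0^2 + 10 * w1^2), minimum at the origin.
def toy_grad(w):
    return [w[0], 10.0 * w[1]]


w, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    w, v = sgd_momentum_step(w, v, toy_grad(w), lr=0.02, beta=0.9)
```

After the loop, both coordinates of `w` should be close to zero even though the two directions have curvatures that differ by a factor of 10.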

💡

SGD with momentum is often preferred for computer vision tasks (training ResNets) because it tends to find flatter minima that generalize better. The learning rate schedule is critical — use cosine annealing or step decay.

Adam (Adaptive Moment Estimation)

Adam combines momentum with adaptive learning rates:

First moment (mean):

\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\nabla \mathcal{L}(\mathbf{w}_t)

Second moment (uncentered variance, squared element-wise):

\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)(\nabla \mathcal{L}(\mathbf{w}_t))^2

Bias correction:

\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}

Update:

\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
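Put together, the four Adam equations amount to the following per-parameter update. This is a dependency-free sketch on lists of floats; the function name `adam_step` and the default `lr=1e-3` are assumptions for illustration.

```python
import math


def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on lists of floats. t is the 1-based step count."""
    new_w, new_m, new_v = [], [], []
    for wi, mi, vi, gi in zip(w, m, v, grad):
        mi = beta1 * mi + (1 - beta1) * gi        # first moment
        vi = beta2 * vi + (1 - beta2) * gi * gi   # second moment (squared gradient)
        m_hat = mi / (1 - beta1 ** t)             # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v
```

On the first step (`t=1`, zero-initialized moments) bias correction gives $\hat{\mathbf{m}}_1 = \mathbf{g}_1$ and $\hat{\mathbf{v}}_1 = \mathbf{g}_1^2$, so every parameter moves by roughly $\eta$ regardless of how large or small its gradient is.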

Advantages:

Adaptive learning rates per parameter
Fast convergence on sparse gradients
Works well with default hyperparameters

AdamW: Decoupled Weight Decay

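When weight decay is implemented in Adam as L2 regularization, the penalty gradient $\lambda \mathbf{w}_t$ is added to the loss gradient before the moment estimates are computed:

\text{Adam L2: } \mathbf{g}_t = \nabla \mathcal{L}(\mathbf{w}_t) + \lambda \mathbf{w}_t, \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}

where $\hat{\mathbf{m}}_t$ and $\hat{\mathbf{v}}_t$ are built from $\mathbf{g}_t$. This couples weight decay with the adaptive learning rate: the decay term is divided by $\sqrt{\hat{\mathbf{v}}_t}$, so weights with large gradient magnitudes are regularized less than weights with small ones.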

AdamW fixes this:

\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} - \eta \lambda \mathbf{w}_t

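The weight decay term is applied directly to the weights, not through the gradient: $\hat{\mathbf{m}}_t$ and $\hat{\mathbf{v}}_t$ are computed from $\nabla \mathcal{L}(\mathbf{w}_t)$ alone, so the decay is not rescaled by the adaptive denominator.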

Why AdamW is better:

Proper decoupling of weight decay
Better generalization
Standard for Transformer training
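The difference between the two variants is easiest to see on a single scalar weight with zero loss gradient, so that only the decay acts. The sketch below is an illustrative assumption-laden example (function names `adam_l2_step` and `adamw_step`, and the values `lr=0.1`, `wd=0.01`, are made up for the demo), not a reference implementation.

```python
import math


def adam_l2_step(w, m, v, grad, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: the decay is folded into the gradient."""
    g = grad + wd * w                      # L2 term enters the moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adamw_step(w, m, v, grad, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: moments see only the loss gradient; decay acts on w directly."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * w, m, v


# One step from w = 1.0 with zero loss gradient: only weight decay is active.
w_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, grad=0.0, t=1, lr=0.1, wd=0.01)
w_adamw, _, _ = adamw_step(1.0, 0.0, 0.0, grad=0.0, t=1, lr=0.1, wd=0.01)
```

In the L2 variant the decay signal is normalized by its own magnitude, so the weight moves by about `lr` no matter what `wd` is. In AdamW the weight shrinks by exactly `lr * wd * w`, which is what the hyperparameter is meant to control.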

Comparison Table

Optimizer	Pros	Cons	Best For
SGD	Simple, good generalization	Slow convergence, sensitive to LR	Computer Vision
Adam	Fast convergence, adaptive LR	Can generalize poorly	Sparse gradients, NLP
AdamW	Proper weight decay, good generalization	Slightly more compute	Transformers
LAMB	Large batch training	Complex	Distributed training

Learning Rate Schedules

Step Decay

\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}

Decays the LR by a factor of $\gamma$ every $s$ steps (or epochs, depending on how $t$ is counted). A common choice is $\gamma=0.1$, $s=30$ epochs.
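As a quick sketch, the step decay formula is a one-liner. The function name `step_decay` and the default `lr0=0.1` are assumptions for illustration; `t` is counted in whatever unit `step_size` uses.

```python
def step_decay(t, lr0=0.1, gamma=0.1, step_size=30):
    """Learning rate at time t: lr0 * gamma ** floor(t / step_size)."""
    return lr0 * gamma ** (t // step_size)
```

With these defaults the rate stays at `lr0` for `t` in 0..29, drops by 10x at `t=30`, and by another 10x at `t=60`.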

Cosine Annealing

\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)

Smooth decay from $\eta_{max}$ to $\eta_{min}$ over $T$ steps. A widely used default for many tasks.
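The cosine formula translates directly to code. A minimal sketch, with the function name `cosine_annealing` and the default bounds chosen only for illustration:

```python
import math


def cosine_annealing(t, total_steps, lr_max=1e-3, lr_min=0.0):
    """Cosine decay from lr_max at t=0 to lr_min at t=total_steps."""
    cos_term = 1 + math.cos(math.pi * t / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term
```

The rate starts at `lr_max`, passes through the midpoint of the two bounds at `t = total_steps / 2`, and reaches `lr_min` at the end.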

Warmup + Cosine Annealing

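\eta_t = \begin{cases} \eta_{max} \cdot \frac{t}{T_{warmup}} & t \leq T_{warmup} \\ \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{\pi (t - T_{warmup})}{T - T_{warmup}}\right)\right) & t > T_{warmup} \end{cases}

Linear warmup for the first $T_{warmup}$ steps, then cosine decay over the remaining $T - T_{warmup}$ steps. Standard practice for Transformers, where large early updates can destabilize training.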
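A sketch of the combined schedule, assuming the cosine phase spans the steps remaining after warmup (the function name `warmup_cosine` and that stretching choice are assumptions, since the cosine branch is not spelled out in full here):

```python
import math


def warmup_cosine(t, warmup_steps, total_steps, lr_max=1e-3, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay to lr_min at total_steps."""
    if t <= warmup_steps:
        return lr_max * t / warmup_steps
    # Fraction of the post-warmup phase completed, in (0, 1].
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

This requires `0 < warmup_steps < total_steps`. The two branches meet at `lr_max` when `t == warmup_steps`, so the schedule is continuous.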

Cyclical Learning Rates

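Oscillate between bounds with period $T$ (triangular schedule):

\eta_t = \eta_{min} + (\eta_{max} - \eta_{min})\left|\frac{2 (t \bmod T)}{T} - 1\right|

Each cycle starts at $\eta_{max}$, falls to $\eta_{min}$ at the midpoint, and climbs back. The periodic increases can help escape local minima.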
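A minimal sketch of a triangular cyclical schedule. It wraps `t` with a modulo so the pattern repeats every `cycle_len` steps, which is an assumption about the intended behaviour; the function name `triangular_lr` and the default bounds are also illustrative.

```python
def triangular_lr(t, cycle_len, lr_min=1e-4, lr_max=1e-2):
    """Triangular cycle: lr_max at the cycle start, lr_min at its midpoint."""
    phase = (t % cycle_len) / cycle_len   # position within the cycle, in [0, 1)
    return lr_min + (lr_max - lr_min) * abs(2 * phase - 1)
```

For `cycle_len=100` the rate is `lr_max` at steps 0, 100, 200 and `lr_min` at steps 50, 150, 250.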

Advanced Optimizers

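LAMB (Layer-wise Adaptive Moments optimizer for Batch training)

For large batch training, LAMB rescales the Adam update of each layer by a trust ratio:

\mathbf{r}_t = \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}, \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \frac{\phi(\|\mathbf{w}_t\|)}{\|\mathbf{r}_t + \lambda \mathbf{w}_t\|} (\mathbf{r}_t + \lambda \mathbf{w}_t)

Here $\mathbf{w}_t$ denotes the weights of a single layer, the norms are taken per layer, and $\phi$ is a scaling function (often the identity, optionally clipped).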

Enables batch sizes up to 32K for BERT training.
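The layer-wise idea can be sketched as follows: take the Adam direction for one layer, add decoupled weight decay, and rescale so that the step length is proportional to the layer's weight norm. This sketch assumes $\phi$ is the identity and that the Adam direction is computed elsewhere; the function name `lamb_layer_update` and the fallback trust ratio of 1.0 for zero norms are assumptions.

```python
import math


def lamb_layer_update(w, adam_dir, lr=1e-3, wd=0.01):
    """LAMB-style update for one layer (lists of floats).

    adam_dir is the Adam direction m_hat / (sqrt(v_hat) + eps) for this layer.
    """
    update = [r + wd * wi for r, wi in zip(adam_dir, w)]
    w_norm = math.sqrt(sum(x * x for x in w))
    u_norm = math.sqrt(sum(x * x for x in update))
    # Trust ratio ||w|| / ||update||; fall back to 1.0 if either norm is zero.
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return [wi - lr * trust * ui for wi, ui in zip(w, update)]
```

With the trust ratio applied, the step has norm `lr * ||w||`, so layers with large weights take proportionally large steps and layers with small weights take small ones.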

Lion (Google Brain)

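Uses only the sign of an interpolation between the momentum and the current gradient:

\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \left(\text{sign}(\beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\nabla \mathcal{L}(\mathbf{w}_t)) + \lambda \mathbf{w}_t\right), \quad \mathbf{m}_t = \beta_2 \mathbf{m}_{t-1} + (1-\beta_2)\nabla \mathcal{L}(\mathbf{w}_t)

Memory efficient because it stores only the momentum and no second-moment estimate, which helps with large models.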
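A sketch of one Lion step on lists of floats, assuming the momentum buffer is refreshed with a separate coefficient `beta2` after the weights are updated. The function names and default values here are illustrative assumptions.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(w, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion-style step; returns (new_w, new_m)."""
    new_w, new_m = [], []
    for wi, mi, gi in zip(w, m, grad):
        # Update direction: sign of the interpolated momentum and gradient.
        direction = sign(beta1 * mi + (1 - beta1) * gi)
        new_w.append(wi - lr * (direction + wd * wi))
        # Momentum is updated afterwards, with its own coefficient.
        new_m.append(beta2 * mi + (1 - beta2) * gi)
    return new_w, new_m
```

Because the direction is a sign, every parameter moves by exactly `lr` (plus decay) per step, which is why Lion is usually run with a smaller learning rate than Adam.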

Practical Guidelines

Follow-Up Questions

Q: Why does Adam sometimes generalize worse than SGD? A: Adaptive methods can converge to sharp minima that have low training loss but poor generalization. SGD with momentum tends to find flatter minima.

Q: What is the relationship between learning rate and batch size? A: Linear scaling rule: when the batch size is multiplied by $k$, multiply the LR by $k$. Works for SGD; Adam is more robust to batch size changes.
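As a worked instance of the linear scaling rule, the helper below scales a base learning rate with the batch size. The function name `scale_lr` and the numbers in the example are hypothetical.

```python
def scale_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: LR grows in proportion to the batch size."""
    return base_lr * new_batch / base_batch


# Hypothetical example: going from batch 256 at LR 0.1 to batch 1024 (k = 4).
scaled = scale_lr(0.1, 256, 1024)
```

Here `k = 4`, so the learning rate is multiplied by 4 as well.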

Q: How do you choose the number of warmup steps relative to total training steps? A: Warmup is typically 5-10% of total steps. More warmup is needed for larger models and batch sizes. Start with 2000-4000 steps for most tasks.