Optimizers — From SGD to Adam, Finding the Best Weights
Optimizers determine how neural network parameters are updated based on computed gradients. The choice of optimizer and learning rate schedule significantly impacts training speed and final performance.
- AdamW is the Default — Combines momentum with adaptive learning rates and decoupled weight decay
- SGD + Momentum for Vision — Often achieves better generalization than adaptive methods on image tasks
- Cosine Annealing + Warmup — The standard learning rate schedule for modern deep learning
Optimizers for Deep Learning — SGD, Adam, AdamW and Learning Rate Schedules
Gradient Descent Foundation
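Definition: Stochastic Gradient Descent
SGD updates parameters using mini-batch gradients:
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$
where $\eta$ is the learning rate and $\nabla_\theta L(\theta_t)$ is the gradient of the loss with respect to the parameters, estimated on a mini-batch.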
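As a minimal sketch of this update rule, the plain-Python example below minimizes a toy quadratic loss $L(\theta) = \tfrac{1}{2}\lVert\theta\rVert^2$, whose gradient is simply $\theta$. The names `sgd_step` and `grad_quadratic`, the loss, and the learning rate of 0.1 are illustrative assumptions, not part of any library; a real mini-batch gradient would also be noisy.

```python
def grad_quadratic(theta):
    """Gradient of the toy loss L(theta) = 0.5 * sum(theta_i^2) (assumed example)."""
    return list(theta)


def sgd_step(theta, grad, lr):
    """One SGD update: theta <- theta - lr * grad."""
    return [p - lr * g for p, g in zip(theta, grad)]


theta = [1.0, -2.0]
for _ in range(100):
    theta = sgd_step(theta, grad_quadratic(theta), lr=0.1)
# On this loss each step multiplies theta by (1 - lr),
# so the parameters shrink toward the minimum at 0.
```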
SGD with Momentum
Definition: Momentum
Momentum accelerates convergence by accumulating velocity from past gradients:
$$v_t = \beta \, v_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \eta \, v_t$$
Momentum (typically $\beta = 0.9$) smooths out noisy gradients and accelerates convergence in consistent gradient directions.
Here,
- $v_t$ = Velocity (accumulated gradient)
- $\beta$ = Momentum coefficient (typically 0.9)
- $\eta$ = Learning rate
- $g_t = \nabla_\theta L(\theta_t)$ = Current gradient
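The sketch below illustrates the velocity accumulation, assuming the common formulation $v \leftarrow \beta v + g$ followed by $\theta \leftarrow \theta - \eta v$ (other variants fold the learning rate into the velocity). The function name `momentum_step` and the constant gradient are illustrative assumptions.

```python
def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """v <- beta * v + g, then theta <- theta - lr * v."""
    velocity = [beta * v + g for v, g in zip(velocity, grad)]
    theta = [p - lr * v for p, v in zip(theta, velocity)]
    return theta, velocity


theta, velocity = [1.0], [0.0]
# With a constant gradient of 1.0 the velocity builds up
# toward 1 / (1 - beta) = 10, i.e. a 10x larger effective step.
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, grad=[1.0])
```

The limit $1/(1-\beta)$ is why a consistent gradient direction is amplified, while gradients that flip sign from step to step largely cancel in the running sum.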
Adaptive Learning Rate Methods
AdaGrad
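Definition: AdaGrad
AdaGrad adapts learning rates based on historical gradient magnitudes:
$$G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \, g_t$$
where $g_t$ is the current gradient, $G_t$ is the per-parameter sum of squared gradients, and $\epsilon$ is a small constant for numerical stability.
Parameters with large accumulated gradients receive smaller updates. This works well for sparse data, but because $G_t$ only ever grows, the effective learning rate shrinks monotonically and can become too small to make further progress.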
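To make the shrinking step size concrete, here is a small sketch of a per-parameter AdaGrad update in plain Python. The function name `adagrad_step`, the learning rate, and the constant gradient are assumptions chosen for illustration.

```python
import math


def adagrad_step(theta, accum, grad, lr=0.1, eps=1e-8):
    """Accumulate squared gradients, then scale each update by 1 / sqrt(accum)."""
    accum = [a + g * g for a, g in zip(accum, grad)]
    theta = [p - lr * g / (math.sqrt(a) + eps)
             for p, g, a in zip(theta, grad, accum)]
    return theta, accum


theta, accum = [0.0, 0.0], [0.0, 0.0]
step_sizes = []
for _ in range(3):
    new_theta, accum = adagrad_step(theta, accum, grad=[1.0, 0.0])
    step_sizes.append(abs(new_theta[0] - theta[0]))
    theta = new_theta
# With a constant gradient of 1, step t has size about lr / sqrt(t),
# so every step is smaller than the one before.
# The second parameter never receives a gradient and stays untouched.
```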
RMSProp
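Definition: RMSProp
RMSProp uses an exponential moving average of squared gradients to prevent AdaGrad's aggressive decay:
$$s_t = \rho \, s_{t-1} + (1 - \rho) \, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t} + \epsilon} \, g_t$$
The decay rate $\rho$ (typically 0.99) controls the window of past gradients considered.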
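The following sketch shows why the moving average avoids AdaGrad's ever-shrinking steps: under a constant gradient the running average converges instead of growing. The name `rmsprop_step` and the hyperparameter values are illustrative assumptions.

```python
import math


def rmsprop_step(theta, avg_sq, grad, lr=0.01, rho=0.99, eps=1e-8):
    """Track an exponential moving average of g^2 and normalize the step by its root."""
    avg_sq = [rho * s + (1 - rho) * g * g for s, g in zip(avg_sq, grad)]
    theta = [p - lr * g / (math.sqrt(s) + eps)
             for p, g, s in zip(theta, grad, avg_sq)]
    return theta, avg_sq


theta, avg_sq = [0.0], [0.0]
for _ in range(2000):
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad=[2.0])
# The running average converges to g^2 = 4 rather than growing without bound,
# so the step size settles near lr * g / sqrt(4) = lr.
```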
Adam and AdamW
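Definition: Adam (Adaptive Moment Estimation)
Adam combines momentum and RMSProp:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where $g_t = \nabla_\theta L(\theta_t)$, $m_t$ is the first-moment (mean) estimate, and $v_t$ is the second-moment (uncentered variance) estimate.
Default: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.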
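A compact sketch of a single Adam step, including the bias-correction terms, is given below. The function name `adam_step` is an assumption; the defaults mirror the values quoted above, with a learning rate of 1e-3.

```python
import math


def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


theta, m, v = [1.0], [0.0], [0.0]
theta, m, v = adam_step(theta, m, v, grad=[100.0], t=1)
# Thanks to bias correction, the first step has magnitude close to lr
# regardless of the gradient's scale (here the gradient is 100).
```

Without the bias correction, the zero-initialized moments would make the first few steps much smaller or larger than intended, depending on the ratio of the two estimates.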
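Definition: AdamW (Decoupled Weight Decay)
AdamW decouples weight decay from the gradient update:
$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \, \theta_t \right)$$
where $\hat{m}_t$ and $\hat{v}_t$ are Adam's bias-corrected moment estimates and $\lambda$ is the weight decay coefficient. This is different from L2 regularization, which is applied through the gradient.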
Adam vs AdamW
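Standard Adam with L2 regularization adds the penalty term to the gradient, so the penalty is rescaled by the adaptive per-parameter learning rate: parameters with a history of large gradients are regularized less than parameters with small gradients. AdamW applies weight decay directly to the weights, so every parameter decays at the same relative rate regardless of its gradient history. AdamW is now the default optimizer for transformers and large models.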
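The difference can be seen in a single scalar step. The sketch below is a simplified illustration, not any library's implementation: `adam_direction`, `adam_l2_step`, and `adamw_step` are hypothetical helper names, and the loss gradient is set to zero so that only the decay term acts.

```python
import math


def adam_direction(m, v, grad, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update both moment estimates and return the bias-corrected Adam direction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(theta, m, v, grad, t, lr=1e-3, wd=0.01):
    """Adam + L2: the penalty wd * theta is folded into the gradient."""
    direction, m, v = adam_direction(m, v, grad + wd * theta, t)
    return theta - lr * direction, m, v


def adamw_step(theta, m, v, grad, t, lr=1e-3, wd=0.01):
    """AdamW: the decay term bypasses the adaptive scaling."""
    direction, m, v = adam_direction(m, v, grad, t)
    return theta - lr * (direction + wd * theta), m, v


# One step from theta = 1 with a zero loss gradient, so only the decay acts.
theta_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, grad=0.0, t=1)
theta_w, _, _ = adamw_step(1.0, 0.0, 0.0, grad=0.0, t=1)
# Adam + L2 normalizes the penalty like any other gradient: the step is about lr.
# AdamW shrinks the weight by exactly lr * wd * theta.
```

In this first step, changing `wd` barely changes the Adam + L2 result because the penalty is divided by its own magnitude, whereas the AdamW shrinkage scales linearly with `wd`.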
Learning Rate Schedules
Step Decay
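Definition: Step Decay
Reduce the learning rate by a factor $\gamma$ every $s$ epochs:
$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$
where $\eta_0$ is the initial learning rate and $t$ is the epoch index.
Simple, but the decay factor and interval must be tuned by hand.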
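A step-decay schedule takes only a few lines. In the sketch below, the function name `step_decay_lr` and the values `base_lr=0.1`, `gamma=0.1`, and `step_size=30` are assumptions picked for illustration.

```python
def step_decay_lr(epoch, base_lr=0.1, gamma=0.1, step_size=30):
    """Multiply the base learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# The rate stays flat within each 30-epoch block and drops 10x at the boundary.
lrs = [step_decay_lr(epoch) for epoch in (0, 29, 30, 60)]
```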
Cosine Annealing
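Definition: Cosine Annealing
The learning rate follows a cosine curve:
$$\eta_t = \eta_{\min} + \frac{1}{2} (\eta_{\max} - \eta_{\min}) \left( 1 + \cos\left( \frac{\pi t}{T} \right) \right)$$
where $t$ is the current step and $T$ is the total number of steps.
Smooth decay that starts slowly, is steepest mid-training, and flattens out as it approaches $\eta_{\min}$. Often combined with warmup.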
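The cosine schedule can be sketched as a pure function of the step index. The name `cosine_lr` and the default peak and floor values are assumptions for illustration.

```python
import math


def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Cosine annealing from lr_max (step 0) down to lr_min (step total_steps)."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


start = cosine_lr(0, 100)     # equals lr_max
middle = cosine_lr(50, 100)   # halfway between lr_max and lr_min
end = cosine_lr(100, 100)     # equals lr_min

# Compare how much the rate drops over the first 10% and the middle 10% of training.
early_drop = cosine_lr(0, 100) - cosine_lr(10, 100)
mid_drop = cosine_lr(45, 100) - cosine_lr(55, 100)
```

Comparing `early_drop` with `mid_drop` shows that the curve is nearly flat at the start and falls fastest around the midpoint.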
Warmup
Definition: Linear Warmup
Linearly increase the learning rate from 0 to $\eta_{\max}$ over $T_w$ steps:
$$\eta_t = \eta_{\max} \cdot \frac{t}{T_w}, \qquad t \le T_w$$
Warmup stabilizes training in the early stages when gradients are noisy.
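Since warmup is usually paired with a decay schedule, the sketch below chains a linear warmup into cosine annealing. The function name `warmup_cosine_lr` and the step counts are illustrative assumptions; it requires `total_steps > warmup_steps`.

```python
import math


def warmup_cosine_lr(step, warmup_steps, total_steps, lr_max=1e-3, lr_min=0.0):
    """Linear warmup from 0 to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed (0 at the peak, 1 at the end).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


schedule = [warmup_cosine_lr(s, warmup_steps=10, total_steps=100) for s in range(101)]
# The rate rises linearly for 10 steps, peaks at lr_max, then decays to lr_min.
```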
Optimizer Selection Guide
Optimizer Hyperparameters
| Optimizer | Default LR | $\beta_1$ / momentum | $\beta_2$ | $\epsilon$ | Weight Decay |
|---|---|---|---|---|---|
| SGD | 0.1 | N/A | N/A | N/A | 1e-4 |
| SGD+Momentum | 0.1 | 0.9 | N/A | N/A | 1e-4 |
| Adam | 1e-3 | 0.9 | 0.999 | 1e-8 | 0 |
| AdamW | 1e-3 | 0.9 | 0.999 | 1e-8 | 0.01 |
| LAMB | 1e-3 | 0.9 | 0.999 | 1e-6 | 0.01 |
Practical Tips
Learning Rate Finding
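Use a learning rate finder: start with a very small learning rate, increase it exponentially at every step, and plot loss against learning rate. A good choice is the learning rate at which the loss decreases fastest, which is typically about 10x smaller than the learning rate at which the loss reaches its minimum.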
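The procedure can be sketched on a toy problem. Everything here is a simplifying assumption: the "model" is a single scalar with loss $\tfrac{1}{2}\theta^2$, the helper names are hypothetical, and the steepest point is taken from raw per-step loss drops, whereas practical finders train a real model and usually smooth the loss curve first.

```python
def exponential_lrs(lr_min=1e-5, lr_max=10.0, num_steps=50):
    """Learning rates spaced evenly on a log scale from lr_min to lr_max."""
    ratio = (lr_max / lr_min) ** (1 / (num_steps - 1))
    return [lr_min * ratio ** i for i in range(num_steps)]


def lr_range_test(lrs, theta=1.0):
    """Take one SGD step per learning rate on L = 0.5 * theta^2, recording the loss."""
    losses = [0.5 * theta ** 2]
    for lr in lrs:
        theta -= lr * theta  # the gradient of the toy loss is theta
        losses.append(0.5 * theta ** 2)
    return losses


def steepest_descent_lr(lrs, losses):
    """Return the learning rate whose step produced the largest loss decrease."""
    drops = [losses[i] - losses[i + 1] for i in range(len(lrs))]
    best = max(range(len(drops)), key=lambda i: drops[i])
    return lrs[best]


lrs = exponential_lrs()
losses = lr_range_test(lrs)
suggested_lr = steepest_descent_lr(lrs, losses)
```

On this toy loss any learning rate above 2 makes the loss grow, so the suggested value always falls below that divergence threshold.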
Common Mistakes
- Using Adam with default lr=1e-3 for all tasks (task-dependent!)
- Forgetting weight decay with Adam (use AdamW instead)
- Not using warmup for large batch training
- Changing multiple hyperparameters simultaneously
Summary
- SGD + Momentum for computer vision: often the best generalization with proper tuning
- AdamW for NLP/transformers: fast convergence, decoupled weight decay
- Cosine annealing + warmup is the standard learning rate schedule
- Learning rate is the most important hyperparameter — tune it first
- Different tasks require different optimizers and hyperparameters