Optimizers — From SGD to Adam, Finding the Best Weights

Optimizers determine how neural network parameters are updated based on computed gradients. The choice of optimizer and learning rate schedule significantly impacts training speed and final performance.

AdamW is the Default — Combines momentum with adaptive learning rates and decoupled weight decay
SGD + Momentum for Vision — Often achieves better generalization than adaptive methods on image tasks
Cosine Annealing + Warmup — The standard learning rate schedule for modern deep learning

Optimizers for Deep Learning — SGD, Adam, AdamW and Learning Rate Schedules

Gradient Descent Foundation

Definition: Stochastic Gradient Descent

SGD updates parameters using mini-batch gradients:

\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)

where $\eta$ is the learning rate and $\nabla_\theta \mathcal{L}$ is the gradient of the loss with respect to parameters.
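The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy loss $\mathcal{L}(\theta) = \sum_i \theta_i^2$ and the helper names `sgd_step` and `toy_grad` are assumptions introduced here, and the full gradient stands in for a mini-batch gradient.

```python
def sgd_step(theta, grad, lr):
    """One SGD update: theta <- theta - lr * grad, element-wise."""
    return [p - lr * g for p, g in zip(theta, grad)]


def toy_grad(theta):
    """Gradient of the hypothetical toy loss L(theta) = sum(p ** 2)."""
    return [2.0 * p for p in theta]


theta = [1.0, -2.0]
for _ in range(50):
    theta = sgd_step(theta, toy_grad(theta), lr=0.1)
# Each step multiplies theta by (1 - 2 * lr) = 0.8, so it shrinks toward 0.
```

On this toy loss any learning rate above 1.0 makes the factor $|1 - 2\eta|$ exceed 1 and the iterates diverge, which is the simplest picture of why the learning rate needs tuning.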

SGD with Momentum

Definition: Momentum

Momentum accelerates convergence by accumulating velocity from past gradients:

v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t)

\theta_{t+1} = \theta_t - \eta v_t

Momentum (typically $\beta = 0.9$) smooths out noisy gradients and accelerates convergence along directions where the gradient is consistent.

Here,

$v_t$ = velocity (accumulated gradient)
$\beta$ = momentum coefficient (typically 0.9)
$\eta$ = learning rate
$\nabla_{\theta} \mathcal{L}$ = current gradient
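As a rough sketch of the momentum update, the snippet below applies the two equations to a constant gradient. The function name `momentum_step` and the constant-gradient setup are assumptions chosen for illustration.

```python
def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """One momentum update: v <- beta * v + grad, then theta <- theta - lr * v."""
    new_velocity = [beta * v + g for v, g in zip(velocity, grad)]
    new_theta = [p - lr * v for p, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


theta, velocity = [0.0], [0.0]
for _ in range(100):
    # Hypothetical constant gradient of 1.0 in a single dimension.
    theta, velocity = momentum_step(theta, velocity, grad=[1.0])
# Under a constant gradient the velocity approaches 1 / (1 - beta) = 10.
```

With this formulation a persistent gradient direction is amplified by up to $1/(1-\beta)$, so the effective step size is closer to $\eta/(1-\beta)$ than to $\eta$.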

Adaptive Learning Rate Methods

AdaGrad

Definition: AdaGrad

AdaGrad adapts learning rates based on historical gradient magnitudes:

G_t = G_{t-1} + (\nabla_\theta \mathcal{L}_t)^2

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla_\theta \mathcal{L}_t

Parameters with large accumulated gradients receive smaller updates. This works well for sparse data, but because $G_t$ only grows, the effective learning rate shrinks monotonically and can become too small to make further progress.
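A small sketch makes the shrinking step size visible. The function name `adagrad_step` and the constant gradient of 1.0 are assumptions for illustration; the update follows the two AdaGrad equations above, with $\epsilon$ inside the square root.

```python
import math


def adagrad_step(theta, accum, grad, lr=0.1, eps=1e-8):
    """One AdaGrad update with a per-parameter sum of squared gradients."""
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_theta = [
        p - lr / math.sqrt(a + eps) * g
        for p, a, g in zip(theta, new_accum, grad)
    ]
    return new_theta, new_accum


theta, accum = [0.0], [0.0]
step_sizes = []
for _ in range(4):
    new_theta, accum = adagrad_step(theta, accum, grad=[1.0])
    step_sizes.append(theta[0] - new_theta[0])
    theta = new_theta
# With a constant gradient of 1.0 the step at iteration t is about lr / sqrt(t).
```

Here the four recorded steps are approximately 0.1, 0.0707, 0.0577 and 0.05: the step keeps shrinking even though the gradient never changes.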

RMSProp

Definition: RMSProp

RMSProp replaces AdaGrad's running sum with an exponential moving average of squared gradients, which prevents the aggressive decay:

v_t = \beta v_{t-1} + (1 - \beta)(\nabla_\theta \mathcal{L}_t)^2

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \nabla_\theta \mathcal{L}_t

The decay rate $\beta$ (typically 0.99) controls the window of past gradients considered.
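The following sketch applies the RMSProp equations to a constant gradient to show that the step size settles instead of decaying to zero. The name `rmsprop_step`, the learning rate of 0.01 and the constant gradient are assumptions for illustration.

```python
import math


def rmsprop_step(theta, avg_sq, grad, lr=0.01, beta=0.99, eps=1e-8):
    """One RMSProp update using a moving average of squared gradients."""
    new_avg_sq = [beta * s + (1 - beta) * g * g for s, g in zip(avg_sq, grad)]
    new_theta = [
        p - lr / math.sqrt(s + eps) * g
        for p, s, g in zip(theta, new_avg_sq, grad)
    ]
    return new_theta, new_avg_sq


theta, avg_sq = [0.0], [0.0]
last_step = 0.0
for _ in range(1000):
    new_theta, avg_sq = rmsprop_step(theta, avg_sq, grad=[1.0])
    last_step = theta[0] - new_theta[0]
    theta = new_theta
# The moving average converges to grad ** 2 = 1, so the step settles near lr.
```

Because the average starts at zero and this form has no bias correction, the first few steps are larger than `lr`; Adam's bias-corrected estimates address exactly that.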

Adam and AdamW

Definition: Adam (Adaptive Moment Estimation)

Adam combines momentum and RMSProp:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}_t

v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_\theta \mathcal{L}_t)^2

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
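The four Adam equations translate almost line for line into code. The sketch below is a plain-Python illustration under the defaults listed above; the function name `adam_step` and the learning rate default of 1e-3 are assumptions, and `t` is the 1-based step count used for bias correction.

```python
import math


def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step index used for bias correction."""
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in new_m]
    v_hat = [vi / (1 - beta2 ** t) for vi in new_v]
    new_theta = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return new_theta, new_m, new_v


# First step from zero-initialised moments with a hypothetical gradient of 5.0.
theta, m, v = adam_step([1.0], [0.0], [0.0], grad=[5.0], t=1)
```

On the first step the bias-corrected moments equal the gradient and its square, so the update has magnitude close to `lr` regardless of how large the gradient is. That scale invariance is what makes a single learning rate workable across parameters with very different gradient magnitudes.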

Definition: AdamW (Decoupled Weight Decay)

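AdamW decouples weight decay from the gradient update and shrinks the weights directly:

\theta_{t+1} = (1 - \eta\lambda)\theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

where $\lambda$ is the weight decay coefficient. The decay term is scaled by the learning rate $\eta$, as in common implementations such as PyTorch's AdamW, but it does not pass through the adaptive $1/\sqrt{\hat{v}_t}$ scaling. This differs from L2 regularization, which adds $\lambda\theta_t$ to the gradient before the moment estimates are computed.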

Adam vs AdamW

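Adam with L2 regularization adds the penalty term to the gradient, so the decay is divided by $\sqrt{\hat{v}_t}$ along with everything else: parameters with large historical gradients are regularized less than parameters with small ones. AdamW applies the decay directly to the weights, so every parameter shrinks at the same relative rate regardless of its gradient history. AdamW is now the default optimizer for transformers and large models.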
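The difference is easiest to see when the loss gradient is zero and only the regularizer acts on a weight. The sketch below compares a single step of both variants on one scalar parameter. The function names are hypothetical, and the AdamW branch assumes the decay is multiplied by the learning rate, as common implementations do.

```python
import math


def adam_l2_step(theta, m, v, grad, t, lr=1e-3, wd=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: the penalty is folded into the gradient."""
    g = grad + wd * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adamw_step(theta, m, v, grad, t, lr=1e-3, wd=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: moments see only the loss gradient; decay hits the weight directly."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta * (1 - lr * wd)  # assumed convention: decay scaled by lr
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Hypothetical setup: weight of 10.0, zero loss gradient, first step.
l2_theta, _, _ = adam_l2_step(10.0, 0.0, 0.0, grad=0.0, t=1)
w_theta, _, _ = adamw_step(10.0, 0.0, 0.0, grad=0.0, t=1)
l2_shrink = 10.0 - l2_theta  # about lr: the penalty was normalised away
w_shrink = 10.0 - w_theta    # about lr * wd * theta
```

In the L2 variant the weight moves by roughly `lr` whatever value `wd` takes, because the penalty gradient is normalized by its own magnitude. In the decoupled variant the shrinkage is proportional to both `wd` and the weight itself.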

Learning Rate Schedules

Step Decay

Definition: Step Decay

Multiply the learning rate by a factor $\gamma$ every $k$ epochs:

\eta_t = \eta_0 \cdot \gamma^{\lfloor t / k \rfloor}

Simple, but the decay factor $\gamma$ and the interval $k$ must be tuned by hand.
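The step decay formula is a one-liner in code. In this sketch the name `step_decay` and the default values (base rate 0.1, $\gamma = 0.1$, $k = 30$) are illustrative assumptions rather than recommendations from the text.

```python
def step_decay(epoch, base_lr=0.1, gamma=0.1, step_size=30):
    """Learning rate after `epoch` epochs: base_lr * gamma ** floor(epoch / k)."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate at the start of each 30-epoch stage.
stage_lrs = [step_decay(epoch) for epoch in (0, 30, 60, 90)]
```

The rate stays flat within each stage and drops by a factor of `gamma` at epochs 30, 60 and 90.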

Cosine Annealing

Definition: Cosine Annealing

The learning rate follows a cosine curve:

\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)

A smooth decay that starts slowly, is steepest around the midpoint, and flattens out as it approaches $\eta_{\min}$. Often combined with warmup.
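A direct transcription of the cosine formula is shown below as a minimal sketch. The function name `cosine_annealing` and the default values of `lr_max` and `lr_min` are assumptions for illustration.

```python
import math


def cosine_annealing(t, total_steps, lr_max=1e-3, lr_min=0.0):
    """Learning rate at step t of a cosine schedule running over total_steps."""
    cosine = math.cos(math.pi * t / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)


# Sample the schedule at the start, midpoint and end of a 100-step run.
samples = [cosine_annealing(t, 100) for t in (0, 50, 100)]
```

The schedule equals `lr_max` at `t = 0`, the midpoint of the two rates at `t = total_steps / 2`, and `lr_min` at `t = total_steps`.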

Warmup

Definition: Linear Warmup

Linearly increase learning rate from 0 to $\eta_{\max}$ over $T_w$ steps:

\eta_t = \eta_{\max} \cdot \frac{t}{T_w}, \quad t \leq T_w

Warmup stabilizes training in the early stages when gradients are noisy.
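Warmup is usually chained with a decay schedule. The sketch below combines the linear warmup formula with cosine annealing over the remaining steps. The function name `warmup_cosine`, the way the cosine phase is re-indexed to start after warmup, and the default rates are assumptions of this example.

```python
import math


def warmup_cosine(t, warmup_steps, total_steps, lr_max=1e-3, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay to lr_min by total_steps."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if t < warmup_steps:
        # Warmup phase: lr_max * t / T_w.
        return lr_max * t / warmup_steps
    # Decay phase: progress runs from 0 to 1 over the remaining steps.
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


# Full schedule for a hypothetical 1000-step run with 100 warmup steps.
schedule = [warmup_cosine(t, warmup_steps=100, total_steps=1000) for t in range(1001)]
```

The two phases meet at `t = warmup_steps`, where both formulas give `lr_max`, so the schedule has no jump at the hand-over point.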

Optimizer Selection Guide

Optimizer Hyperparameters

Optimizer	Typical LR	$\beta_1$ (momentum)	$\beta_2$	$\epsilon$	Weight Decay
SGD	0.1	N/A	N/A	N/A	1e-4
SGD+Momentum	0.1	0.9	N/A	N/A	1e-4
Adam	1e-3	0.9	0.999	1e-8	0
AdamW	1e-3	0.9	0.999	1e-8	0.01
LAMB	1e-3	0.9	0.999	1e-6	0.01

Practical Tips

Learning Rate Finding

Use a learning rate finder: start with a very small lr, increase it exponentially at each step, and plot loss against lr on a log scale. A good starting lr is where the loss decreases fastest, typically about 10x smaller than the lr at which the loss reaches its minimum.
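The two ingredients of a learning rate finder, the exponential sweep and the choice of the steepest point, can be sketched as follows. The function names `lr_sweep` and `suggest_lr` are hypothetical, and the loss values in the example are made up for illustration rather than measured; in practice they would be recorded from one training step per learning rate.

```python
import math


def lr_sweep(lr_start, lr_end, num_steps):
    """Exponentially spaced learning rates from lr_start to lr_end (num_steps >= 2)."""
    ratio = lr_end / lr_start
    return [lr_start * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


def suggest_lr(lrs, losses):
    """Return (lr where the steepest loss drop begins, lr at the minimum loss)."""
    slopes = [
        (losses[i + 1] - losses[i]) / (math.log10(lrs[i + 1]) - math.log10(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    steepest = min(range(len(slopes)), key=slopes.__getitem__)
    lowest = min(range(len(losses)), key=losses.__getitem__)
    return lrs[steepest], lrs[lowest]


# Made-up loss values for illustration only, not measurements.
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.28, 2.10, 1.20, 0.90, 5.00]
steepest_lr, min_loss_lr = suggest_lr(lrs, losses)
```

Real loss curves are noisy, so implementations typically smooth the losses before taking slopes; that step is omitted here for brevity.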

Common Mistakes

Using Adam with default lr=1e-3 for all tasks (task-dependent!)
Forgetting weight decay with Adam (use AdamW instead)
Not using warmup for large batch training
Changing multiple hyperparameters simultaneously

Summary

SGD + Momentum for computer vision: often better generalization than adaptive methods when properly tuned
AdamW for NLP/transformers: fast convergence, decoupled weight decay
Cosine annealing + warmup is the standard learning rate schedule
Learning rate is the most important hyperparameter — tune it first
Different tasks require different optimizers and hyperparameters
