
Optimizers for Deep Learning — SGD, Adam, AdamW and Learning Rate Schedules


Optimizers — From SGD to Adam, Finding the Best Weights

Optimizers determine how neural network parameters are updated based on computed gradients. The choice of optimizer and learning rate schedule significantly impacts training speed and final performance.

  • AdamW is the Default — Combines momentum with adaptive learning rates and decoupled weight decay
  • SGD + Momentum for Vision — Often achieves better generalization than adaptive methods on image tasks
  • Cosine Annealing + Warmup — The standard learning rate schedule for modern deep learning


Gradient Descent Foundation

Definition: Stochastic Gradient Descent

SGD updates parameters using mini-batch gradients:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$

where $\eta$ is the learning rate and $\nabla_\theta \mathcal{L}$ is the gradient of the loss with respect to the parameters.
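As a minimal sketch of the update rule above, the following applies repeated SGD steps to a toy quadratic loss. The helper names `sgd_step` and `quadratic_grad` are hypothetical, and the gradient here is exact rather than estimated from a mini-batch, which keeps the example deterministic.

```python
def sgd_step(theta, grad, lr):
    """One SGD update: theta <- theta - lr * grad, element-wise over lists."""
    return [p - lr * g for p, g in zip(theta, grad)]


def quadratic_grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * sum(theta_i ** 2)."""
    return list(theta)


theta = [1.0, -2.0]
for _ in range(100):
    # Each step multiplies every parameter by (1 - lr) on this loss.
    theta = sgd_step(theta, quadratic_grad(theta), lr=0.1)
```

On this loss every parameter shrinks geometrically toward the minimum at zero. In real training the gradient would come from a sampled mini-batch, so the path would be noisier.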

SGD with Momentum

Definition: Momentum

Momentum accelerates convergence by accumulating velocity from past gradients:

$$v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t)$$
$$\theta_{t+1} = \theta_t - \eta v_t$$

Momentum (typically $\beta = 0.9$) smooths out noisy gradients and accelerates convergence in consistent gradient directions.

Figure: SGD vs SGD with Momentum, convergence path toward the minimum. SGD zigzags due to noisy gradients; momentum gives smoother, faster convergence.

SGD with Momentum

$$v_t = \beta v_{t-1} + \nabla_{\theta} \mathcal{L}(\theta_t), \quad \theta_{t+1} = \theta_t - \eta v_t$$

Here,

  • $v_t$ = Velocity (accumulated gradient)
  • $\beta$ = Momentum coefficient (typically 0.9)
  • $\eta$ = Learning rate
  • $\nabla_{\theta} \mathcal{L}$ = Current gradient
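The momentum update can be sketched as follows. The function name `momentum_step` is hypothetical. The constant gradient is an artificial input, chosen to show that the velocity approaches $g / (1 - \beta)$, which is 10 times the gradient for $\beta = 0.9$.

```python
def momentum_step(theta, velocity, grad, lr, beta=0.9):
    """One SGD-with-momentum update, following v_t = beta * v_{t-1} + grad."""
    new_velocity = [beta * v + g for v, g in zip(velocity, grad)]
    new_theta = [p - lr * v for p, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


theta, velocity = [0.0], [0.0]
for _ in range(200):
    # Artificial constant gradient of 1.0 in every step.
    theta, velocity = momentum_step(theta, velocity, grad=[1.0], lr=0.01)
# velocity[0] is now very close to 1 / (1 - 0.9) = 10
```

One consequence is that, in a direction where gradients agree from step to step, the effective step size is roughly $\eta / (1 - \beta)$. This is worth keeping in mind when moving a learning rate from plain SGD to momentum.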

Adaptive Learning Rate Methods

AdaGrad

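Definition: AdaGrad

AdaGrad adapts learning rates based on historical gradient magnitudes:

$$G_t = G_{t-1} + (\nabla_\theta \mathcal{L}_t)^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla_\theta \mathcal{L}_t$$

The square and the division are element-wise, so each parameter gets its own effective learning rate. Parameters with large accumulated gradients receive smaller updates, which works well for sparse data. However, $G_t$ never decreases, so the effective learning rate shrinks throughout training and can become too small to make progress.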
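A minimal sketch of the AdaGrad update, using the formulas above with $\epsilon$ inside the square root. The name `adagrad_step` is hypothetical, and the constant gradient is an artificial input used only to make the shrinking step size visible.

```python
import math


def adagrad_step(theta, accum, grad, lr, eps=1e-8):
    """One AdaGrad update with a per-parameter accumulated squared gradient."""
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_theta = [
        p - lr / math.sqrt(a + eps) * g
        for p, a, g in zip(theta, new_accum, grad)
    ]
    return new_theta, new_accum


theta, accum = [0.0], [0.0]
step_sizes = []
for _ in range(4):
    new_theta, accum = adagrad_step(theta, accum, grad=[1.0], lr=0.1)
    step_sizes.append(theta[0] - new_theta[0])  # size of the step just taken
    theta = new_theta
# With a constant unit gradient, step t has size lr / sqrt(t).
```

With a constant unit gradient, the step at iteration $t$ is $\eta / \sqrt{t}$, so the fourth step is already half the size of the first.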

RMSProp

Definition: RMSProp

RMSProp uses an exponential moving average of squared gradients to prevent AdaGrad's aggressive decay:

$$v_t = \beta v_{t-1} + (1 - \beta)(\nabla_\theta \mathcal{L}_t)^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \nabla_\theta \mathcal{L}_t$$

The decay rate $\beta$ (typically 0.99) controls the window of past gradients considered.
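The RMSProp update can be sketched in the same style. The name `rmsprop_step` is hypothetical, and the constant gradient is again artificial. Because the moving average converges to the squared gradient instead of growing without bound, the step size settles near $\eta$ rather than decaying toward zero.

```python
import math


def rmsprop_step(theta, avg_sq, grad, lr, beta=0.99, eps=1e-8):
    """One RMSProp update using an exponential moving average of grad ** 2."""
    new_avg = [beta * v + (1 - beta) * g * g for v, g in zip(avg_sq, grad)]
    new_theta = [
        p - lr / math.sqrt(v + eps) * g
        for p, v, g in zip(theta, new_avg, grad)
    ]
    return new_theta, new_avg


theta, avg_sq = [0.0], [0.0]
last_step = 0.0
for _ in range(2000):
    new_theta, avg_sq = rmsprop_step(theta, avg_sq, grad=[1.0], lr=0.01)
    last_step = theta[0] - new_theta[0]  # size of the step just taken
    theta = new_theta
# avg_sq[0] has converged to about 1.0, so last_step is about lr = 0.01
```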


Adam and AdamW

Definition: Adam (Adaptive Moment Estimation)

Adam combines momentum and RMSProp:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_\theta \mathcal{L}_t)^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Default: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
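The four Adam equations translate almost line for line into code. This is a sketch only: `adam_step` is a hypothetical helper, the step counter `t` is assumed to start at 1, and the toy loss is $0.5 \sum_i \theta_i^2$.

```python
import math


def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step index used for bias correction."""
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in new_m]  # bias-corrected mean
    v_hat = [vi / (1 - beta2 ** t) for vi in new_v]  # bias-corrected variance
    new_theta = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return new_theta, new_m, new_v


theta, m, v = [1.0, -3.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 4):
    grad = list(theta)  # gradient of the toy loss 0.5 * sum(theta_i ** 2)
    theta, m, v = adam_step(theta, m, v, grad, t)
```

On the first step the bias correction gives $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so each parameter moves by almost exactly $\eta$ in the direction opposite to its gradient, regardless of the gradient's magnitude.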

Figure: Optimizer Comparison, convergence speed. Loss vs. training steps for SGD, SGD+M, Adam, and AdamW.

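Definition: AdamW (Decoupled Weight Decay)

AdamW decouples weight decay from the gradient update:

$$\theta_{t+1} = (1 - \lambda)\theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

where $\lambda$ is the weight decay coefficient. This is different from L2 regularization, which is applied through the gradient. Note that some implementations, including PyTorch's torch.optim.AdamW, scale the decay by the learning rate, so the weights shrink by a factor of $(1 - \eta\lambda)$ per step.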

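Adam vs AdamW

Standard Adam with L2 regularization adds the penalty term $\lambda\theta_t$ to the gradient, so the penalty is rescaled by the adaptive factor $1/(\sqrt{\hat{v}_t} + \epsilon)$ along with everything else. Parameters with a history of large gradients are therefore regularized less than parameters with small gradients. AdamW applies the decay directly to the weights, so every weight shrinks at the same relative rate regardless of its gradient history. AdamW is now a common default optimizer for transformers and other large models.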
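The difference between the two is easiest to see on a single scalar weight with zero loss gradient, where only the regularizer acts. The sketch below is illustrative: the function name `adam_step_with_decay` and its `decoupled` flag are hypothetical, and the decoupled branch uses the $(1 - \lambda)$ form given in this lesson.

```python
import math


def adam_step_with_decay(theta, m, v, grad, t, lr, weight_decay, decoupled,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step with either L2 regularization or decoupled decay."""
    if not decoupled:
        # L2 regularization: the penalty is folded into the gradient.
        grad = grad + weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    if decoupled:
        # AdamW: shrink the weight directly, outside the adaptive scaling.
        # (Libraries such as PyTorch use 1 - lr * weight_decay here instead.)
        theta = (1 - weight_decay) * theta
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Zero loss gradient, so only the regularizer moves the weight.
theta_adamw, _, _ = adam_step_with_decay(
    1.0, 0.0, 0.0, grad=0.0, t=1, lr=1e-3, weight_decay=0.01, decoupled=True)
theta_l2, _, _ = adam_step_with_decay(
    1.0, 0.0, 0.0, grad=0.0, t=1, lr=1e-3, weight_decay=0.01, decoupled=False)
```

In this toy case the decoupled update shrinks the weight by the factor $(1 - \lambda)$, giving about 0.99. The L2 variant moves it by roughly $\eta$, giving about 0.999, because the adaptive normalization cancels the magnitude of the penalty.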


Learning Rate Schedules

Step Decay

Definition: Step Decay

Reduce the learning rate by a factor $\gamma$ every $k$ epochs:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / k \rfloor}$$

Simple, but both the decay factor $\gamma$ and the interval $k$ must be tuned manually.
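Step decay is a one-line function of the epoch index. The name `step_decay_lr` and the values used below (base rate 0.1, factor 0.1, interval 30 epochs) are illustrative assumptions, not recommendations from this lesson.

```python
def step_decay_lr(epoch, base_lr, gamma, step_size):
    """Learning rate after multiplying by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rates at a few epochs for base_lr=0.1, gamma=0.1, step_size=30.
schedule = {epoch: step_decay_lr(epoch, 0.1, 0.1, 30) for epoch in (0, 29, 30, 65)}
```

Epochs 0 through 29 use the base rate, epochs 30 through 59 use one tenth of it, and so on.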

Cosine Annealing

Definition: Cosine Annealing

The learning rate follows a cosine curve:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$

where $t$ is the current step and $T$ is the total number of steps. The decay is smooth: slow at first, fastest around the midpoint, and slow again as the rate approaches $\eta_{\min}$. Often combined with warmup.
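The cosine formula can be evaluated directly. In this sketch `cosine_lr` is a hypothetical helper, and the sample points are chosen to show the shape of the curve: nearly flat near the start and end, steepest in the middle.

```python
import math


def cosine_lr(t, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate at step t, for 0 <= t <= total_steps."""
    cosine = math.cos(math.pi * t / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)


# Sample the schedule at the start, 10%, midpoint, 90%, and end.
samples = [cosine_lr(t, 100, lr_max=1.0) for t in (0, 10, 50, 90, 100)]
```

After the first 10% of training the rate has dropped by less than 3% of its range, while around the midpoint it is falling at its fastest.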

Warmup

Definition: Linear Warmup

Linearly increase the learning rate from 0 to $\eta_{\max}$ over $T_w$ steps:

$$\eta_t = \eta_{\max} \cdot \frac{t}{T_w}, \quad t \leq T_w$$

Warmup stabilizes training in the early stages when gradients are noisy.
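Warmup is usually followed by a decay phase. The sketch below chains linear warmup into cosine annealing. The name `warmup_cosine_lr` is hypothetical, and measuring the cosine progress from the end of warmup is an assumption, since the lesson does not specify how the two phases are joined.

```python
import math


def warmup_cosine_lr(t, warmup_steps, total_steps, lr_max, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay to lr_min at total_steps."""
    if t <= warmup_steps:
        return lr_max * t / warmup_steps
    # Assumption: cosine progress is measured from the end of warmup.
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


# A 1000-step run with 100 warmup steps and a hypothetical peak rate of 3e-4.
schedule = [warmup_cosine_lr(t, 100, 1000, lr_max=3e-4) for t in range(1001)]
```

The two phases meet at $t = T_w$, where both expressions evaluate to $\eta_{\max}$, so the schedule is continuous.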

Figure: Learning Rate Schedules, cosine + warmup. Learning rate vs. training steps between $\eta_{\min}$ and $\eta_{\max}$, with curves labeled Warmup, Cosine, Linear warmup, and Step decay.

Optimizer Selection Guide

Optimizer Selection Decision Tree: Task Type?

  • NLP/Transformers: AdamW (lr=2e-5, warmup, cosine)
  • Computer Vision: SGD+Momentum (lr=0.1, cosine, WSD)
  • GANs/RL: Adam (lr=1e-4, beta1=0.5)

Optimizer Properties

  • SGD: Best generalization, slow convergence
  • Adam: Fast convergence, may overfit
  • AdamW: Decoupled weight decay, default for transformers
  • LAMB: Large batch training, distributed

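Optimizer Hyperparameters

| Optimizer | Default LR | $\beta_1$ | $\beta_2$ | $\epsilon$ | Weight Decay |
|---|---|---|---|---|---|
| SGD | 0.1 | N/A | N/A | N/A | 1e-4 |
| SGD+Momentum | 0.1 | 0.9 | N/A | N/A | 1e-4 |
| Adam | 1e-3 | 0.9 | 0.999 | 1e-8 | 0 |
| AdamW | 1e-3 | 0.9 | 0.999 | 1e-8 | 0.01 |
| LAMB | 1e-3 | 0.9 | 0.999 | 1e-6 | 0.01 |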

Practical Tips

Learning Rate Finding

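Use a learning rate finder: start with a very small learning rate, increase it exponentially over a short run, and plot loss against learning rate. A good choice is the learning rate at which the loss falls most steeply, which is typically about 10x smaller than the learning rate at which the loss reaches its minimum.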
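The selection step of a learning rate finder can be sketched without any training code. Both helpers below, `exponential_lrs` and `steepest_descent_lr`, are hypothetical, and the loss values are synthetic numbers made up for illustration, not measurements from a real run.

```python
import math


def exponential_lrs(lr_start, lr_end, num_steps):
    """Learning rates spaced evenly on a log scale from lr_start to lr_end."""
    ratio = (lr_end / lr_start) ** (1 / (num_steps - 1))
    return [lr_start * ratio ** i for i in range(num_steps)]


def steepest_descent_lr(lrs, losses):
    """Return the lr ending the interval where loss falls fastest per decade."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        decades = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / decades
        if slope < best_slope:
            best_lr, best_slope = lrs[i], slope
    return best_lr  # None if the loss never decreased


# Synthetic sweep: loss improves up to lr=1e-1, then diverges at lr=1.0.
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.28, 2.00, 1.20, 1.00, 5.00]
suggested = steepest_descent_lr(lrs, losses)
```

In this synthetic sweep the steepest drop ends at 1e-2, one decade below the learning rate with the lowest loss. In practice the loss curve is noisy, so it is common to smooth it before looking for the steepest segment.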

Common Mistakes

  • Using Adam with default lr=1e-3 for all tasks (task-dependent!)
  • Forgetting weight decay with Adam (use AdamW instead)
  • Not using warmup for large batch training
  • Changing multiple hyperparameters simultaneously

Summary

  • SGD + Momentum for computer vision: best generalization with proper tuning
  • AdamW for NLP/transformers: fast convergence, decoupled weight decay
  • Cosine annealing + warmup is the standard learning rate schedule
  • Learning rate is the most important hyperparameter — tune it first
  • Different tasks require different optimizers and hyperparameters
