Training Loops: Loss Functions, Optimizers and Learning Rate Schedules

1. Training Loop Anatomy

Every deep learning model, regardless of architecture, follows the same fundamental loop: forward pass → compute loss → backward pass → update parameters.

For each epoch:
    For each batch:
        1. Forward pass: ŷ = f(x; θ)
        2. Compute loss: L = Loss(y, ŷ)
        3. Zero gradients: ∇θ ← 0
        4. Backward pass: ∇θ = ∂L/∂θ (autograd)
        5. Update: θ ← θ − η · ∇θ

The PyTorch Implementation

import torch
import torch.nn as nn

def train_one_epoch(model, dataloader, criterion, optimizer, device):
    model.train()
    total_loss = 0.0

    for batch_x, batch_y in dataloader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)

        # 1. Forward pass
        predictions = model(batch_x)

        # 2. Compute loss
        loss = criterion(predictions, batch_y)

        # 3. Zero gradients
        optimizer.zero_grad()

        # 4. Backward pass
        loss.backward()

        # 5. Update parameters
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)

Key subtleties:

model.train() puts dropout and batch normalization layers into training mode
optimizer.zero_grad() must be called before .backward() in every iteration, because gradients accumulate by default
.item() returns the loss as a plain Python number, detached from the computation graph

2. Loss Functions

Loss functions quantify the mismatch between predictions and targets. The choice of loss function encodes what "good" means for your task.

2.1 Mean Squared Error (MSE)

For regression tasks:

L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

Gradient (per sample):

\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)

Properties:

Convex in the parameters for linear models → any local minimum is a global minimum
Penalizes large errors quadratically → sensitive to outliers
Equivalent to maximizing Gaussian log-likelihood with fixed variance
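
As a quick numerical check of the formulas above, the following plain-Python sketch (no PyTorch; the function names `mse` and `mse_grad` and the sample values are illustrative assumptions) computes the loss and its per-sample gradient, then compares one gradient entry against a finite difference.

```python
def mse(y_true, y_pred):
    """Mean squared error over N samples."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def mse_grad(y_true, y_pred):
    """Per-sample gradient dL/dy_hat_i = (2/N) * (y_hat_i - y_i)."""
    n = len(y_true)
    return [2.0 / n * (p - y) for y, p in zip(y_true, y_pred)]


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 5.0]   # the last prediction is an outlier (error of 2)

loss = mse(y_true, y_pred)   # (0.25 + 0 + 4) / 3
grad = mse_grad(y_true, y_pred)

# Finite-difference check on the third prediction.
eps = 1e-6
bumped = y_pred[:2] + [y_pred[2] + eps]
numeric = (mse(y_true, bumped) - loss) / eps
```

The single outlier contributes 4 of the 4.25 total squared error, which illustrates the sensitivity to outliers noted above.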

2.2 Cross-Entropy Loss

For multi-class classification with $C$ classes:

L_{\text{CE}} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)

where $y_c$ is one-hot encoded and $\hat{y}_c = \text{softmax}(z)_c$ :

\text{softmax}(z)_c = \frac{e^{z_c}}{\sum_{j=1}^{C} e^{z_j}}

The combined CrossEntropyLoss in PyTorch takes raw logits and applies log-softmax followed by NLL loss in a numerically stable way:

L = -z_{y_{\text{true}}} + \log\left(\sum_{j=1}^{C} e^{z_j}\right)

Numerical stability: The log-sum-exp trick computes $\log\sum e^{z_j} = m + \log\sum e^{z_j - m}$ where $m = \max(z)$ .
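
The log-sum-exp trick can be sketched in plain Python as follows. This is an illustration of the formula above, not PyTorch's actual implementation; `cross_entropy_from_logits` is a hypothetical helper name.

```python
import math


def cross_entropy_from_logits(logits, target):
    """L = -z[target] + logsumexp(z), computed with the max-shift trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Moderate logits: matches the naive -log(softmax(z)[target]).
z = [2.0, 1.0, 0.1]
naive = -math.log(math.exp(z[0]) / sum(math.exp(v) for v in z))
stable = cross_entropy_from_logits(z, 0)

# Large logits: math.exp(1000.0) would raise OverflowError,
# but the shifted version stays finite.
big = cross_entropy_from_logits([1000.0, 999.0, 0.0], 0)
```

Subtracting the maximum means the largest exponent is always $e^0 = 1$, so the sum can never overflow.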

2.3 Focal Loss

Addressing class imbalance (e.g., object detection where 99% of anchors are background):

L_{\text{focal}} = -\alpha_t (1 - p_t)^\gamma \log(p_t)

where:

$p_t$ = model's estimated probability for the correct class
$\gamma$ = focusing parameter (typically 2)
$\alpha_t$ = class balancing weight

When $\gamma = 0$ , focal loss reduces to standard cross-entropy (up to the $\alpha_t$ weight). When $\gamma = 2$ , well-classified examples ( $p_t > 0.9$ ) have their loss reduced by at least $100\times$ , since $(1 - 0.9)^2 = 0.01$ .
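
To make the down-weighting concrete, here is a small plain-Python sketch of the per-example focal loss. The function names and the choice $\alpha_t = 1$ are assumptions made for illustration.

```python
import math


def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss for one example, given the probability of the true class."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p_t):
    """Standard cross-entropy for one example."""
    return -math.log(p_t)


for p_t in (0.5, 0.9, 0.99):
    ratio = cross_entropy(p_t) / focal_loss(p_t)
    print(f"p_t={p_t}: cross-entropy is {ratio:.0f}x the focal loss")
```

With $\gamma = 2$ and $\alpha_t = 1$ the ratio is $1/(1 - p_t)^2$, so easy examples are suppressed far more strongly than hard ones.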

2.4 Contrastive Loss

For learning embeddings where similar items are close and dissimilar items are far apart:

L_{\text{contrastive}} = \frac{1}{2} y d^2 + \frac{1}{2}(1-y) \max(0, m - d)^2

where $d$ is the Euclidean distance between embeddings, $y = 1$ for similar pairs, and $m$ is the margin.

Triplet Loss (used in FaceNet):

L_{\text{triplet}} = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha)

where $a$ = anchor, $p$ = positive (same class), $n$ = negative (different class), $\alpha$ = margin.
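
Both embedding losses are short enough to write out directly. The sketch below uses plain Python lists as embeddings; the function names, the toy 2-D points, and the default margins are illustrative assumptions.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, y, margin=1.0):
    """y = 1 for a similar pair, y = 0 for a dissimilar pair."""
    d = euclidean(u, v)
    return 0.5 * y * d ** 2 + 0.5 * (1 - y) * max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge on squared distances: want d(a, p)^2 + alpha <= d(a, n)^2."""
    d_ap = euclidean(anchor, positive) ** 2
    d_an = euclidean(anchor, negative) ** 2
    return max(0.0, d_ap - d_an + alpha)


a, p, n = [0.0, 0.0], [0.1, 0.0], [1.0, 0.0]

far_dissimilar = contrastive_loss(a, n, y=0)    # already at the margin: no penalty
near_dissimilar = contrastive_loss(a, p, y=0)   # too close: penalised
good_triplet = triplet_loss(a, p, n)            # negative is far enough away
bad_triplet = triplet_loss(a, n, p)             # roles swapped: margin violated
```

In both losses, pairs or triplets that already satisfy the margin contribute zero loss and therefore zero gradient.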

Loss Function Selection Guide

Task	Loss Function	Why
Regression	MSE, MAE, Huber	MSE for clean data, Huber for outliers
Binary classification	BCE, Focal	Focal for imbalanced data
Multi-class classification	CrossEntropy, Focal	Focal for long-tailed distributions
Metric learning	Contrastive, Triplet	Learn embedding space structure
Segmentation	Dice loss, CE+Dice	Handle severe foreground/background imbalance
GANs	Adversarial loss	Minimax game between generator and discriminator

3. Optimizers

3.1 SGD (Stochastic Gradient Descent)

The simplest update rule:

\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)

Problem: Oscillates along high-curvature directions, converges slowly along flat directions.

3.2 SGD with Momentum

Adds a velocity term that accumulates past gradients:

v_{t+1} = \beta v_t + \nabla_\theta L(\theta_t)

\theta_{t+1} = \theta_t - \eta v_{t+1}

Commonly $\beta = 0.9$ . Momentum accelerates convergence in consistent gradient directions and dampens oscillations.

Physical analogy: A ball rolling downhill accumulates velocity. $\beta$ controls friction: lower $\beta$ means more friction.
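
The effect of the velocity term can be seen on a toy problem. The sketch below applies the update rule above to a hypothetical ill-conditioned quadratic (curvature 1 in one direction, 100 in the other); the objective, learning rate, and step count are assumptions chosen for illustration.

```python
import math


def grad(theta):
    """Gradient of f(theta) = 0.5 * (theta[0]**2 + 100 * theta[1]**2)."""
    return [theta[0], 100.0 * theta[1]]


def run(beta, lr=0.01, steps=100):
    theta = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        v = [beta * vi + gi for vi, gi in zip(v, g)]        # v <- beta * v + g
        theta = [ti - lr * vi for ti, vi in zip(theta, v)]  # theta <- theta - lr * v
    return math.hypot(theta[0], theta[1])  # distance to the minimum at (0, 0)


dist_sgd = run(beta=0.0)        # beta = 0 recovers plain SGD
dist_momentum = run(beta=0.9)
```

With the same learning rate, plain SGD is still far from the minimum along the flat direction after 100 steps, while the momentum run ends much closer.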

3.3 RMSProp

Adapts the learning rate per parameter based on the magnitude of recent gradients:

s_{t+1} = \beta s_t + (1 - \beta)(\nabla_\theta L)^2

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_{t+1} + \epsilon}} \nabla_\theta L

Parameters with large gradients get a smaller effective learning rate; parameters with small gradients get a larger one. Default: $\beta = 0.9$ , $\epsilon = 10^{-8}$ .

3.4 Adam (Adaptive Moment Estimation)

Combines momentum (first moment) and RMSProp (second moment):

m_{t+1} = \beta_1 m_t + (1 - \beta_1) \nabla_\theta L

v_{t+1} = \beta_2 v_t + (1 - \beta_2) (\nabla_\theta L)^2

Bias correction (critical in early steps):

\hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}}, \quad \hat{v}_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{t+1}}

Update:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_{t+1}} + \epsilon} \hat{m}_{t+1}

Defaults: $\beta_1 = 0.9$ , $\beta_2 = 0.999$ , $\epsilon = 10^{-8}$ .
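
The following scalar sketch shows why bias correction matters on the first step. It follows the equations above, with `t` as a 1-based step count; `adam_step` is an illustrative helper, not a library function.

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step from zero-initialised moments, with a large gradient.
theta1, m1, v1 = adam_step(theta=0.0, g=50.0, m=0.0, v=0.0, t=1)

# The same step computed from the raw (uncorrected) moments.
uncorrected = 1e-3 * m1 / (math.sqrt(v1) + 1e-8)
```

With correction, the first step has magnitude close to the learning rate regardless of the gradient scale. Without it, the ratio $m / \sqrt{v}$ equals $(1 - \beta_1)/\sqrt{1 - \beta_2} \approx 3.16$ on the first step, so the step would be about three times larger than intended.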

3.5 AdamW (Adam with Decoupled Weight Decay)

In Adam, an L2 penalty adds $\lambda \theta$ to the gradient, so the decay term is rescaled by the adaptive learning rate and the effective weight decay differs from parameter to parameter. AdamW decouples weight decay from the gradient:

\theta_{t+1} = (1 - \eta \lambda) \theta_t - \frac{\eta}{\sqrt{\hat{v}_{t+1}} + \epsilon} \hat{m}_{t+1}

This makes weight decay consistent across parameters regardless of gradient magnitude. AdamW is the default optimizer for training transformers.
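
The difference can be illustrated on a single scalar parameter. The sketch below is deliberately restricted to the very first step, where bias correction gives $\hat{m} = g$ and $\hat{v} = g^2$; the helper names and the values of `theta`, `lr`, and `lam` are assumptions for illustration.

```python
import math


def adam_direction(g, eps=1e-8):
    """Adam's normalized step at t = 1, where m_hat = g and v_hat = g**2."""
    return g / (math.sqrt(g * g) + eps)


def adam_l2_step(theta, g, lr, lam):
    # L2 penalty: lam * theta is added to the gradient, then normalized.
    return theta - lr * adam_direction(g + lam * theta)


def adamw_step(theta, g, lr, lam):
    # Decoupled decay: shrink theta directly, then apply the Adam step.
    return (1 - lr * lam) * theta - lr * adam_direction(g)


theta, lr, lam = 2.0, 1e-3, 0.1
for g in (0.01, 10.0):  # small vs. large task gradient
    no_decay = theta - lr * adam_direction(g)
    shrink_l2 = no_decay - adam_l2_step(theta, g, lr, lam)
    shrink_w = no_decay - adamw_step(theta, g, lr, lam)
    print(f"g={g}: extra shrink from L2={shrink_l2:.2e}, from AdamW={shrink_w:.2e}")
```

In this first-step setting the L2 term is almost entirely normalized away by Adam's denominator, while AdamW shrinks the weight by $\eta \lambda \theta$ for both gradient sizes.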

Optimizer Selection Decision Tree

Is your model a transformer?
├─ Yes → AdamW (lr=3e-4, weight_decay=0.01)
└─ No
   ├─ Computer Vision (CNN)?
   │  ├─ Yes → SGD+Momentum (lr=0.1, momentum=0.9) with cosine schedule
   │  └─ No
   │     ├─ Reinforcement Learning? → Adam (lr=3e-4)
   │     └─ General deep learning? → Start with Adam, try SGD if generalization gap

4. Learning Rate Schedules

The learning rate is often the single most important hyperparameter. A fixed learning rate is rarely optimal: you want large steps early (fast convergence) and small steps later (fine-tuning).

4.1 Step Decay

Reduce the learning rate by a factor every $k$ epochs:

\eta_t = \eta_0 \cdot \gamma^{\lfloor t / k \rfloor}

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# lr is multiplied by 0.1 every 30 epochs (at epochs 30, 60, ...)
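
The closed-form expression can also be evaluated directly, which is handy for checking what a scheduler will do before training. A minimal sketch, assuming an initial rate of 0.1 (a value chosen for illustration):

```python
def step_decay(epoch, lr0=0.1, gamma=0.1, k=30):
    """eta_t = eta_0 * gamma ** floor(t / k)."""
    return lr0 * gamma ** (epoch // k)


for epoch in (0, 29, 30, 59, 60, 90):
    print(epoch, step_decay(epoch))
```

The rate stays constant within each block of `k` epochs and drops by a factor of `gamma` at each boundary.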

4.2 Cosine Annealing

Smoothly anneal from $\eta_{\max}$ to $\eta_{\min}$ following a cosine curve:

\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

Cosine annealing is a common default schedule in modern training pipelines. It decays gently at both ends and fastest in the middle.

4.3 Warmup + Cosine

Linearly increase the learning rate from 0 to $\eta_{\max}$ over $T_w$ warmup steps, then cosine anneal:

\eta_t = \begin{cases} \eta_{\max} \cdot \frac{t}{T_w} & \text{if } t \leq T_w \\ \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\frac{\pi (t - T_w)}{T - T_w})) & \text{if } t > T_w \end{cases}

Warmup is essential for transformers: training is unstable in early steps, when parameters are random and adaptive optimizers have unreliable second-moment estimates.
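
The piecewise formula above translates directly into a function of the step index. This is a plain-Python sketch; the step counts and the peak rate of 3e-4 are illustrative assumptions.

```python
import math


def warmup_cosine(t, total, warmup, lr_max=3e-4, lr_min=0.0):
    """Linear warmup to lr_max over `warmup` steps, then cosine decay to lr_min."""
    if t <= warmup:
        return lr_max * t / warmup
    progress = (t - warmup) / (total - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


schedule = [warmup_cosine(t, total=1000, warmup=100) for t in range(1001)]
```

The schedule starts at 0, peaks at `lr_max` at the end of warmup, passes through the midpoint of `lr_max` and `lr_min` halfway through the decay, and ends at `lr_min`.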

4.4 OneCycle Policy

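Cycles the learning rate from low → high → low within a single cycle, with momentum going in reverse (high → low → high). With cosine annealing, each phase interpolates from its starting rate $\eta_{\text{start}}$ to its ending rate $\eta_{\text{end}}$ :

\eta_t = \eta_{\text{end}} + \frac{1}{2}(\eta_{\text{start}} - \eta_{\text{end}})(1 + \cos(\pi \cdot \text{pct}))

where pct goes from 0 to 1 over the steps of that phase. In the warmup phase the rate rises from a low initial value to $\eta_{\max}$ ; in the annealing phase it falls from $\eta_{\max}$ to $\eta_{\min}$ . Smith (2018) showed this can converge in fewer epochs than standard schedules.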

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, total_steps=total_training_steps
)
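
The shape of the learning-rate curve can be sketched without PyTorch. The version below is a simplified two-phase schedule with cosine interpolation in each phase; the parameter names and default values (`pct_start`, `div_factor`, `final_div_factor`) are assumptions modelled on the OneCycleLR arguments, and momentum cycling is omitted.

```python
import math


def cos_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * pct))


def one_cycle_lr(step, total_steps, max_lr=0.01, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Two-phase one-cycle schedule: warm up to max_lr, then anneal far below it."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    peak = pct_start * total_steps
    if step <= peak:
        return cos_anneal(initial_lr, max_lr, step / peak)
    return cos_anneal(max_lr, min_lr, (step - peak) / (total_steps - peak))


lrs = [one_cycle_lr(s, total_steps=100) for s in range(101)]
```

Under these assumptions the rate starts at `max_lr / 25`, reaches `max_lr` 30% of the way through training, and finishes several orders of magnitude below the starting value.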

5. Regularization

Overfitting occurs when the model memorizes training data rather than learning generalizable patterns. Regularization techniques combat this.

5.1 Dropout

During training, each neuron is independently set to zero with probability $p$ :

\hat{h}_j = \frac{m_j \cdot h_j}{1 - p}, \quad m_j \sim \text{Bernoulli}(1 - p)

The $\frac{1}{1-p}$ scaling (inverted dropout) keeps the expected activation during training equal to the unscaled activation, so no rescaling is needed at test time, where no dropout is applied.

Intuition: Dropout forces the network to learn redundant representations, since no single neuron can be relied upon. It can be interpreted as training an ensemble of $2^n$ sub-networks (where $n$ is the number of neurons).
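
Inverted dropout is easy to check empirically. The sketch below applies it to a long vector of ones in plain Python; the helper name, the fixed seed, and the vector length are illustrative assumptions.

```python
import random


def inverted_dropout(h, p, rng):
    """Zero each activation with probability p; scale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in h]


rng = random.Random(0)
h = [1.0] * 100_000
dropped = inverted_dropout(h, p=0.5, rng=rng)

mean_train = sum(dropped) / len(dropped)   # close to 1.0 in expectation
mean_test = sum(h) / len(h)                # no dropout at test time
```

Each surviving activation is doubled when `p = 0.5`, so the average over many units stays close to the test-time value even though half of them are zeroed.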

5.2 Batch Normalization

Normalizes activations across the batch dimension for each feature:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

y_i = \gamma \hat{x}_i + \beta

where $\mu_B = \frac{1}{m}\sum x_i$ and $\sigma_B^2 = \frac{1}{m}\sum(x_i - \mu_B)^2$ over the mini-batch.

$\gamma$ and $\beta$ are learnable parameters that allow the network to undo the normalization if needed.

During inference: Use running averages of $\mu$ and $\sigma^2$ accumulated during training (via exponential moving average).

Benefits:

Allows higher learning rates
Reduces sensitivity to initialization
Provides mild regularization (batch statistics add noise)

Limitation: Depends on batch statistics → problematic for small batches, sequence models, or distributed training with small per-device batches.

5.3 Layer Normalization

Normalizes across the feature dimension for each sample (independent of batch size):

\hat{x}_i = \frac{x_i - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}, \quad \mu_L = \frac{1}{H}\sum_{j=1}^{H} x_j, \quad \sigma_L^2 = \frac{1}{H}\sum_{j=1}^{H}(x_j - \mu_L)^2
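
The only difference between batch normalization (5.2) and layer normalization is the axis over which the statistics are computed. The sketch below shows both on a tiny 3×3 input in plain Python, with the learnable scale and shift omitted (equivalent to $\gamma = 1$, $\beta = 0$); the helper name and data are illustrative assumptions.

```python
import math


def normalize(values, eps=1e-5):
    """Subtract the mean and divide by sqrt(variance + eps)."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]


# Rows are samples, columns are features.
x = [[1.0, 10.0, 100.0],
     [2.0, 20.0, 200.0],
     [3.0, 30.0, 300.0]]

# Batch norm: statistics per feature (column), across the batch.
columns = [normalize(list(col)) for col in zip(*x)]
batch_normed = [list(row) for row in zip(*columns)]

# Layer norm: statistics per sample (row), across its features.
layer_normed = [normalize(row) for row in x]
```

After batch normalization every column has zero mean; after layer normalization every row does, and the result for one sample does not depend on the other samples in the batch.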

5.4 Weight Decay

Adds an L2 penalty to the loss:

L_{\text{total}} = L_{\text{task}} + \frac{\lambda}{2} \|\theta\|_2^2

This pushes weights toward zero, preventing any single weight from growing too large. Typical values: $\lambda \in [10^{-5}, 10^{-2}]$ .

With AdamW, weight decay is applied directly to parameters without going through the adaptive learning rate, making it more effective than L2 regularization with Adam.
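
For plain SGD the L2 penalty and direct weight decay coincide, which is where the name comes from. A short derivation, using the symbols already defined in this section and in 3.1:

```latex
\nabla_\theta L_{\text{total}} = \nabla_\theta L_{\text{task}} + \lambda \theta

\theta_{t+1} = \theta_t - \eta \left( \nabla_\theta L_{\text{task}}(\theta_t) + \lambda \theta_t \right)
             = (1 - \eta \lambda)\, \theta_t - \eta \nabla_\theta L_{\text{task}}(\theta_t)
```

The factor $(1 - \eta \lambda)$ is the same shrinkage that appears in the AdamW update in 3.5. The equivalence breaks for adaptive optimizers because the $\lambda \theta$ term is divided by the per-parameter denominator along with the rest of the gradient.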

6. Gradient Clipping

Exploding gradients cause numerical instability. Gradient clipping bounds the gradient norm.

Norm Clipping (recommended)

\hat{g} = \begin{cases} g & \text{if } \|g\| \leq \tau \\ \frac{\tau}{\|g\|} g & \text{if } \|g\| > \tau \end{cases}

This preserves the gradient direction while limiting magnitude.

Value Clipping

\hat{g}_i = \text{clip}(g_i, -\tau, \tau)

Clips each gradient component independently. This changes the gradient direction and is less preferred.

# Norm clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Value clipping
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
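
The difference between the two clipping modes is easiest to see on a two-component gradient. The following plain-Python sketch mirrors the formulas above; the function names are illustrative and are not the PyTorch utilities.

```python
import math


def clip_by_norm(g, tau):
    """Rescale g so its L2 norm is at most tau; direction is unchanged."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= tau:
        return list(g)
    return [x * tau / norm for x in g]


def clip_by_value(g, tau):
    """Clamp each component to [-tau, tau]; direction may change."""
    return [max(-tau, min(tau, x)) for x in g]


g = [3.0, 4.0]                          # norm 5
by_norm = clip_by_norm(g, tau=1.0)      # keeps the 3:4 ratio between components
by_value = clip_by_value(g, tau=1.0)    # both components clamp to the same value
```

Norm clipping scales the whole vector by $\tau / \|g\|$, so the update still points the same way. Value clipping flattens the larger component more, which rotates the update.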

When to use gradient clipping:

Training RNNs/LSTMs (almost always needed)
Training transformers (especially with large learning rates)
Large batch training where gradient norms can spike
Any situation with loss divergence or NaN losses

7. Mixed Precision Training

Uses 16-bit floating point (FP16) for most computations while keeping a 32-bit (FP32) master copy of weights.

Why Mixed Precision?

Metric	FP32	FP16	Speedup
Memory	4 bytes	2 bytes	2× less memory
Compute (A100)	19.5 TFLOPS	312 TFLOPS	~16× (with Tensor Cores)
Bandwidth	2 TB/s	2 TB/s	Same (but less data)

The Problem: Loss Scaling

FP16 has a much smaller range ( $\pm 65504$ ) and lower precision ( $\approx 3$ decimal digits) than FP32. Small gradients can underflow to zero. Solution: loss scaling. Multiply the loss by a large factor (e.g., 1024), compute gradients in this scaled space, then unscale before the update.

scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    predictions = model(batch_x)
    loss = criterion(predictions, batch_y)

scaler.scale(loss).backward()       # backward pass on the scaled loss
scaler.unscale_(optimizer)          # divide gradients by the scale factor before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)              # skip if gradients contain inf/nan
scaler.update()                     # adjust scale factor

Dynamic loss scaling: The GradScaler starts with a large scale factor and halves it whenever inf or nan gradients are detected, then increases it slowly when training is stable.
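
The underflow problem and its fix can be reproduced with the standard library alone, since Python's `struct` module supports IEEE 754 half precision through the `"e"` format. A minimal sketch, with `to_fp16` as a hypothetical helper and a fixed scale of 1024 (a real scaler adjusts this dynamically):

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad = 1e-8          # a small but meaningful gradient value
scale = 1024.0

unscaled = to_fp16(grad)                    # too small for FP16: becomes 0.0
recovered = to_fp16(grad * scale) / scale   # survives in FP16, then unscaled in full precision
```

Multiplying by the scale factor moves the value into the range FP16 can represent, and dividing afterwards in higher precision recovers it to within FP16 rounding error.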

BFloat16 Alternative

BFloat16 uses 8 exponent bits (same range as FP32) and 7 mantissa bits. Loss scaling is usually unnecessary, but BF16 is less precise than FP16, which has 10 mantissa bits. BF16 is often preferred on Ampere and newer GPUs.
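
The range-versus-precision trade-off can be illustrated by emulating both formats in plain Python. The sketch approximates BFloat16 by truncating a float32 to its top 16 bits, which is an assumption made for simplicity: real hardware typically rounds to nearest. Both helper names are hypothetical.

```python
import struct


def to_bf16(x):
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


small = 1e-8
print(to_fp16(small), to_bf16(small))   # FP16 underflows; BF16 keeps the magnitude

fine = 1.001
print(to_fp16(fine), to_bf16(fine))     # FP16 still differs from 1.0; BF16 does not
```

BF16 preserves tiny values that FP16 loses, at the cost of being unable to distinguish numbers that differ by less than about 1%.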

8. Distributed Training

Data Parallelism

The most common strategy: replicate the model on $N$ GPUs, split the batch across them, and average gradients:

g = \frac{1}{N} \sum_{i=1}^{N} g_i

# PyTorch DDP
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

Gradient synchronization: All-reduce communicates gradients across GPUs, using the NCCL (NVIDIA GPU) or Gloo (CPU) backend. DDP overlaps computation and communication: it starts reducing gradients for layer $l$ while the backward pass is still computing layer $l-1$ .
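
The averaging formula can be checked on a toy problem without any GPUs. The sketch below simulates four workers in plain Python, each holding an equally sized shard of one batch; the scalar linear model, data, and function name are illustrative assumptions.

```python
def grad_mse_linear(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5
n_workers = 4
shard = len(xs) // n_workers

# Each "GPU" computes a gradient on its own equally sized shard of the batch.
local_grads = [
    grad_mse_linear(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(n_workers)
]

# All-reduce (average): every worker ends up with the same gradient.
g_avg = sum(local_grads) / n_workers
g_full = grad_mse_linear(w, xs, ys)   # single-device gradient on the whole batch
```

Because the shards are the same size and the loss is a mean over samples, the averaged gradient matches the single-device gradient on the full batch, so all replicas take the same optimizer step.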

Model Parallelism

When the model is too large to fit on one GPU:

Pipeline parallelism: Split model layers across GPUs, micro-batch the pipeline
Tensor parallelism: Split individual operations (e.g., attention heads) across GPUs
ZeRO (DeepSpeed): Shard optimizer states, gradients, and parameters across GPUs

Training at Scale

Total batch size = num_GPUs × per_gpu_batch_size × gradient_accumulation_steps

Example: 8 GPUs × 32 samples × 4 accumulations = 1024 effective batch size

Large batch training requires adjusting the learning rate (linear scaling rule) and using warmup:

\eta_{\text{scaled}} = \eta_{\text{base}} \times \frac{B_{\text{scaled}}}{B_{\text{base}}}
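
As a worked example of the two formulas above, the following sketch computes the effective batch size for the 8-GPU configuration and scales a base learning rate accordingly. The base values (0.1 at batch size 256) are assumptions chosen for illustration.

```python
def effective_batch_size(num_gpus, per_gpu_batch, accumulation_steps):
    """Samples contributing to each optimizer step."""
    return num_gpus * per_gpu_batch * accumulation_steps


def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: lr grows in proportion to the batch size."""
    return base_lr * new_batch / base_batch


batch = effective_batch_size(num_gpus=8, per_gpu_batch=32, accumulation_steps=4)
lr = scaled_lr(base_lr=0.1, base_batch=256, new_batch=batch)
```

Going from 256 to 1024 samples per step multiplies the learning rate by 4. A rate this much larger is one reason warmup is recommended alongside the rule.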

Putting It All Together: A Modern Training Recipe

import math

import torch

# Placeholders: YourModel, device, and train_loader are assumed to be defined elsewhere.

# 1. Model
model = YourModel().to(device)

# 2. Optimizer (AdamW for transformers, SGD for CNNs)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# 3. Schedule (Warmup + Cosine)
warmup_steps = 1000
total_steps = 100000

def lr_lambda(step):
    # Multiplier applied to the base lr: linear warmup, then cosine decay to 0
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# 4. Loss
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# 5. Mixed precision
scaler = torch.cuda.amp.GradScaler()

# 6. Training loop
data_iter = iter(train_loader)  # assumes the loader yields at least total_steps batches
for step in range(total_steps):
    batch_x, batch_y = next(data_iter)
    batch_x, batch_y = batch_x.to(device), batch_y.to(device)

    with torch.cuda.amp.autocast():
        loss = criterion(model(batch_x), batch_y)

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)          # unscale before clipping so the threshold is meaningful
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()               # clear gradients for the next iteration
    scheduler.step()

Hyperparameter Defaults (IIT/MIT Research Standards)

Hyperparameter	Transformer	CNN (ResNet)
Optimizer	AdamW	SGD + Momentum
Learning rate	3e-4	0.1
Weight decay	0.01	1e-4
Batch size	256-2048	256
Warmup steps	2000-4000
Schedule	Cosine	Cosine
Gradient clip	1.0	None
Dropout	0.1	0.2
Label smoothing	0.1

Summary

Concept	Key Takeaway
Training loop	Forward → loss → zero_grad → backward → step
MSE	Regression; penalizes large errors quadratically
Cross-entropy	Classification; combined with log-softmax
Focal loss	Handles class imbalance via $(1-p_t)^\gamma$ weighting
Contrastive/triplet loss	Learn embedding spaces
SGD + Momentum	Best for CNNs; fast convergence with proper schedule
Adam/AdamW	Best for transformers; adaptive per-parameter lr
Cosine annealing	Smooth decay; default schedule in modern training
Warmup	Essential for transformers; stabilizes early training
Dropout	Ensembles sub-networks; scale by $\frac{1}{1-p}$ during training (inverted dropout)
BatchNorm	Normalize across batch; use in CNNs
LayerNorm	Normalize across features; use in transformers
Gradient clipping	Clip norm to prevent exploding gradients
Mixed precision	FP16/BF16 + loss scaling for 2-4× speedup
Distributed training	DDP for data parallelism; pipeline/tensor parallelism or ZeRO when the model does not fit on one GPU
