
Training Deep Networks — From Gradient Descent to Modern Optimizers

Master the techniques and best practices for training deep neural networks effectively.

Optimization algorithms — SGD, Adam, and beyond
Learning rate schedules — adaptive and warm-up strategies
Regularization — dropout, weight decay, and early stopping


Training Deep Networks — Complete Guide

Training deep networks requires careful choice of optimizer, learning rate, regularization, and debugging techniques. This guide covers the essential practical knowledge for training models that converge well and generalize.

The Training Loop

How the training loop works: The diagram shows the six-step process that repeats for every batch of data during neural network training.

Step 1: Forward pass. The input data flows through the network to produce predictions.
Step 2: Compute loss. The loss function (e.g., cross-entropy) measures how wrong the predictions are compared to the true labels.
Step 3: Zero gradients. Clear the gradients left over from the previous iteration (PyTorch accumulates gradients by default).
Step 4: Backward pass. Backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule through the computational graph.
Step 5: Clip gradients. Prevent exploding gradients by capping their norm (critical for RNNs and Transformers).
Step 6: Update weights. The optimizer adjusts the parameters using the computed gradients.

The loss curve inset shows the goal: loss should decrease over time as the model learns. The dashed arrow back to Step 1 shows that this loop repeats for every batch, typically thousands of times per epoch.
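The six steps can be sketched in PyTorch as follows. This is a minimal illustration rather than a reference implementation: the toy data, the small network, the learning rate, and the helper name `train_one_epoch` are all assumptions made for the example.

```python
import torch
from torch import nn

def train_one_epoch(model, loader, optimizer, loss_fn, max_norm=1.0):
    """Run the six steps once per batch and return the mean batch loss."""
    model.train()
    total_loss = 0.0
    num_batches = 0
    for inputs, targets in loader:
        outputs = model(inputs)            # Step 1: forward pass
        loss = loss_fn(outputs, targets)   # Step 2: compute loss
        optimizer.zero_grad()              # Step 3: clear old gradients
        loss.backward()                    # Step 4: backward pass
        # Step 5: rescale gradients so their global norm is at most max_norm
        nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()                   # Step 6: update weights
        total_loss += loss.item()
        num_batches += 1
    return total_loss / num_batches

# Toy data (hypothetical): 64 samples, 10 features, 3 classes, batches of 16.
torch.manual_seed(0)
features = torch.randn(64, 10)
labels = torch.randint(0, 3, (64,))
loader = [(features[i:i + 16], labels[i:i + 16]) for i in range(0, 64, 16)]

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

epoch_losses = [train_one_epoch(model, loader, optimizer, loss_fn) for _ in range(20)]
```

Plotting `epoch_losses` would give the kind of decreasing loss curve described above. In a real project the list of batches would normally be replaced by a `DataLoader`.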

Optimizers

Learning Rate Schedules


Cosine Annealing (a common default for Transformers), where $t$ is the current step and $T$ is the total number of steps:

\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)
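As a quick check of the cosine formula, here is a small sketch in plain Python. The function name `cosine_annealing` and the example values (100 steps, rates between 1e-5 and 1e-3) are chosen for illustration only.

```python
import math

def cosine_annealing(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` (0 <= step <= total_steps) under cosine annealing."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Illustrative values: 100 steps, decaying from 1e-3 down to 1e-5.
schedule = [cosine_annealing(t, 100, lr_max=1e-3, lr_min=1e-5) for t in range(101)]
```

The schedule starts at `lr_max`, passes through the midpoint of the two rates halfway through training, and ends at `lr_min`.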

Linear Warmup + Decay: Start at zero, linearly increase to $\eta_{\max}$ over $T_{\text{warmup}}$ steps, then decay linearly so that the rate reaches zero at step $T$:

\eta_t = \begin{cases} \eta_{\max} \cdot t / T_{\text{warmup}} & t \leq T_{\text{warmup}} \\ \eta_{\max} \cdot (T - t) / (T - T_{\text{warmup}}) & t > T_{\text{warmup}} \end{cases}
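A sketch of linear warmup followed by linear decay, in plain Python. It assumes the decay is measured from the end of warmup, so the rate equals the peak exactly at `warmup_steps` and reaches zero at `total_steps`. The function name and the example step counts are illustrative.

```python
def warmup_linear_decay(step, total_steps, warmup_steps, lr_max):
    """Linear warmup from 0 to lr_max, then linear decay back to 0.

    Assumes 0 < warmup_steps < total_steps and 0 <= step <= total_steps.
    """
    if step <= warmup_steps:
        return lr_max * step / warmup_steps
    return lr_max * (total_steps - step) / (total_steps - warmup_steps)

# Illustrative values: 1000 total steps, 100 of them warmup (10%).
lr_start = warmup_linear_decay(0, 1000, 100, 1e-3)
lr_peak = warmup_linear_decay(100, 1000, 100, 1e-3)
lr_mid_decay = warmup_linear_decay(550, 1000, 100, 1e-3)
lr_end = warmup_linear_decay(1000, 1000, 100, 1e-3)
```

Because the two branches agree at `warmup_steps`, the learning rate does not jump when warmup ends.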

Step Decay: Reduce LR by factor $\gamma$ every $k$ epochs: $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor}$
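The step decay rule translates directly to integer division. In this sketch the function name `step_decay` and the values (initial rate 0.1, factor 0.1, every 30 epochs) are illustrative assumptions.

```python
def step_decay(epoch, lr0, gamma, k):
    """Multiply the initial rate lr0 by gamma once for every k completed epochs."""
    return lr0 * gamma ** (epoch // k)

# Illustrative values: start at 0.1 and divide by 10 every 30 epochs.
lr_epoch_0 = step_decay(0, lr0=0.1, gamma=0.1, k=30)
lr_epoch_29 = step_decay(29, lr0=0.1, gamma=0.1, k=30)
lr_epoch_30 = step_decay(30, lr0=0.1, gamma=0.1, k=30)
lr_epoch_60 = step_decay(60, lr0=0.1, gamma=0.1, k=30)
```

The rate stays flat within each block of `k` epochs and drops only at the block boundaries.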

Warmup is critical: It prevents instability early in training, when gradients are noisy. A typical warmup length is 1-10% of total steps.

Regularization

Debugging Training

Mixed Precision Training

Mixed Precision (FP16/BF16)

Modern GPUs support FP16 and BF16 arithmetic with peak throughput roughly 2-8x that of FP32, depending on the hardware:

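FP16: 16-bit floating point with a narrow dynamic range; fast, but prone to overflow and underflow
BF16: Brain Float 16; same exponent range as FP32 with less precision, so it is safer against overflow
Loss scaling: Multiply the loss by a large factor before the backward pass to prevent underflow in FP16 gradients, then divide the gradients by the same factor before the optimizer step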
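The range differences can be illustrated with the standard library alone. The sketch below rounds values to FP16 with `struct`'s half-precision format and simulates BF16 by truncating an FP32 encoding to its top 16 bits (real hardware usually rounds instead of truncating). The helper names, the sample values, and the scale factor of 65536 are assumptions for the example.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest FP16 value.

    struct raises OverflowError for out-of-range input; real FP16
    hardware would typically produce infinity instead.
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_bf16(value):
    """Simulate BF16 by keeping only the top 16 bits of the FP32 encoding."""
    fp32_bits = struct.pack(">f", value)
    return struct.unpack(">f", fp32_bits[:2] + b"\x00\x00")[0]

tiny_gradient = 1e-8   # below the smallest positive FP16 value
large_value = 70000.0  # above the largest finite FP16 value (65504)

# Underflow: the tiny gradient becomes zero in FP16 but survives in BF16.
fp16_tiny = to_fp16(tiny_gradient)
bf16_tiny = to_bf16(tiny_gradient)

# Overflow: the large value cannot be stored in FP16 but fits in BF16.
try:
    to_fp16(large_value)
    fp16_overflowed = False
except OverflowError:
    fp16_overflowed = True
bf16_large = to_bf16(large_value)

# Loss scaling: scale up before rounding to FP16, then unscale in full precision.
loss_scale = 65536.0
recovered_gradient = to_fp16(tiny_gradient * loss_scale) / loss_scale
```

Here `fp16_tiny` is zero, while `recovered_gradient` is close to the original value, which is the effect loss scaling is meant to achieve. `bf16_large` is coarser than the input because BF16 keeps fewer mantissa bits.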

Benefits: up to 2x memory savings, 2-3x end-to-end speedup in practice, and room for larger batch sizes.

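Automatic Mixed Precision (AMP) in PyTorch: the torch.cuda.amp.autocast() context manager (exposed as torch.autocast(device_type="cuda") in recent releases) chooses a precision for each operation automatically.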
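A minimal sketch of one autocast training step, written with the device-agnostic `torch.autocast` entry point and BF16 so that it also runs on CPU. The helper name `amp_train_step`, the toy model, and the data are assumptions for illustration. With FP16 on a GPU, one would typically add a gradient scaler such as `torch.cuda.amp.GradScaler` for loss scaling.

```python
import torch
from torch import nn

def amp_train_step(model, optimizer, inputs, targets, device_type="cpu"):
    """One training step with BF16 autocast; parameters stay in FP32."""
    optimizer.zero_grad()
    # Inside the context, eligible ops (e.g. matrix multiplies) run in BF16.
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        outputs = model(inputs)
    # Compute the loss in FP32 for numerical stability.
    loss = nn.functional.mse_loss(outputs.float(), targets)
    loss.backward()
    optimizer.step()
    return outputs.dtype, loss.item()

# Toy setup (hypothetical): a single linear layer on random data.
torch.manual_seed(0)
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(16, 8), torch.randn(16, 1)

out_dtype, loss_value = amp_train_step(model, optimizer, x, y)
```

The forward pass produces BF16 activations while `model.weight` remains FP32, which is the "mixed" part of mixed precision: low-precision compute with full-precision parameters.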

Key Takeaways

Summary: Training Deep Networks

AdamW is the default optimizer for Transformers; SGD with momentum remains common for CNNs
Learning rate is the most important hyperparameter — tune first
Cosine annealing with warmup is the standard schedule
Dropout (0.1-0.5) and weight decay (1e-4 to 0.01) prevent overfitting
Layer Norm for Transformers, Batch Norm for CNNs
Gradient clipping (max_norm=1.0) prevents exploding gradients
Mixed precision (BF16/FP16) saves memory and speeds up training
Gradient accumulation simulates larger batch sizes
Early stopping is one of the simplest and most effective forms of regularization
Debug by monitoring: loss curves, gradient norms, activation statistics
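The gradient accumulation point can be checked with a little arithmetic. The sketch below uses a one-parameter linear model and made-up data to show that averaging micro-batch gradients reproduces the full-batch gradient. This holds when the loss is a mean over samples and the micro-batches have equal size; layers whose behavior depends on the batch, such as Batch Norm, are not covered by this argument.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data for illustration: 8 samples, split into 4 micro-batches of 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 14.2, 15.9]
w = 0.5
micro_batch_size = 2
num_micro_batches = len(xs) // micro_batch_size

# Gradient computed on the whole batch at once.
full_batch_grad = mse_gradient(w, xs, ys)

# Gradient accumulated over micro-batches, each scaled by 1 / num_micro_batches.
accumulated_grad = 0.0
for i in range(0, len(xs), micro_batch_size):
    micro_grad = mse_gradient(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
    accumulated_grad += micro_grad / num_micro_batches
```

The two gradients agree up to floating-point rounding, so a single optimizer step after accumulation behaves like a step on the larger batch while only one micro-batch is held in memory at a time.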

What to Learn Next

-> Optimizers for Deep Learning: Master different optimization algorithms.

-> Loss Functions: Choose the right loss for your task.

-> Regularization: Prevent overfitting in deep models.

-> Weight Initialization: Start training with proper initialization.

-> Neural Networks: Understand the models you're training.

-> Transformers: Learn the architecture most affected by training choices.
