Deep Learning
Training Deep Networks — From Gradient Descent to Modern Optimizers
Master the techniques and best practices for training deep neural networks effectively.
- Optimization algorithms — SGD, Adam, and beyond
- Learning rate schedules — adaptive and warm-up strategies
- Regularization — dropout, weight decay, and early stopping
Training Deep Networks — Complete Guide
Training deep networks requires careful choice of optimizer, learning rate, regularization, and debugging techniques. This guide covers the essential practical knowledge for training models that converge well and generalize.
The Training Loop
How the training loop works: the diagram shows the six-step process that repeats for every batch of data during neural network training.
- Step 1, forward pass: the input data flows through the network to produce predictions.
- Step 2, compute loss: the loss function (e.g., cross-entropy) measures how far the predictions are from the true labels.
- Step 3, zero gradients: clear the gradients left over from the previous iteration (PyTorch accumulates gradients by default).
- Step 4, backward pass: backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule through the computational graph.
- Step 5, clip gradients: prevent exploding gradients by capping their norm (critical for RNNs and Transformers).
- Step 6, update weights: the optimizer adjusts the parameters using the computed gradients.
The loss curve inset shows the goal: loss should decrease over time as the model learns. The dashed arrow back to Step 1 shows that the loop repeats for every batch, which can mean thousands of iterations per epoch.
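The six steps map closely onto a PyTorch training loop. The following is a minimal sketch rather than a reference implementation: the toy dataset, the small MLP, the `train_one_epoch` helper, and the hyperparameter values are all assumptions made for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # make the toy run repeatable

# Hypothetical toy data: 256 samples, 10 features, 3 classes.
# Labels come from a random linear "teacher" so there is something to learn.
inputs = torch.randn(256, 10)
true_weights = torch.randn(10, 3)
labels = (inputs @ true_weights).argmax(dim=1)
loader = DataLoader(TensorDataset(inputs, labels), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=1e-2)


def train_one_epoch(model, loader, loss_fn, optimizer, max_norm=1.0):
    """Run the six steps once per batch and return the mean batch loss."""
    model.train()
    total = 0.0
    for x, y in loader:
        logits = model(x)              # Step 1: forward pass
        loss = loss_fn(logits, y)      # Step 2: compute loss
        optimizer.zero_grad()          # Step 3: clear old gradients
        loss.backward()                # Step 4: backward pass
        # Step 5: cap the global gradient norm at max_norm
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()               # Step 6: update weights
        total += loss.item()
    return total / len(loader)


epoch_losses = [train_one_epoch(model, loader, loss_fn, optimizer) for _ in range(20)]
print(f"first epoch: {epoch_losses[0]:.3f}, last epoch: {epoch_losses[-1]:.3f}")
```

Zeroing the gradients anywhere before `loss.backward()` works; many codebases call `optimizer.zero_grad()` at the top of the loop instead.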
Optimizers
Learning Rate Schedules
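Cosine Annealing (default for Transformers): decay the learning rate from $\eta_{\max}$ to $\eta_{\min}$ over $T$ steps along a half cosine:
$$\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$
Linear Warmup + Decay: Start at $\eta_{\text{start}}$, linearly increase to $\eta_{\max}$ over $T_{\text{warmup}}$ steps, then decay. During warmup:
$$\eta_t = \eta_{\text{start}} + (\eta_{\max} - \eta_{\text{start}})\,\frac{t}{T_{\text{warmup}}}, \qquad 0 \le t \le T_{\text{warmup}}$$
Step Decay: Reduce LR by factor $\gamma$ every $k$ epochs, starting from $\eta_0$:
$$\eta_e = \eta_0\,\gamma^{\lfloor e / k \rfloor}$$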
Warmup is critical: it prevents early instability while gradients are still noisy. A typical warmup length is 1-10% of total steps.
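The three schedules can be written as plain functions of the step or epoch index. This is a sketch in their usual textbook forms, not necessarily the exact parameterization this guide had in mind: the function names, the argument names, and the default values (`gamma=0.1`, `every=30`, a zero starting rate for warmup) are assumptions.

```python
import math


def cosine_annealing(step, total_steps, lr_max, lr_min=0.0):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def linear_warmup(step, warmup_steps, lr_max, lr_start=0.0):
    """Linear ramp from lr_start to lr_max over warmup_steps, then constant."""
    if step >= warmup_steps:
        return lr_max
    return lr_start + (lr_max - lr_start) * step / warmup_steps


def step_decay(epoch, lr_initial, gamma=0.1, every=30):
    """Multiply the learning rate by gamma once every `every` epochs."""
    return lr_initial * gamma ** (epoch // every)


def warmup_then_cosine(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup followed by cosine annealing over the remaining steps."""
    if step < warmup_steps:
        return linear_warmup(step, warmup_steps, lr_max)
    return cosine_annealing(step - warmup_steps, total_steps - warmup_steps, lr_max, lr_min)


# Example: 1000 total steps with 5% warmup, printed at a few checkpoints.
for s in (0, 25, 50, 525, 1000):
    print(s, warmup_then_cosine(s, total_steps=1000, warmup_steps=50, lr_max=1e-3))
```

Keeping the schedule a pure function of the step makes it easy to plot or unit-test before wiring it into an optimizer.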
Regularization
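Of the regularization techniques this guide names (dropout, weight decay, and early stopping), early stopping is the easiest to show in isolation. The sketch below is one common formulation, with assumptions labeled: the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation losses are all hypothetical.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Hypothetical validation losses: improving, then rising as the model overfits.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.63, 0.66]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In practice the weights from the best epoch are usually checkpointed and restored, since the model at the stopping epoch is already past its best validation loss.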
Debugging Training
Mixed Precision Training
Mixed Precision (FP16/BF16)
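Modern GPUs support FP16 and BF16 arithmetic with peak throughput roughly 2-8x that of FP32, depending on the hardware:
- FP16: 16-bit floating point; faster, but its narrow range (largest finite value 65504) makes it prone to overflow and to gradient underflow
- BF16: Brain Float 16; same exponent range as FP32 with less precision, which makes it the safer choice
- Loss scaling: Multiply the loss by a large factor to prevent underflow in FP16 gradients, then divide the gradients by the same factor before the weight update
Benefits: up to 2x memory savings on activations, typically a 2-3x end-to-end speedup, and room for larger batch sizes.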
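The underflow problem and the loss-scaling fix can be reproduced with the Python standard library, whose `struct` format `"e"` is IEEE 754 half precision. This is an illustrative sketch only: the gradient value, the scale factor of 2**16, and the `to_fp16` helper are assumptions chosen to make the effect visible.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


true_grad = 1e-8        # hypothetical small gradient value
loss_scale = 65536.0    # 2**16, an illustrative scale factor

# Without scaling, the gradient lies below FP16's smallest subnormal
# (about 6e-8) and rounds to zero when stored in half precision.
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# which moves the value into FP16's representable range.
scaled_fp16 = to_fp16(true_grad * loss_scale)

# The gradients are divided by the scale in higher precision before the update.
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16, scaled_fp16, recovered)
```

The recovered value matches the true gradient up to FP16 rounding error, whereas the unscaled value is lost entirely.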
Automatic Mixed Precision (AMP) in PyTorch: the torch.autocast(device_type="cuda") context manager (torch.cuda.amp.autocast() in older code) chooses the precision of each operation automatically.
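A minimal sketch of one autocast training step follows. It uses the device-agnostic `torch.autocast` context manager with BF16 so that it also runs on a machine without a GPU; the tiny linear model, the random data, and the CPU fallback are assumptions for illustration. With FP16 on a GPU, a gradient scaler would normally be added for loss scaling, which this sketch omits.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Assumption: fall back to CPU so the sketch runs without a GPU.
device_type = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(16, 4).to(device_type)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 16, device=device_type)
target = torch.randn(8, 4, device=device_type)

# Inside the context, eligible ops such as the linear layer run in BF16,
# while the parameters themselves stay in FP32.
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    out = model(x)
    loss = nn.functional.mse_loss(out.float(), target)

# The backward pass and optimizer step run outside the autocast context.
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(out.dtype, model.weight.dtype, model.weight.grad.dtype)
```

Only the forward pass and loss computation are wrapped; the gradients arrive in the parameters' own FP32 dtype.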
Key Takeaways
Summary: Training Deep Networks
- AdamW is the default optimizer for Transformers; SGD with momentum is a common choice for CNNs
- Learning rate is the most important hyperparameter — tune first
- Cosine annealing with warmup is the standard schedule
- Dropout (0.1-0.5) and weight decay (1e-4 to 1e-2) help prevent overfitting
- Layer Norm for Transformers, Batch Norm for CNNs
- Gradient clipping (max_norm=1.0) prevents exploding gradients
- Mixed precision (BF16/FP16) saves memory and speeds up training
- Gradient accumulation simulates larger batch sizes
- Early stopping is one of the simplest and most effective forms of regularization
- Debug by monitoring: loss curves, gradient norms, activation statistics
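The gradient accumulation takeaway rests on a small piece of arithmetic: for a mean-reduced loss and equal-sized micro-batches, averaging the micro-batch gradients gives the full-batch gradient. The sketch below checks this on a hypothetical one-parameter model; the data, the `mse_grad` helper, and the learning rate are assumptions for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical data: 8 examples, split into 4 micro-batches of 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5
micro_batch_size = 2
accum_steps = len(xs) // micro_batch_size

# Reference: one gradient over the full batch of 8.
full_batch_grad = mse_grad(w, xs, ys)

# Accumulation: sum the micro-batch gradients, each scaled by 1 / accum_steps,
# and only then take a single optimizer step.
accumulated_grad = 0.0
for i in range(accum_steps):
    lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
    accumulated_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps

lr = 0.01
w_new = w - lr * accumulated_grad  # one update for the whole effective batch
print(full_batch_grad, accumulated_grad)
```

The equivalence is exact only when nothing in the model depends on batch composition; layers such as Batch Norm compute statistics per micro-batch, so accumulation is then an approximation.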
What to Learn Next
-> Optimizers for Deep Learning: Master different optimization algorithms.
-> Loss Functions: Choose the right loss for your task.
-> Regularization: Prevent overfitting in deep models.
-> Weight Initialization: Start training with proper initialization.
-> Neural Networks: Understand the models you're training.
-> Transformers: Learn the architecture most affected by training choices.