
Training Deep Networks — Optimization, Regularization and Best Practices


Training Deep Networks — From Gradient Descent to Modern Optimizers

Master the techniques and best practices for training deep neural networks effectively.

  • Optimization algorithms — SGD, Adam, and beyond
  • Learning rate schedules — adaptive and warm-up strategies
  • Regularization — dropout, weight decay, and early stopping


Training Deep Networks — Complete Guide

Training deep networks requires careful choice of optimizer, learning rate, regularization, and debugging techniques. This guide covers the essential practical knowledge for training models that converge well and generalize.


The Training Loop

Figure: Training Loop. For each epoch in range(num_epochs) and each batch in dataloader: (1) Forward Pass, y_pred = model(x), compute predictions; (2) Compute Loss, L = loss(y_pred, y), compare to ground truth; (3) Zero Grads, opt.zero_grad(), clear old gradients; (4) Backward Pass, L.backward(), compute gradients via autograd; (5) Clip Gradients, clip_grad_norm_(); (6) Update Weights, opt.step(). Inset: loss curve, which should decrease over time.

How the training loop works: The diagram shows the six-step process that repeats for every batch of data during neural network training.

  • Step 1: Forward pass. The input data flows through the network to produce predictions.
  • Step 2: Compute loss. The loss function (e.g., cross-entropy) measures how far the predictions are from the true labels.
  • Step 3: Zero gradients. Clear the gradients left over from the previous iteration, because PyTorch accumulates gradients by default.
  • Step 4: Backward pass. Backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule through the computational graph.
  • Step 5: Clip gradients. Cap the gradient norm to prevent exploding gradients (critical for RNNs and Transformers).
  • Step 6: Update weights. The optimizer adjusts the parameters using the computed gradients.

The loss curve inset shows the goal: loss should decrease over time as the model learns. The dashed arrow back to Step 1 shows that the loop repeats for every batch, which on a large dataset can mean thousands of iterations per epoch.
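The six steps can be traced in a framework-free sketch. The example below is an illustration only, not the lesson's own code: it fits a one-variable linear model with hand-derived gradients, and the function name `train_linear`, the toy data, and all hyperparameter values are assumptions chosen for the demonstration. Comments name the PyTorch call from the diagram that each step corresponds to.

```python
import math
import random

def train_linear(data, num_epochs=200, lr=0.1, max_norm=1.0, batch_size=4, seed=0):
    """Fit y = w*x + b with mini-batch gradient descent (hypothetical helper)."""
    rng = random.Random(seed)
    data = list(data)                 # copy so the caller's list is not reordered
    w, b = 0.0, 0.0                   # model parameters
    epoch_losses = []
    for epoch in range(num_epochs):
        rng.shuffle(data)
        total = 0.0
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            n = len(batch)
            # 1. Forward pass: y_pred = model(x)
            preds = [w * x + b for x, _ in batch]
            # 2. Compute loss: mean squared error against the targets
            loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / n
            # 3. Zero grads: opt.zero_grad()
            grad_w, grad_b = 0.0, 0.0
            # 4. Backward pass: L.backward(), derived by hand for this model
            for p, (x, y) in zip(preds, batch):
                grad_w += 2 * (p - y) * x / n
                grad_b += 2 * (p - y) / n
            # 5. Clip gradients: rescale so the global norm is at most max_norm
            norm = math.sqrt(grad_w ** 2 + grad_b ** 2)
            scale = min(1.0, max_norm / (norm + 1e-12))
            grad_w *= scale
            grad_b *= scale
            # 6. Update weights: opt.step(), here plain gradient descent
            w -= lr * grad_w
            b -= lr * grad_b
            total += loss * n
        epoch_losses.append(total / len(data))
    return w, b, epoch_losses

# Toy data drawn from y = 2x + 1 with no noise (an assumption for the demo).
data = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(-10, 11)]
w, b, losses = train_linear(data)
print(f"w={w:.3f} b={b:.3f} first loss={losses[0]:.3f} last loss={losses[-1]:.6f}")
```

The per-epoch losses play the role of the loss curve in the diagram: the last entry should be far below the first.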


Optimizers

Figure: Optimizers Comparison. In the update rules below, m̂ and v̂ denote the bias-corrected estimates of m and v.

SGD + Momentum
  v = βv + ∇L
  w = w - αv
  • Best generalization
  • Slow convergence
  • Requires tuning LR

Adam
  m = β₁m + (1-β₁)∇L
  v = β₂v + (1-β₂)(∇L)²
  w = w - α·m̂/(√v̂ + ε)
  • Fast convergence
  • Adaptive per-param LR
  • May generalize worse

AdamW ★
  m = β₁m + (1-β₁)∇L
  v = β₂v + (1-β₂)(∇L)²
  w = w - α(m̂/(√v̂ + ε) + λw)
  • Decoupled weight decay
  • Better regularization
  • Current default choice

Lion
  u = sign(m·β₁ + ...)
  w = w - α(u + λw)
  • Only sign updates
  • Memory efficient
  • Google Brain 2023

Optimizer Convergence Behavior
  • SGD + Momentum: smooth, may get stuck
  • Adam: fast, may oscillate
  • AdamW: fast + better generalization

Speed vs Quality
  • SGD: slowest to converge ★★☆
  • Adam: faster to converge ★★★
  • AdamW: fast + best regularization ★★★
  • Generalization (commonly reported ordering): SGD > AdamW > Adam
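To make the update rules concrete, here is a minimal scalar sketch of one SGD-with-momentum step and one Adam/AdamW step. It assumes the standard Adam formulation, in which the second moment is an exponential average of squared gradients and the step divides by √v̂ + ε. The function names and default values are illustrative assumptions, not a library API.

```python
import math

def sgd_momentum_step(w, grad, v, lr=0.1, beta=0.9):
    """One step of v = beta*v + grad, w = w - lr*v."""
    v = beta * v + grad
    return w - lr * v, v

def adamw_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One Adam step; a nonzero weight_decay adds the decoupled AdamW term."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)                # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    # Decay is applied to the weight directly, outside the adaptive scaling.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w_sgd, v_sgd = sgd_momentum_step(w=1.0, grad=4.0, v=0.0)
w_adam, _, _ = adamw_step(w=1.0, grad=4.0, m=0.0, v=0.0, t=1)
w_adamw, _, _ = adamw_step(w=1.0, grad=4.0, m=0.0, v=0.0, t=1, weight_decay=0.01)
print(w_sgd, w_adam, w_adamw)
```

On the first step the bias-corrected ratio m̂/√v̂ is close to the sign of the gradient, so Adam moves the weight by roughly the learning rate regardless of the gradient's magnitude, while the SGD step scales with the gradient.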

Learning Rate Schedules

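Figure: Learning Rate Schedules. Learning rate (y-axis) against epochs (x-axis) for the cosine, warmup + decay, step, and constant schedules, with the warmup and cosine annealing phases labeled.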

Definition: Learning Rate Schedules

Cosine Annealing (a common default for Transformers):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$
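The cosine formula translates directly into a small function. This is a sketch with assumed names (`cosine_lr`, `eta_max`, `eta_min`), where `t` is the current step and `T` the total number of steps.

```python
import math

def cosine_lr(t, T, eta_max, eta_min=0.0):
    """Cosine annealing from eta_max at t=0 down to eta_min at t=T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

# Sample the schedule at the start, midpoint, and end of a 100-step run.
for step in (0, 50, 100):
    print(step, round(cosine_lr(step, T=100, eta_max=0.1, eta_min=0.001), 6))
```

The rate starts at `eta_max`, passes through the midpoint of the two bounds halfway through, and ends at `eta_min`.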

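Linear Warmup + Decay: Increase the learning rate linearly to $\eta_{\max}$ over $T_{\text{warmup}}$ steps, then decay it linearly to zero at step $T$. The formula takes the starting value $\eta_{\min} = 0$:

$$\eta_t = \begin{cases} \eta_{\max} \cdot \dfrac{t}{T_{\text{warmup}}} & t \leq T_{\text{warmup}} \\ \eta_{\max} \cdot \dfrac{T - t}{T - T_{\text{warmup}}} & t > T_{\text{warmup}} \end{cases}$$

Dividing by $T - T_{\text{warmup}}$ in the decay branch makes the two branches meet at $\eta_{\max}$ when $t = T_{\text{warmup}}$.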
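A warmup-then-decay schedule can be sketched as a piecewise function. This example makes two assumptions that should be read as choices of the sketch: the starting rate is 0, and the decay branch is measured from the end of warmup as (T - t)/(T - T_warmup), so that the schedule is continuous at t = T_warmup. The function name is hypothetical.

```python
def warmup_linear_lr(t, T, T_warmup, eta_max):
    """Linear warmup to eta_max over T_warmup steps, then linear decay to 0 at T."""
    if not 0 < T_warmup < T:
        raise ValueError("expected 0 < T_warmup < T")
    if t <= T_warmup:
        return eta_max * t / T_warmup
    return eta_max * (T - t) / (T - T_warmup)

# 10% warmup on a 100-step run: rises for 10 steps, then falls for 90.
for step in (0, 5, 10, 55, 100):
    print(step, round(warmup_linear_lr(step, T=100, T_warmup=10, eta_max=0.1), 6))
```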

Step Decay: Reduce LR by factor $\gamma$ every $k$ epochs: $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor}$
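Step decay is the simplest of the three to write down. In this sketch (names are assumptions), integer division implements the floor in the exponent.

```python
def step_lr(t, eta0, gamma, k):
    """Multiply the base rate eta0 by gamma once every k epochs."""
    return eta0 * gamma ** (t // k)

# Halve the rate every 30 epochs.
for epoch in (0, 29, 30, 60):
    print(epoch, step_lr(epoch, eta0=0.1, gamma=0.5, k=30))
```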

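Warmup is critical: It prevents instability early in training, when gradients are noisy and an adaptive optimizer's running statistics are still unreliable. A warmup of 1-10% of total steps is a common range.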


Regularization

Figure: Regularization Techniques.

Dropout (the figure shows a network before and after dropout with p=0.3; dropped units are marked X)
  • p = 0.1-0.5
  • Surviving activations are scaled by 1/(1-p) during training (inverted dropout), so no rescaling is needed at inference

Batch Normalization
  ŷ = (x - μ_B) / √(σ²_B + ε)
  out = γ·ŷ + β
  • Normalize per batch per channel
  • Learnable γ, β parameters
  • Faster training, higher LR
  • Rarely used in RNNs/Transformers

Layer Normalization
  ŷ = (x - μ) / √(σ² + ε)
  out = γ·ŷ + β
  • Normalize per sample (not batch)
  • Batch-size independent
  • Standard for Transformers
  • Pre-LN is usually easier to train than Post-LN

Weight Decay (L2)
  L_total = L_task + λ/2 · ||W||²
  • Penalizes large weights
  • λ = 0.0001-0.01 typical
  • AdamW applies the decay separately from the adaptive gradient update
  • Essential for Transformer training

Gradient Clipping
  g ← g · min(1, threshold / ||g||)
  • Prevents exploding gradients
  • threshold = 1.0 typical
  • clip_grad_norm_() in PyTorch
  • Essential for RNNs + Transformers

Early Stopping
  Stop when val_loss stops improving
  • Monitor validation metric
  • Patience: 3-10 epochs
  • Simplest regularization
  • Always use this
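Three of these techniques are short enough to sketch without a framework. The example assumes the inverted-dropout convention used by PyTorch's nn.Dropout, in which survivors are rescaled by 1/(1-p) during training and inference is the identity, and it assumes clipping rescales the whole gradient vector by min(1, threshold/||g||). All function and class names are hypothetical.

```python
import math
import random

def inverted_dropout(values, p, training, rng):
    """Zero each value with probability p; rescale survivors by 1/(1-p) in training."""
    if not training or p == 0.0:
        return list(values)               # inference: identity, no rescaling
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]

class EarlyStopping:
    """Signal a stop after `patience` epochs without a new best validation loss."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

rng = random.Random(0)
dropped = inverted_dropout([1.0] * 1000, p=0.5, training=True, rng=rng)
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
stopper = EarlyStopping(patience=3)
decisions = [stopper.should_stop(v) for v in (1.0, 0.8, 0.9, 0.85, 0.95)]
print(sum(dropped) / len(dropped), clipped, decisions)
```

Because survivors are scaled up, the mean activation stays close to its original value during training, which is why inference needs no correction.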

Debugging Training

Figure: Debugging Training Issues.

Loss Not Decreasing (checklist):
  □ LR too high or too low
  □ Bug in data pipeline
  □ Wrong loss function
  □ Check that labels still line up with their inputs (not shuffled independently)

Overfitting (solutions):
  □ More data / augmentation
  □ Add dropout / weight decay
  □ Early stopping
  □ Simplify model

Underfitting (solutions):
  □ Larger model
  □ More epochs
  □ Lower regularization
  □ Better features / architecture

How to Read Training Curves:
  • Good: Train ↓, Val ↓, gap small
  • Overfitting: Train ↓, Val ↑ (gap grows)
  • Unstable: Loss oscillates → lower LR
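The curve-reading rules can be turned into a rough automated check. The heuristic below is an assumption-laden sketch, not an established diagnostic: the function name, the four-epoch window, and the tolerance are all hypothetical choices, and real curves are usually noisier than these rules allow for.

```python
def diagnose_curves(train_losses, val_losses, window=4, tol=1e-3):
    """Label the most recent `window` epochs using the three patterns above."""
    if window < 3:
        raise ValueError("window must be at least 3 to detect oscillation")
    if len(train_losses) < window or len(val_losses) < window:
        return "inconclusive"
    train = train_losses[-window:]
    val = val_losses[-window:]
    # Unstable: every consecutive change in training loss flips direction.
    diffs = [b - a for a, b in zip(train, train[1:])]
    flips = sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)
    if flips == len(diffs) - 1 and all(abs(d) > tol for d in diffs):
        return "unstable"
    train_down = train[-1] < train[0] - tol
    val_down = val[-1] < val[0] - tol
    val_up = val[-1] > val[0] + tol
    if train_down and val_up:
        return "overfitting"      # train falls while validation rises
    if train_down and val_down:
        return "good"             # both fall together
    return "inconclusive"

print(diagnose_curves([1.0, 0.8, 0.6, 0.5], [1.1, 0.9, 0.75, 0.7]))
print(diagnose_curves([0.5, 0.4, 0.3, 0.2], [0.6, 0.65, 0.7, 0.8]))
print(diagnose_curves([1.0, 1.5, 0.9, 1.6], [1.0, 1.4, 1.0, 1.5]))
```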

Mixed Precision Training

Definition: Mixed Precision (FP16/BF16)

Modern GPUs support FP16 and BF16 arithmetic that can run 2-8x faster than FP32, depending on the hardware and the operation:

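  • FP16: 16-bit floating point with a narrow exponent range; faster, but prone to overflow and to underflow of small gradients
  • BF16: Brain Float 16; same exponent range as FP32 but fewer mantissa bits, so less precise yet safer against overflow
  • Loss scaling: Multiply the loss by a large factor before the backward pass so that small FP16 gradients do not underflow, then divide the gradients by the same factor before the update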

Benefits: up to 2x memory savings on tensors stored in 16 bits, roughly 2-3x speedup on supporting hardware, and room for larger batch sizes.

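Automatic Mixed Precision (AMP) in PyTorch: torch.autocast(device_type="cuda"), the current form of torch.cuda.amp.autocast(), chooses the precision for each operation automatically. With FP16, pair it with a gradient scaler such as torch.cuda.amp.GradScaler to handle loss scaling.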
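The reason FP16 needs loss scaling can be seen with Python's standard library alone, which can round a value to IEEE 754 half precision through the `struct` module's "e" format. The sketch below is an illustration of the number format, not of PyTorch's AMP internals; the gradient value and the scale factor 2**16 are assumptions picked for the demonstration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def overflows_fp16(x):
    """True if x is too large for FP16 (struct raises; hardware would give inf)."""
    try:
        struct.pack("<e", x)
        return False
    except OverflowError:
        return True

grad = 1e-8                        # a small gradient value (assumed for the demo)
scale = 65536.0                    # loss scale of 2**16 (assumed for the demo)

unscaled = to_fp16(grad)           # too small for FP16: flushed to zero
scaled = to_fp16(grad * scale)     # scaled value is representable in FP16
recovered = scaled / scale         # unscale in higher precision before the update

print(unscaled, scaled, recovered)
print(overflows_fp16(70000.0), overflows_fp16(60000.0))
```

Without scaling, the gradient is lost entirely; with scaling, it survives the round trip with a small relative error. The overflow check shows the other FP16 hazard: values above about 65504 cannot be represented.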


Key Takeaways

Summary: Training Deep Networks

  • AdamW is the usual default optimizer for Transformers; SGD with momentum remains a common choice for CNNs
  • Learning rate is the most important hyperparameter — tune first
  • Cosine annealing with warmup is the standard schedule
  • Dropout (0.1-0.5) and weight decay (1e-4 to 0.01) prevent overfitting
  • Layer Norm for Transformers, Batch Norm for CNNs
  • Gradient clipping (max_norm=1.0) prevents exploding gradients
  • Mixed precision (BF16/FP16) saves memory and speeds up training
  • Gradient accumulation simulates larger batch sizes
  • Early stopping is the simplest regularization technique and is almost always worth using
  • Debug by monitoring: loss curves, gradient norms, activation statistics
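The gradient accumulation point in the summary can be checked numerically. The sketch below assumes a mean-reduced loss and equal-size micro-batches, in which case averaging the micro-batch gradients reproduces the full-batch gradient. The model, data, and function names are hypothetical.

```python
def mse_grad(w, batch):
    """Gradient with respect to w of the mean squared error for y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, accum_steps):
    """Accumulate gradients over equal-size micro-batches before one update."""
    if len(batch) % accum_steps != 0:
        raise ValueError("batch size must be divisible by accum_steps")
    size = len(batch) // accum_steps
    total = 0.0
    for i in range(accum_steps):
        micro = batch[i * size:(i + 1) * size]
        # Dividing by accum_steps mirrors calling (loss / accum_steps).backward()
        total += mse_grad(w, micro) / accum_steps
    return total

batch = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]   # 8 toy examples
full = mse_grad(0.5, batch)
accumulated = accumulated_grad(0.5, batch, accum_steps=4)
print(full, accumulated)
```

Four micro-batches of two examples give the same gradient as one batch of eight, so only the optimizer step needs to wait until all micro-batches have been processed.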

What to Learn Next

-> Optimizers for Deep Learning: Master different optimization algorithms.

-> Loss Functions: Choose the right loss for your task.

-> Regularization: Prevent overfitting in deep models.

-> Weight Initialization: Start training with proper initialization.

-> Neural Networks: Understand the models you're training.

-> Transformers: Learn the architecture most affected by training choices.
