Regularization in Deep Learning — Dropout, Batch Norm, and More
Deep networks have far more parameters than training samples, making them prone to overfitting. Regularization techniques constrain the model to improve generalization to unseen data.
- Dropout as Ensemble: Randomly zeroing neurons can be viewed as training an ensemble of weight-sharing subnetworks
- BatchNorm Stabilizes: Normalizes activations per mini-batch, which permits much higher learning rates (sometimes quoted as 10-100x, depending on the architecture)
- Combine Techniques: Different regularization methods are complementary; the key is avoiding over-regularization
The Overfitting Problem
Definition: Overfitting
Overfitting occurs when a model learns the training data too well, including noise, and fails to generalize to unseen data. Deep networks are particularly susceptible because:
- Large capacity: Millions of parameters can memorize training data
- Expressivity: Deep networks can fit random labels (Zhang et al., 2017)
- Limited data: Real-world datasets are often small relative to model size
Regularization techniques explicitly or implicitly constrain the model to reduce overfitting.
Dropout
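Definition: Dropout
During training, dropout randomly zeros each neuron with probability $p$:
$$\tilde{h}_j = m_j \, h_j, \qquad m_j \sim \mathrm{Bernoulli}(1 - p)$$
During inference, no dropout is applied, but activations are scaled by $1 - p$ to compensate. PyTorch's nn.Dropout implements inverted dropout, which instead scales the kept activations by $\frac{1}{1 - p}$ during training, so inference needs no rescaling.
Inverted Dropout
$$\tilde{h}_j = \frac{m_j}{1 - p} \, h_j$$
Here,
- $h_j$ = Neuron $j$ activation
- $m_j$ = Dropout mask (0 with prob $p$, 1 with prob $1 - p$)
- $p$ = Dropout probability (typically 0.1-0.5)
- $\frac{1}{1 - p}$ = Scaling factor for inverted dropout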
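The inverted dropout rule can be sketched in a few lines of plain Python. This is a minimal illustration, not PyTorch's implementation; the function name `inverted_dropout` and its `rng` parameter are assumptions made for the example.

```python
import random

def inverted_dropout(activations, p, training=True, rng=random):
    """Apply inverted dropout to a list of activations (illustrative helper).

    p is the probability of zeroing each unit; kept units are scaled by 1 / (1 - p).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Inference: no mask and no rescaling.
        return list(activations)
    scale = 1.0 / (1.0 - p)
    # rng.random() >= p happens with probability 1 - p, so the unit is kept.
    return [h * scale if rng.random() >= p else 0.0 for h in activations]
```

Because each kept activation is scaled up during training, the expected value of every unit matches its unmasked value, which is why the inference path can return the activations unchanged.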
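Theorem: Dropout as Ensemble
A neural network trained with dropout can be viewed as an ensemble of up to $2^n$ subnetworks (where $n$ is the number of neurons that can be dropped), all sharing weights (Srivastava et al., 2014). At test time, using the full network approximates the geometric mean of the subnetwork predictions; this is exact for a single softmax layer and only approximate for deep networks. Separately, Gal and Ghahramani (2016) showed that dropout training can be interpreted as approximate Bayesian inference, which gives dropout an implicit Bayesian interpretation.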
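The ensemble view can be checked by brute force in the simplest possible case: a single linear unit with dropout on its inputs. The sketch below enumerates every dropout mask, weights each masked output by its probability, and sums. The helper name `expected_dropout_output` is hypothetical, and the exact match it demonstrates holds for this linear case only; for deep nonlinear networks the full network is an approximation to the ensemble.

```python
from itertools import product

def expected_dropout_output(weights, inputs, p):
    """Average a linear unit's output over all 2^n inverted-dropout masks."""
    n = len(inputs)
    total = 0.0
    for mask in product([0, 1], repeat=n):
        kept = sum(mask)
        # Probability of this particular mask: each unit is kept with prob 1 - p.
        prob = (1.0 - p) ** kept * p ** (n - kept)
        # Output of the subnetwork selected by this mask, with inverted scaling.
        out = sum(w * x * m / (1.0 - p) for w, x, m in zip(weights, inputs, mask))
        total += prob * out
    return total

weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 3.0]
full_output = sum(w * x for w, x in zip(weights, inputs))
ensemble_output = expected_dropout_output(weights, inputs, p=0.3)
```

Here `ensemble_output` agrees with `full_output` up to floating-point error, because each scaled mask entry has expected value 1.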
Dropout Best Practices
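- Hidden layers: rates toward the upper end of the typical 0.1-0.5 range ($p = 0.5$ is common for fully connected layers)
- Convolutional layers: lower rates than for fully connected layers (to preserve spatial structure)
- Attention layers: around $p = 0.1$ (common in transformers)
- No dropout directly before softmax (randomly zeroing logits distorts the predicted probabilities)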
Batch Normalization
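Definition: Batch Normalization
BatchNorm normalizes activations across the batch for each feature:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$
where $\mu_B$ and $\sigma_B^2$ are batch statistics (the mean and variance of the feature over the mini-batch), $\gamma$ and $\beta$ are learnable parameters (scale and shift), and $\epsilon$ is a small constant for numerical stability.
During inference, running averages of the mean and variance collected during training are used instead of batch statistics.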
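The difference between training mode and inference mode is easiest to see in code. The following is a simplified sketch in plain Python, not a library implementation: the class name `BatchNormSketch`, the `momentum` default of 0.1, and the use of the biased batch variance for the running average are all assumptions chosen for brevity.

```python
import math

class BatchNormSketch:
    """Per-feature batch normalization over a batch given as a list of rows."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.gamma = [1.0] * num_features        # learnable scale
        self.beta = [0.0] * num_features         # learnable shift
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features

    def forward(self, batch, training=True):
        n = len(batch)
        out = [[0.0] * len(self.gamma) for _ in range(n)]
        for j in range(len(self.gamma)):
            if training:
                column = [row[j] for row in batch]
                mean = sum(column) / n
                var = sum((x - mean) ** 2 for x in column) / n
                # Update running averages for later use at inference time.
                m = self.momentum
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mean
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var
            else:
                # Inference: use stored statistics, not the current batch.
                mean, var = self.running_mean[j], self.running_var[j]
            inv_std = 1.0 / math.sqrt(var + self.eps)
            for i in range(n):
                out[i][j] = self.gamma[j] * (batch[i][j] - mean) * inv_std + self.beta[j]
        return out
```

In training mode the output for a feature depends on the other samples in the batch, which is the source of the noise that gives BatchNorm its mild regularizing effect; in inference mode each sample is normalized independently with fixed statistics.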
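BatchNorm Benefits
- Faster training: Permits much higher learning rates (sometimes quoted as 10-100x, depending on the architecture)
- Reduced initialization sensitivity: Networks train well with a wider range of initializations
- Smoother optimization landscape: Originally motivated as reducing internal covariate shift; later analyses attribute much of the benefit to a smoother loss landscape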
- Mild regularization: Batch statistics add noise during training
Layer Normalization
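Definition: Layer Normalization
LayerNorm normalizes across features for each sample (instead of across batch for each feature):
$$\mu = \frac{1}{d} \sum_{k=1}^{d} x_k, \qquad \sigma^2 = \frac{1}{d} \sum_{k=1}^{d} (x_k - \mu)^2, \qquad y_k = \gamma_k \frac{x_k - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_k$$
where $d$ is the number of features of the sample, $\gamma_k$ and $\beta_k$ are learnable per-feature parameters, and $\epsilon$ is a small constant for numerical stability.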
LayerNorm is preferred for transformers and NLP because it doesn't depend on batch size.
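A per-sample normalization needs no batch at all, which a short sketch makes concrete. The function name `layer_norm` and its default arguments are illustrative assumptions, not a library API.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one sample (a list of features) to zero mean and unit variance."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d  # learnable scale
    beta = beta if beta is not None else [0.0] * d     # learnable shift
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

normalized = layer_norm([1.0, 2.0, 3.0])
```

The statistics come from the sample itself, so the result is the same for a batch of one as for a batch of thousands, and training and inference use the same computation.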
Weight Decay
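Definition: Weight Decay (L2 Regularization)
Weight decay adds a penalty for large weights:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \lVert w \rVert_2^2$$
The gradient update becomes:
$$w \leftarrow w - \eta \left( \nabla_w \mathcal{L}_{\text{data}} + \lambda w \right) = (1 - \eta \lambda)\, w - \eta \nabla_w \mathcal{L}_{\text{data}}$$
where $\eta$ is the learning rate and $\lambda$ is the weight decay coefficient. The term $-\eta \lambda w$ decays weights toward zero each step.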
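For plain SGD, adding the L2 penalty to the gradient and shrinking the weights before the gradient step are the same update. The sketch below writes both forms side by side; the function names, and the convention that the penalty is half the coefficient times the squared norm, are assumptions made for the example.

```python
def sgd_step_l2(w, grad, lr, weight_decay):
    """SGD step where the penalty's gradient (weight_decay * w) is added to grad."""
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]

def sgd_step_decay(w, grad, lr, weight_decay):
    """SGD step written as an explicit shrink by (1 - lr * weight_decay)."""
    return [(1.0 - lr * weight_decay) * wi - lr * gi for wi, gi in zip(w, grad)]

w = [1.0, -2.0, 0.5]
grad = [0.1, 0.0, -0.3]
step_a = sgd_step_l2(w, grad, lr=0.1, weight_decay=0.01)
step_b = sgd_step_decay(w, grad, lr=0.1, weight_decay=0.01)
```

The two results agree up to floating-point rounding. Where the data gradient is zero, as for the second weight, the step reduces to multiplying the weight by `1 - lr * weight_decay`.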
Weight Decay vs L2 Regularization
With the Adam optimizer, weight decay and L2 regularization are not equivalent. L2 regularization adds $\lambda w$ to the gradient, so the penalty is rescaled by Adam's adaptive per-parameter learning rates, while decoupled weight decay shrinks the weights directly, independent of that rescaling. AdamW implements decoupled weight decay.
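The distinction shows up even in a single update of one scalar weight. The sketch below is a simplified Adam step written for illustration; `adam_step`, the `state` dictionary, and the `decoupled` flag are hypothetical names, not any library's API.

```python
import math

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam update for a scalar weight; state holds 'm', 'v' and 't'."""
    if weight_decay and not decoupled:
        # L2 regularization: the penalty enters the gradient and is then
        # rescaled by the adaptive denominator below.
        grad = grad + weight_decay * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    if weight_decay and decoupled:
        # Decoupled weight decay (AdamW style): shrink the weight directly.
        w = w - lr * weight_decay * w
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def fresh_state():
    return {"m": 0.0, "v": 0.0, "t": 0}

# Zero data gradient, so only the regularizer moves the weights.
l2_small = adam_step(1.0, 0.0, fresh_state(), lr=0.1, weight_decay=0.1)
l2_large = adam_step(10.0, 0.0, fresh_state(), lr=0.1, weight_decay=0.1)
dec_small = adam_step(1.0, 0.0, fresh_state(), lr=0.1, weight_decay=0.1, decoupled=True)
dec_large = adam_step(10.0, 0.0, fresh_state(), lr=0.1, weight_decay=0.1, decoupled=True)
```

On this first step, the L2 variant moves both weights by roughly the same absolute amount (about `lr`), because Adam normalizes away the size of the penalty gradient. The decoupled variant shrinks each weight by the same factor `1 - lr * weight_decay`, so the larger weight is pulled back proportionally more.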
Data Augmentation
Definition: Data Augmentation
Data augmentation creates transformed versions of training data:
| Augmentation | Description | Use Case |
|---|---|---|
| Random crop | Crop random patch | Classification |
| Horizontal flip | Mirror image | Classification |
| Color jitter | Vary brightness/contrast | Classification |
| Random rotation | Rotate by random angle | When rotation invariance needed |
| Mixup | Blend two images | Regularization |
| CutMix | Cut and paste patches | Regularization |
| RandAugment | Random policy | General purpose |
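Two rows of the table, horizontal flip and Mixup, are simple enough to sketch without an image library. The code below treats an image as a list of pixel rows and a label as a one-hot list; the function names, the `alpha=0.2` default, and the Beta-distributed mixing weight follow common Mixup practice but are assumptions for this example.

```python
import random

def horizontal_flip(image, prob=0.5, rng=random):
    """Mirror a 2D image (list of rows) left to right with probability prob."""
    if rng.random() < prob:
        return [row[::-1] for row in image]
    return [list(row) for row in image]

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two flattened inputs and their one-hot labels with one weight lam."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

flipped = horizontal_flip([[1, 2, 3], [4, 5, 6]], prob=1.0)
mixed_x, mixed_y, lam = mixup([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0],
                              rng=random.Random(0))
```

Mixup blends the labels with the same weight as the inputs, so the mixed label remains a valid probability distribution over classes.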
Early Stopping
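Definition: Early Stopping
Monitor validation loss during training and stop once it has stopped improving, typically after it fails to reach a new minimum for a fixed number of consecutive epochs (the patience).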
Save the model checkpoint with the lowest validation loss. Simple and effective.
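The stopping rule can be sketched as a function over a recorded sequence of validation losses. This is a minimal illustration under assumed conventions: the name `early_stopping_epoch` and the `patience` and `min_delta` parameters are hypothetical, and a real training loop would save a checkpoint at the point marked in the comment.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: this is where the model checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through every recorded epoch.
    return len(val_losses) - 1, best_epoch

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

With these losses and a patience of 3, training stops at epoch 5 and the checkpoint from epoch 2 is kept. The later drop to 0.6 is never seen, which shows the trade-off in choosing the patience value.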
Regularization Summary
- Dropout: Randomly zeros neurons during training, acts as implicit ensemble
- BatchNorm: Normalizes across batch, allows higher learning rates
- LayerNorm: Normalizes across features, preferred for transformers
- Weight Decay: Penalizes large weights; tune the coefficient, and with Adam prefer decoupled decay (AdamW)
- Data Augmentation: Increases effective training set size
- Early Stopping: Simple and effective, always monitor validation loss
- Combine complementary regularization techniques, and watch validation performance to avoid over-regularization