Regularization in Deep Learning — Dropout, Batch Norm, and More

Deep networks often have far more parameters than training samples, which makes them prone to overfitting. Regularization techniques constrain the model so that it generalizes better to unseen data.

Dropout as Ensemble: Randomly zeroing neurons can be viewed as training an ensemble of up to $2^n$ weight-sharing subnetworks
BatchNorm Stabilizes: Normalizing activations per mini-batch often permits much higher learning rates (the 10-100x figure is a rough rule of thumb)
Combine Techniques: Different regularization methods are complementary; the key is avoiding over-regularization

The Overfitting Problem

Definition: Overfitting

Overfitting occurs when a model learns the training data too well, including noise, and fails to generalize to unseen data. Deep networks are particularly susceptible because:

Large capacity: Millions of parameters can memorize training data
Expressivity: Deep networks can fit random labels (Zhang et al., 2017)
Limited data: Real-world datasets are often small relative to model size

Regularization techniques explicitly or implicitly constrain the model to reduce overfitting.

Dropout

Definition: Dropout

During training, dropout randomly zeros each neuron's activation with probability $p$:

$$\tilde{h}_j = r_j \cdot h_j, \quad r_j \sim \text{Bernoulli}(1 - p)$$

In this original formulation, no units are dropped at inference, and activations are scaled by $(1 - p)$ so that their expected value matches training. PyTorch's nn.Dropout implements inverted dropout instead: it scales by $1/(1 - p)$ during training, so inference needs no adjustment.

Inverted Dropout

$$\tilde{h}_j = \frac{r_j \cdot h_j}{1 - p}, \quad r_j \sim \text{Bernoulli}(1 - p)$$

Here,

$h_j$ = activation of neuron $j$
$r_j$ = dropout mask (0 with probability $p$, 1 with probability $1 - p$)
$p$ = dropout probability (typically 0.1-0.5)
$1 - p$ = keep probability; inverted dropout divides the kept activations by it
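The inverted dropout formula can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how nn.Dropout is implemented: the function name `inverted_dropout`, the list-based activations, and the `rng` argument are all hypothetical.

```python
import random

def inverted_dropout(h, p, training=True, rng=random):
    """Apply inverted dropout to a list of activations h.

    p is the drop probability. Kept units are divided by (1 - p) so the
    expected activation is unchanged; at inference the input passes through.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(h)
    keep = 1.0 - p
    # r_j ~ Bernoulli(1 - p): keep the unit when the draw falls below keep.
    return [x / keep if rng.random() < keep else 0.0 for x in h]

rng = random.Random(0)                      # fixed seed for reproducibility
h = [1.0] * 100_000
dropped = inverted_dropout(h, p=0.3, rng=rng)
mean_train = sum(dropped) / len(dropped)    # close to 1.0 in expectation
same_at_inference = inverted_dropout(h, p=0.3, training=False)
```

Because the rescaling happens during training, the inference path is a plain pass-through, which is the practical reason for preferring the inverted form.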

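Theorem: Dropout as Ensemble

A network with $n$ droppable neurons trained with dropout can be viewed as an ensemble of up to $2^n$ subnetworks that share weights (Srivastava et al., 2014). At test time, using the full network with scaled activations approximates the geometric mean of the subnetwork predictions; the approximation is exact (after renormalization) only in special cases such as a single softmax layer. Separately, Gal and Ghahramani (2016) showed that dropout training can be interpreted as approximate Bayesian inference, which gives dropout an implicit Bayesian interpretation.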
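To make the ensemble view concrete, the sketch below enumerates every dropout mask for a single linear unit and compares the probability-weighted average of all subnetwork outputs with one weight-scaled pass through the full unit. The function names and the example values are hypothetical. For a linear unit the two agree exactly (as an arithmetic mean); for a deep nonlinear network the scaled pass is only an approximation.

```python
from itertools import product

def linear_unit(x, w, mask):
    """Output of one linear unit with a dropout mask applied to its inputs."""
    return sum(r * wi * xi for r, wi, xi in zip(mask, w, x))

def ensemble_average(x, w, p):
    """Exact expectation over all 2^n dropout masks, weighted by mask probability."""
    n = len(x)
    total = 0.0
    for mask in product((0, 1), repeat=n):
        kept = sum(mask)
        prob = (1 - p) ** kept * p ** (n - kept)
        total += prob * linear_unit(x, w, mask)
    return total

x = [0.5, -1.0, 2.0, 0.25]     # hypothetical inputs
w = [1.5, 0.3, -0.7, 2.0]      # hypothetical weights
p = 0.5
avg_over_subnetworks = ensemble_average(x, w, p)            # averages 2^4 = 16 subnetworks
weight_scaled = (1 - p) * linear_unit(x, w, (1, 1, 1, 1))   # one pass through the full unit
```

Enumeration is only feasible for tiny $n$, which is why the single scaled forward pass is used in practice.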

Dropout Best Practices

Hidden layers: $p = 0.1$ to $0.5$ (0.5 is common for fully connected layers)
Convolutional layers: $p = 0.1$ to $0.2$ (lower to preserve spatial structure)
Attention layers: $p = 0.1$ (common in transformers)
No dropout on the final logits before the softmax (it adds noise directly to the predicted class probabilities during training)

Batch Normalization

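Definition: Batch Normalization

BatchNorm normalizes each feature's activations across the mini-batch:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$y_i = \gamma \hat{x}_i + \beta$$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance of that feature over the mini-batch, $\epsilon$ is a small constant for numerical stability, and $\gamma, \beta$ are learnable scale and shift parameters.

During inference, running averages of the mean and variance collected during training are used instead of batch statistics, so the output for a sample does not depend on the rest of the batch.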
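The training and inference behaviour described above can be sketched with plain Python lists. This is an illustrative sketch, not a library implementation: the class name `BatchNormSketch`, the exponential-moving-average update, and the `momentum` value of 0.1 are assumptions.

```python
import math

class BatchNormSketch:
    """Minimal batch norm over a batch of feature vectors (a list of rows)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum                  # assumed averaging rate
        self.gamma = [1.0] * num_features         # learnable scale
        self.beta = [0.0] * num_features          # learnable shift
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features

    def forward(self, batch, training=True):
        n = len(batch)
        out = [[0.0] * len(self.gamma) for _ in range(n)]
        for j in range(len(self.gamma)):
            column = [row[j] for row in batch]
            if training:
                mean = sum(column) / n
                var = sum((v - mean) ** 2 for v in column) / n   # batch variance
                # Track running statistics for use at inference time.
                self.running_mean[j] += self.momentum * (mean - self.running_mean[j])
                self.running_var[j] += self.momentum * (var - self.running_var[j])
            else:
                mean, var = self.running_mean[j], self.running_var[j]
            scale = self.gamma[j] / math.sqrt(var + self.eps)
            for i, v in enumerate(column):
                out[i][j] = scale * (v - mean) + self.beta[j]
        return out

bn = BatchNormSketch(num_features=2)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
train_out = bn.forward(batch, training=True)             # uses batch statistics
eval_out = bn.forward([[2.5, 25.0]], training=False)     # uses running averages
```

Note that the inference call works on a batch of one sample, which would be impossible if batch statistics were still required.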

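Definition: BatchNorm Benefits

Faster training: Often tolerates much higher learning rates (the 10-100x figure is a rough rule of thumb and depends on the architecture)
Reduced initialization sensitivity: Networks train well with a wider range of initializations
Smoother optimization landscape: Originally motivated as reducing internal covariate shift (Ioffe and Szegedy, 2015); later work attributes much of the benefit to a smoother loss landscape (Santurkar et al., 2018)
Mild regularization: Batch statistics add noise during training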

Layer Normalization

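Definition: Layer Normalization

LayerNorm normalizes across the $D$ features of each sample (instead of across the batch for each feature):

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad \mu = \frac{1}{D}\sum_{i=1}^{D} x_i, \quad \sigma^2 = \frac{1}{D}\sum_{i=1}^{D} (x_i - \mu)^2$$

As in BatchNorm, a learnable scale and shift are usually applied afterwards. LayerNorm is preferred for transformers and NLP because its statistics are computed per sample: it behaves identically in training and inference and does not depend on batch size, which also suits small batches and variable-length sequences.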
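A short sketch of the LayerNorm formula shows why batch size does not matter: the statistics come from a single sample. The function name `layer_norm` and the optional `gamma` and `beta` arguments (a learnable scale and shift, not part of the formula above) are assumptions made for illustration.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one sample x (a list of D features) across its features."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d   # assumed optional scale
    beta = beta if beta is not None else [0.0] * d      # assumed optional shift
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]

sample = [2.0, 4.0, 6.0, 8.0]
alone = layer_norm(sample)
# The same sample normalized as part of a larger batch gives an identical
# result, because no statistic is shared across samples.
batch = [sample, [100.0, -50.0, 0.0, 7.0]]
in_batch = [layer_norm(row) for row in batch][0]
```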

Weight Decay

Definition: Weight Decay (L2 Regularization)

Weight decay adds a penalty for large weights:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \frac{\lambda}{2} \|\theta\|_2^2$$

The gradient of the penalty is $\lambda\theta$, so with plain SGD and learning rate $\eta$ the update becomes:

$$\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta \nabla_\theta \mathcal{L}_{\text{task}}$$

The $(1 - \eta\lambda)$ factor shrinks the weights toward zero at each step.
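For plain SGD, the penalized-loss view and the shrink-then-step view are the same update, which a small numerical check makes explicit. The function names and the example gradient values below are hypothetical.

```python
def sgd_step_l2(theta, grad_task, lr, lam):
    """SGD on L_task + (lam / 2) * ||theta||^2: the penalty adds lam * theta to the gradient."""
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad_task)]

def sgd_step_decay(theta, grad_task, lr, lam):
    """The same step written as multiplicative shrinkage plus the task-gradient step."""
    return [(1 - lr * lam) * t - lr * g for t, g in zip(theta, grad_task)]

theta = [1.0, -2.0, 0.5]
grad_task = [0.1, 0.0, -0.3]    # hypothetical task gradient
via_l2 = sgd_step_l2(theta, grad_task, lr=0.1, lam=0.01)
via_decay = sgd_step_decay(theta, grad_task, lr=0.1, lam=0.01)
# The second weight has zero task gradient, so it is only shrunk:
# -2.0 * (1 - 0.1 * 0.01) = -1.998
```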

Weight Decay vs L2 Regularization

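With the Adam optimizer, weight decay and L2 regularization are not equivalent. L2 regularization enters through the gradient, so the penalty is rescaled by Adam's per-parameter adaptive learning rates; decoupled weight decay shrinks the weights directly, independent of that rescaling. AdamW (Loshchilov and Hutter, 2019) implements the decoupled form.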
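The difference shows up even in a single step on one scalar weight. The sketch below is a simplified illustration with hypothetical function names and values, assuming the standard Adam moment updates with bias correction; it is not a full optimizer.

```python
import math

def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam moment update; returns the normalized step and the new moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)      # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)      # bias-corrected second moment
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(theta, grad_task, lr, lam, m=0.0, v=0.0, t=1):
    # The L2 penalty is folded into the gradient, so it is rescaled
    # by 1 / sqrt(v_hat) together with the task gradient.
    step, m, v = adam_direction(grad_task + lam * theta, m, v, t)
    return theta - lr * step

def adamw_step(theta, grad_task, lr, lam, m=0.0, v=0.0, t=1):
    # Decoupled decay: shrink the weight directly, then apply the Adam
    # step computed from the task gradient alone.
    step, m, v = adam_direction(grad_task, m, v, t)
    return theta * (1 - lr * lam) - lr * step

theta, grad_task, lr, lam = 1.0, 0.5, 0.01, 0.1   # hypothetical values
after_adam_l2 = adam_l2_step(theta, grad_task, lr, lam)   # about 0.990
after_adamw = adamw_step(theta, grad_task, lr, lam)       # about 0.989
```

On the first step Adam's normalized update is close to the sign of the gradient, so the L2 term changes the result very little, while the decoupled decay still removes a fraction $\eta\lambda$ of the weight.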

Data Augmentation

Definition: Data Augmentation

Data augmentation creates transformed versions of training data:

Augmentation	Description	Use Case
Random crop	Crop random patch	Classification
Horizontal flip	Mirror image	Classification
Color jitter	Vary brightness/contrast	Classification
Random rotation	Rotate by random angle	When rotation invariance needed
Mixup	Blend two images and their labels	Regularization
CutMix	Paste a patch from one image into another and mix the labels	Regularization
RandAugment	Apply randomly chosen transforms at a set magnitude	General purpose
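Three of the augmentations in the table can be sketched on images stored as nested lists. The function names, the `rng` argument, and the default `alpha` of 0.2 are assumptions for illustration; real pipelines would normally use an image library instead.

```python
import random

def horizontal_flip(image):
    """Mirror an image stored as a list of rows."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng=random):
    """Cut a random size x size patch out of the image."""
    top = rng.randint(0, len(image) - size)        # randint is inclusive at both ends
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two flattened images and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flipped = horizontal_flip(image)
patch = random_crop(image, size=2, rng=rng)
mixed_x, mixed_y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], rng=rng)
```

Mixup blends the labels with the same coefficient as the pixels, so the mixed label is still a valid probability vector.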

Early Stopping

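Definition: Early Stopping

Monitor validation loss during training and stop once it has stopped improving:

$$\text{Stop at } t^* = \arg\min_t \mathcal{L}_{\text{val}}(t)$$

Because validation loss is noisy, a common choice is to wait a fixed number of epochs (the patience) without improvement before stopping. Save the model checkpoint with the lowest validation loss and restore it at the end. Simple and effective.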
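The stopping rule can be sketched as a function over a recorded sequence of validation losses. The `patience` and `min_delta` parameters and the example loss values are assumptions added for illustration; in a real training loop the check would run once per epoch and the checkpoint would be written where the comment indicates.

```python
def early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than min_delta
    for `patience` consecutive epochs; best_epoch is the checkpoint to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch   # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60, 0.40]   # hypothetical run
best_epoch, stop_epoch = early_stopping(losses, patience=3)
# best_epoch == 3 (loss 0.50), stop_epoch == 6
```

In this example the later improvement at epoch 8 is never seen, which illustrates the trade-off in choosing the patience: too small stops on noise, too large wastes compute.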

Regularization Summary

Summary

Dropout: Randomly zeros neurons during training, acts as implicit ensemble
BatchNorm: Normalizes across batch, allows higher learning rates
LayerNorm: Normalizes across features, preferred for transformers
Weight Decay: Penalizes large weights; tune $\lambda$, and with Adam use the decoupled form (AdamW)
Data Augmentation: Increases effective training set size
Early Stopping: Simple and effective, always monitor validation loss
Combine multiple regularization techniques for best results
