Weight Initialization — The Hidden Key to Training Deep Networks
Poor weight initialization causes vanishing or exploding gradients, making deep networks untrainable before training even begins. Proper initialization preserves activation variance across layers.
- Xavier for Sigmoid/Tanh: preserves variance for activations that are approximately linear near zero
- He/Kaiming for ReLU: doubles the weight variance to compensate for ReLU zeroing out half the activations
- ResNet Zero-Init Trick: initialize the scale of the last batch norm in each residual block to zero for stable deep training
Weight Initialization — Xavier, He, LSUV and Variance Preservation
Weight initialization determines how neural network parameters are set before training begins. Poor initialization leads to vanishing or exploding gradients, making deep networks untrainable.
Why Initialization Matters
Definition: The Initialization Problem
Consider a network with $L$ layers. The output variance after $L$ layers depends on the variance of each layer's weights. If the variance shrinks at each layer, activations vanish. If it grows, activations explode. Initialization must preserve the variance of activations and gradients across layers.
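Theorem: Variance Preservation Requirement
For a linear layer $y = Wx$ with $n_{in}$ inputs, where weights and inputs are independent with zero mean, the variance of each output is:
$$\mathrm{Var}(y) = n_{in}\,\mathrm{Var}(W)\,\mathrm{Var}(x)$$
For variance to be preserved across layers, that is $\mathrm{Var}(y) = \mathrm{Var}(x)$:
$$\mathrm{Var}(W) = \frac{1}{n_{in}}$$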
This is the foundation for both Xavier and He initialization.
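The variance rule can be derived in a few steps. The sketch below assumes a linear layer $y_j = \sum_{i=1}^{n_{in}} W_{ji} x_i$ whose weights and inputs are independent, zero-mean, and identically distributed; these are the usual textbook assumptions rather than anything specific to this article, and the layer superscript $(l)$ is notation introduced here.

```latex
\begin{aligned}
\mathrm{Var}(y_j)
  &= \sum_{i=1}^{n_{in}} \mathrm{Var}(W_{ji}\, x_i)
   = \sum_{i=1}^{n_{in}} \mathrm{Var}(W_{ji})\,\mathrm{Var}(x_i)
   = n_{in}\,\mathrm{Var}(W)\,\mathrm{Var}(x), \\
\mathrm{Var}\big(y^{(L)}\big)
  &= \mathrm{Var}(x)\,\prod_{l=1}^{L} n_{in}^{(l)}\,\mathrm{Var}\big(W^{(l)}\big).
\end{aligned}
```

Each layer multiplies the variance by $n_{in}\,\mathrm{Var}(W)$, so small deviations compound. A per-layer factor of 0.5 over 50 layers gives $0.5^{50} \approx 8.9 \times 10^{-16}$, while a factor of 2 gives $2^{50} \approx 1.1 \times 10^{15}$.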
Random Initialization
Definition: Naive Random Initialization
Initialize weights from a standard normal distribution:
$$W_{ij} \sim \mathcal{N}(0, 1)$$
Problem: For a linear layer with 1000 inputs, each output has roughly 1000 times the variance of the inputs. Activations explode after a few layers, causing numerical overflow and gradient explosion.
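The blow-up is easy to reproduce numerically. The following is a minimal NumPy sketch, assuming purely linear layers and a width of 256 rather than 1000 to keep it fast; `final_variance` is a hypothetical helper name, not a library function.

```python
import numpy as np

def final_variance(weight_std, width=256, depth=5, batch=512, seed=0):
    """Push unit-variance inputs through `depth` linear layers and
    return the variance of the final outputs."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std
        h = h @ W
    return float(h.var())

naive = final_variance(weight_std=1.0)                    # W ~ N(0, 1)
scaled = final_variance(weight_std=1.0 / np.sqrt(256))    # Var(W) = 1 / n_in

print(f"naive:  {naive:.3e}")    # grows roughly like 256**5
print(f"scaled: {scaled:.3f}")   # stays close to 1
```

With unit-variance weights, every layer multiplies the variance by about the layer width, so five layers already produce values around $10^{12}$. Scaling the standard deviation by $1/\sqrt{n_{in}}$ keeps the output variance near that of the input.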
Why Not Zero Initialization?
Initializing all weights to zero causes all neurons in a layer to compute the same output and receive the same gradient. They remain identical throughout training — the network cannot learn different features. Weights must be asymmetric at initialization.
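The symmetry problem can be checked with a hand-written backward pass. This sketch uses a constant value of 0.5 instead of exactly zero (an illustrative choice, so that the gradients are non-zero yet still identical), with a small tanh network and mean squared error chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))   # batch of 32 inputs with 4 features
t = rng.standard_normal((32, 1))   # regression targets

W1 = np.full((4, 3), 0.5)          # every hidden unit starts with the same weights
W2 = np.full((3, 1), 0.5)

# Forward pass: tanh hidden layer, linear output.
h = np.tanh(x @ W1)
y = h @ W2

# Backward pass for the mean squared error, written out by hand.
dy = 2.0 * (y - t) / len(x)
dW2 = h.T @ dy
dh = dy @ W2.T
dW1 = x.T @ (dh * (1.0 - h**2))

# Each column of dW1 belongs to one hidden unit; all columns are the same.
print(np.allclose(dW1, dW1[:, :1]))   # prints True
```

Because every hidden unit receives the same update, a gradient step leaves the units identical, and the layer behaves like a single unit no matter how wide it is. With weights of exactly zero the situation is worse in this setup: all gradients vanish as well.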
Xavier/Glorot Initialization
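Definition: Xavier (Glorot) Initialization
Xavier initialization preserves variance for linear and sigmoid/tanh activations. It sets weights from:
$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$$
or uniform:
$$W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right)$$
where $n_{in}$ is the number of inputs and $n_{out}$ is the number of outputs.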
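Both variants target the same weight variance, $2 / (n_{in} + n_{out})$, since a uniform distribution on $[-a, a]$ has variance $a^2/3$. A minimal NumPy sketch, with `xavier_normal` and `xavier_uniform` as hypothetical helper names and the layer sizes picked arbitrarily:

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Gaussian weights with variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.standard_normal((n_in, n_out)) * std

def xavier_uniform(n_in, n_out, rng):
    """Uniform weights on [-limit, limit] with the same variance."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 400, 600
target = 2.0 / (n_in + n_out)

W_normal = xavier_normal(n_in, n_out, rng)
W_uniform = xavier_uniform(n_in, n_out, rng)

print(f"target variance:  {target:.5f}")
print(f"normal variant:   {W_normal.var():.5f}")
print(f"uniform variant:  {W_uniform.var():.5f}")
```

The two empirical variances should land close to the target; the choice between normal and uniform sampling changes the shape of the distribution, not its scale.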
He/Kaiming Initialization
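Definition: He (Kaiming) Initialization
He initialization accounts for ReLU, which zeros out approximately half the activations:
$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$$
The factor of 2 compensates for ReLU zeroing out half the variance. For Leaky ReLU with slope $\alpha$:
$$\mathrm{Var}(W) = \frac{2}{(1 + \alpha^2)\, n_{in}}$$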
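The effect of the factor of 2 shows up after only a few ReLU layers. The sketch below is an illustration under assumed settings (width 512, depth 10, Gaussian inputs); `he_normal`, `no_relu_correction`, and `last_preactivation_variance` are hypothetical helper names.

```python
import numpy as np

def he_normal(n_in, n_out, rng, negative_slope=0.0):
    """Gaussian weights with variance 2 / ((1 + slope**2) * n_in)."""
    var = 2.0 / ((1.0 + negative_slope**2) * n_in)
    return rng.standard_normal((n_in, n_out)) * np.sqrt(var)

def no_relu_correction(n_in, n_out, rng):
    """Gaussian weights with variance 1 / n_in (no factor of 2)."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def last_preactivation_variance(init, width=512, depth=10, batch=256, seed=0):
    """Variance of the final layer's pre-activations in a ReLU MLP."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))
    for _ in range(depth):
        z = a @ init(width, width, rng)   # pre-activation
        a = np.maximum(z, 0.0)            # ReLU
    return float(z.var())

he_var = last_preactivation_variance(he_normal)
plain_var = last_preactivation_variance(no_relu_correction)

print(f"He (2 / n_in):        {he_var:.4f}")
print(f"uncorrected (1/n_in): {plain_var:.6f}")
```

With He scaling the pre-activation variance stays at the same order of magnitude from layer to layer, whereas without the correction it roughly halves at every ReLU layer and has shrunk by several hundred times after ten layers.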
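Definition: Initialization Comparison
| Method | Variance | Best For | Formula ($n_{in}$ inputs, $n_{out}$ outputs) |
|---|---|---|---|
| Xavier/Glorot | $\frac{2}{n_{in} + n_{out}}$ | Sigmoid, Tanh | $W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$ |
| He/Kaiming | $\frac{2}{n_{in}}$ | ReLU, Leaky ReLU | $W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$ |
| Orthogonal | 1 | RNNs | Orthogonal matrix |
| LSUV | Data-driven | Deep nets | Layer-wise normalization |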
Orthogonal Initialization
Definition: Orthogonal Initialization
For RNNs, orthogonal initialization sets the recurrent weight matrix to an orthogonal matrix:
$$W^\top W = I$$
This preserves gradient magnitude through time because $\|Wx\| = \|x\|$ for all $x$.
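One common way to obtain such a matrix is the QR decomposition of a random Gaussian matrix. The following is a minimal NumPy sketch for the square case; `orthogonal` is a hypothetical helper name, and the repeated multiplication ignores the nonlinearity of a real RNN.

```python
import numpy as np

def orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs by the sign of R's diagonal to remove the
    # sign ambiguity of the QR factorization.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
W = orthogonal(64, rng)

h = rng.standard_normal(64)
start_norm = np.linalg.norm(h)
for _ in range(100):          # 100 "time steps" of a linear recurrence
    h = W @ h
ratio = np.linalg.norm(h) / start_norm

print(np.allclose(W.T @ W, np.eye(64)))   # prints True
print(f"norm ratio after 100 steps: {ratio:.6f}")
```

The norm ratio stays at 1 up to floating-point error, whereas a matrix with singular values above or below 1 would amplify or shrink the vector exponentially over the same number of steps.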
LSUV (Layer-Sequential Unit-Variance)
Definition: LSUV Initialization
LSUV is a data-driven initialization method:
- Initialize all weights to orthogonal matrices
- For each layer (in order):
  - Forward pass a batch of data
  - Compute the output variance $\mathrm{Var}(y)$
  - Scale weights: $W \leftarrow W / \sqrt{\mathrm{Var}(y)}$
  - Repeat until the variance is within a small tolerance of 1.0
Because it measures variance empirically, LSUV applies to a wide range of architectures and doesn't require an analytic formula for the activation function.
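The procedure can be sketched in a few lines of NumPy. This is a simplified illustration, not a reference implementation: it assumes square fully connected layers with tanh activations, measures the variance of each layer's pre-activation output, and uses hypothetical names (`orthogonal`, `lsuv_init`) and arbitrary tolerance and iteration limits.

```python
import numpy as np

def orthogonal(n, rng):
    """Random n x n orthogonal matrix via QR."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

def lsuv_init(weights, batch, tol=0.01, max_iters=10):
    """Rescale each weight matrix in place until its output variance is ~1."""
    a = batch
    for W in weights:                      # layers in forward order
        for _ in range(max_iters):
            var = (a @ W).var()            # forward pass, measure variance
            if abs(var - 1.0) < tol:
                break
            W /= np.sqrt(var)              # W <- W / sqrt(Var(y))
        a = np.tanh(a @ W)                 # input for the next layer
    return weights

rng = np.random.default_rng(0)
width, depth = 128, 6
weights = [orthogonal(width, rng) for _ in range(depth)]
batch = rng.standard_normal((256, width)) * 3.0   # deliberately badly scaled data

lsuv_init(weights, batch)

# Re-run the forward pass and record each layer's output variance.
a, variances = batch, []
for W in weights:
    z = a @ W
    variances.append(float(z.var()))
    a = np.tanh(z)
print(np.round(variances, 3))
```

For a plain linear map a single rescaling already brings the variance to 1, so the inner loop usually exits on its second pass; the iteration matters more when the measured output depends nonlinearly on the weights.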
Initialization for ResNets
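Definition: ResNet Initialization Trick
ResNets are commonly trained with a special initialization for residual blocks:
- Initialize all residual block weights with He initialization
- Set the scale parameter $\gamma$ of the last batch normalization layer in each residual block to $\gamma = 0$
This ensures that at initialization the residual branch of each block outputs zero, so each block reduces to its skip connection and the network behaves like a much shallower one. As training progresses, $\gamma$ moves away from zero and the residual branches begin to contribute.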
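The identity behavior at initialization can be verified on a toy model. The sketch below assumes fully connected residual blocks of the form `x + BN(ReLU(x W1) W2)` with no activation after the addition, which is a simplification of a real convolutional ResNet block; all function names are hypothetical.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Per-feature batch normalization using batch statistics."""
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def residual_block(x, W1, W2, gamma, beta):
    """Skip connection plus a two-layer branch that ends in batch norm."""
    branch = np.maximum(x @ W1, 0.0) @ W2
    return x + batch_norm(branch, gamma, beta)

def run_stack(x, gamma_value, depth=50, seed=0):
    """Apply `depth` He-initialized blocks whose last BN scale is `gamma_value`."""
    rng = np.random.default_rng(seed)
    width = x.shape[1]
    h = x
    for _ in range(depth):
        W1 = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        W2 = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        gamma = np.full(width, gamma_value)
        beta = np.zeros(width)
        h = residual_block(h, W1, W2, gamma, beta)
    return h

x = np.random.default_rng(1).standard_normal((128, 64))
out_zero = run_stack(x, gamma_value=0.0)
out_one = run_stack(x, gamma_value=1.0)

print(np.allclose(out_zero, x))                       # prints True
print(f"variance with gamma = 1: {out_one.var():.1f}")
```

With $\gamma = 0$ the 50-block stack returns its input unchanged, while with $\gamma = 1$ every block adds roughly one unit of variance, so the signal scale drifts with depth.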
Practical Recommendations
PyTorch Default Initialization
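PyTorch initializes linear and convolutional layers with `kaiming_uniform_` by default, called with `a=sqrt(5)`. This works out to $\mathcal{U}(-1/\sqrt{n_{in}},\ 1/\sqrt{n_{in}})$, where $n_{in}$ is the layer's fan-in, rather than the ReLU-specific He variance of $2/n_{in}$. For most deep learning tasks, this is sufficient. Override only when: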
- Using non-ReLU activations (use Xavier for sigmoid/tanh)
- Training very deep networks (consider LSUV or zero-initialized residual branches)
- Working with RNNs (need orthogonal initialization)
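Overriding the defaults takes a few lines with `torch.nn.init`. The sketch below shows two of the cases above under assumed layer sizes: Xavier for a small tanh MLP and orthogonal recurrent weights for a vanilla `nn.RNN`. Zeroing the biases and using Xavier for the input-to-hidden weights are illustrative choices, not requirements.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tanh MLP: replace the default initialization with Xavier uniform.
mlp = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 10))
for module in mlp.modules():
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

# Vanilla RNN: orthogonal recurrent weights, Xavier input weights, zero biases.
rnn = nn.RNN(input_size=32, hidden_size=64)
for name, param in rnn.named_parameters():
    if name.startswith("weight_hh"):
        nn.init.orthogonal_(param)
    elif name.startswith("weight_ih"):
        nn.init.xavier_uniform_(param)
    else:
        nn.init.zeros_(param)

# Check that the recurrent matrix is orthogonal.
W_hh = rnn.weight_hh_l0.detach()
print(torch.allclose(W_hh @ W_hh.T, torch.eye(64), atol=1e-5))
```

The `nn.init` functions modify tensors in place without recording gradients, so they can be called on a freshly constructed model before training starts.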
Summary
- Xavier/Glorot for sigmoid/tanh: preserves variance
- He/Kaiming for ReLU: doubles variance to compensate for ReLU
- Orthogonal for RNNs: preserves gradient magnitude through time
- Zero-init for ResNet last BN: ensures identity mapping at initialization
- LSUV: data-driven, works for any architecture