Weight Initialization — The Hidden Key to Training Deep Networks

Poor weight initialization causes vanishing or exploding gradients, making deep networks untrainable before training even begins. Proper initialization preserves activation variance across layers.

Xavier for Sigmoid/Tanh — Preserves variance for activations that are approximately linear near zero
He/Kaiming for ReLU — Doubles variance to compensate for ReLU zeroing out half the activations
ResNet Zero-Init Trick — Initialize the scale $\gamma$ of the last batch norm in each residual block to zero for stable deep training

Weight Initialization — Xavier, He, LSUV and Variance Preservation

Weight initialization determines how neural network parameters are set before training begins. Poor initialization leads to vanishing or exploding gradients, making deep networks untrainable.

Why Initialization Matters

Definition: The Initialization Problem

Consider a network with $L$ layers. The output variance after $L$ layers depends on the variance of each layer's weights. If the variance shrinks at each layer, activations vanish. If it grows, activations explode. Initialization must preserve the variance of activations and gradients across layers.

Theorem: Variance Preservation Requirement

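For a linear layer $\mathbf{h}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)}$ with $n_{l-1}$ inputs, assume the weights are i.i.d. with zero mean and independent of the inputs, and that the inputs have zero mean. The variance of each output unit is then: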

\text{Var}(\mathbf{h}^{(l)}) = n_{l-1} \cdot \text{Var}(W) \cdot \text{Var}(\mathbf{h}^{(l-1)})

For variance to be preserved across layers:

\text{Var}(W) = \frac{1}{n_{l-1}}

This is the foundation for both Xavier and He initialization.
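The requirement above can be checked numerically. The following is a minimal NumPy sketch, assuming a stack of purely linear layers of equal width; the function name `propagate_variance` and the width, depth, and batch size are illustrative choices, not part of the original text.

```python
import numpy as np

def propagate_variance(n, depth, weight_var, batch=512, seed=0):
    """Return the activation variance after each of `depth` linear layers."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, n))            # unit-variance inputs
    variances = []
    for _ in range(depth):
        # Zero-mean Gaussian weights with the requested variance
        W = rng.standard_normal((n, n)) * np.sqrt(weight_var)
        h = h @ W.T                                # h^(l) = W h^(l-1), no nonlinearity
        variances.append(float(h.var()))
    return variances

n, depth = 256, 10
preserved = propagate_variance(n, depth, weight_var=1.0 / n)   # Var(W) = 1/n
shrinking = propagate_variance(n, depth, weight_var=0.5 / n)   # half of that
growing = propagate_variance(n, depth, weight_var=2.0 / n)     # double of that

print("Var(W) = 1/n   :", preserved[-1])
print("Var(W) = 0.5/n :", shrinking[-1])
print("Var(W) = 2/n   :", growing[-1])
```

With $\text{Var}(W) = 1/n$ the final variance should stay near 1, while halving or doubling the weight variance shrinks or grows it by roughly a factor of $2^{10}$ over ten layers.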

Random Initialization

Definition: Naive Random Initialization

Initialize weights from a standard normal distribution:

W \sim \mathcal{N}(0, 1)

Problem: For a layer with 1000 inputs, the output variance is 1000 times the input variance. Activations explode after a few layers, causing numerical overflow and gradient explosion.
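The factor of 1000 can be reproduced with a few lines of NumPy. This is only an illustrative sketch; the batch size and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 1000, 1000, 256

x = rng.standard_normal((batch, n_in))     # inputs with variance close to 1
W = rng.standard_normal((n_out, n_in))     # naive init: W ~ N(0, 1)
h = x @ W.T                                # one linear layer

ratio = h.var() / x.var()                  # should be close to n_in
print(f"variance ratio after one layer: {ratio:.0f}")

# Under the same rule, k stacked linear layers scale the variance by n_in ** k.
for k in (1, 2, 5, 10):
    print(f"{k:2d} layers -> variance scaled by {float(n_in) ** k:.1e}")
```

Since float32 overflows near $3.4 \times 10^{38}$, a growth factor of $10^{3}$ per layer leaves very little headroom in a deep network.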

Why Not Zero Initialization?

Initializing all weights to zero causes all neurons in a layer to compute the same output and receive the same gradient. They remain identical throughout training — the network cannot learn different features. Weights must be asymmetric at initialization.
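The symmetry problem can be seen directly in the gradients. Below is a small sketch with hand-written backpropagation for a bias-free two-layer tanh network; the network shape, the loss, and the helper name `hidden_weight_grad` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))        # 8 samples, 4 features
y = rng.standard_normal((8, 1))        # regression targets

def hidden_weight_grad(W1, W2):
    """Gradient of 0.5 * mean squared error w.r.t. W1 for a tanh MLP."""
    a = np.tanh(x @ W1.T)              # hidden activations, shape (8, 3)
    err = a @ W2.T - y                 # output error, shape (8, 1)
    delta = (err @ W2) * (1.0 - a**2)  # backprop through the tanh
    return delta.T @ x / len(x)        # shape (3, 4): one row per hidden unit

# Constant init: every hidden unit starts with the same weights.
g_const = hidden_weight_grad(np.full((3, 4), 0.5), np.full((1, 3), 0.5))
# Zero init: no gradient reaches the hidden weights at all.
g_zero = hidden_weight_grad(np.zeros((3, 4)), np.zeros((1, 3)))
# Random init: each hidden unit gets its own gradient.
g_rand = hidden_weight_grad(rng.standard_normal((3, 4)) * 0.5,
                            rng.standard_normal((1, 3)) * 0.5)

print("constant init, rows identical:", np.allclose(g_const[0], g_const[1]))
print("zero init, gradient all zero :", np.allclose(g_zero, 0.0))
print("random init, rows identical  :", np.allclose(g_rand[0], g_rand[1]))
```

Identical rows mean identical updates, so units that start equal stay equal after every gradient step.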

Xavier/Glorot Initialization

Definition: Xavier (Glorot) Initialization

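Xavier initialization aims to preserve variance in both the forward and the backward pass for linear activations and for activations that are approximately linear near zero, such as tanh and sigmoid. The forward pass calls for $\text{Var}(W) = 1/n_{\text{in}}$ and the backward pass for $\text{Var}(W) = 1/n_{\text{out}}$; Xavier uses the compromise $2/(n_{\text{in}} + n_{\text{out}})$. It draws weights from a normal distribution with that variance: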

W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)

or from a uniform distribution with the same variance:

W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)

where $n_{\text{in}}$ is the number of inputs and $n_{\text{out}}$ is the number of outputs.
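As a quick sanity check, both variants can be sampled and their empirical variance compared with $2/(n_{\text{in}} + n_{\text{out}})$. The function names `xavier_normal` and `xavier_uniform` below are illustrative helpers written for this sketch, not a library API.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """W ~ N(0, 2 / (n_in + n_out)), returned with shape (n_out, n_in)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.standard_normal((n_out, n_in)) * std

def xavier_uniform(n_in, n_out, rng):
    """W ~ U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
n_in, n_out = 400, 600
target = 2.0 / (n_in + n_out)

W_n = xavier_normal(n_in, n_out, rng)
W_u = xavier_uniform(n_in, n_out, rng)

print("target variance :", target)
print("normal variant  :", W_n.var())
print("uniform variant :", W_u.var())
```

The two variants agree because a uniform distribution on $(-a, a)$ has variance $a^2/3$, and $a^2 = 6/(n_{\text{in}} + n_{\text{out}})$ gives exactly $2/(n_{\text{in}} + n_{\text{out}})$.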

He/Kaiming Initialization

Definition: He (Kaiming) Initialization

He initialization accounts for ReLU, which zeros out approximately half the activations:

W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)

ReLU zeros the negative half of a zero-mean pre-activation, which halves its second moment; the factor of 2 restores it. For Leaky ReLU with negative slope $\alpha$:

W \sim \mathcal{N}\left(0, \frac{2}{(1 + \alpha^2) n_{\text{in}}}\right)
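The effect of the factor of 2 shows up when the pre-activation variance is tracked through a stack of ReLU layers. This is a rough simulation under assumed settings (width 512, depth 10, Gaussian inputs); `relu_forward_variances` is a name introduced here for illustration.

```python
import numpy as np

def relu_forward_variances(n, depth, weight_var, batch=256, seed=0):
    """Pre-activation variance at each layer of a ReLU MLP of width n."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, n))
    variances = []
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(weight_var)
        z = h @ W.T                      # pre-activation
        variances.append(float(z.var()))
        h = np.maximum(z, 0.0)           # ReLU
    return variances

n, depth = 512, 10
he = relu_forward_variances(n, depth, weight_var=2.0 / n)      # He: Var(W) = 2/n
naive = relu_forward_variances(n, depth, weight_var=1.0 / n)   # Var(W) = 1/n

print("He     last/first:", he[-1] / he[0])
print("1/n    last/first:", naive[-1] / naive[0])
```

With $\text{Var}(W) = 2/n$ the ratio should stay on the order of 1, whereas $\text{Var}(W) = 1/n$ loses roughly a factor of 2 per ReLU layer.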

Initialization Comparison

Method	Variance	Best For	Formula
Xavier/Glorot	$\frac{2}{n_{\text{in}} + n_{\text{out}}}$	Sigmoid, Tanh	$\mathcal{N}(0, \frac{2}{n_{\text{in}}+n_{\text{out}}})$
He/Kaiming	$\frac{2}{n_{\text{in}}}$	ReLU, Leaky ReLU	$\mathcal{N}(0, \frac{2}{n_{\text{in}}})$
Orthogonal	$\frac{1}{n}$ per entry (all singular values equal 1)	RNNs	Orthogonal matrix
LSUV	Data-driven	Deep nets	Layer-wise normalization

Orthogonal Initialization

Definition: Orthogonal Initialization

For RNNs, orthogonal initialization sets the recurrent weight matrix to an orthogonal matrix:

\mathbf{W}_{hh} = \mathbf{Q} \quad \text{where} \quad \mathbf{Q}^T \mathbf{Q} = \mathbf{I}

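This helps preserve gradient magnitude through time: every singular value of $\mathbf{Q}$ equals 1, so the spectral norm satisfies $\|\mathbf{W}_{hh}^t\|_2 = 1$ for all $t$ at initialization. The derivative of the activation function can still shrink gradients.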
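One common way to obtain such a matrix is the QR decomposition of a Gaussian matrix. The sketch below assumes that construction (the helper name `orthogonal` is illustrative) and compares repeated multiplication by $\mathbf{Q}$ with slightly rescaled versions of it.

```python
import numpy as np

def orthogonal(n, rng):
    """Sample an n x n orthogonal matrix via QR of a Gaussian matrix."""
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    # Flip column signs so the result does not depend on the QR sign convention.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
n, steps = 64, 50
Q = orthogonal(n, rng)
v = rng.standard_normal(n)

def norm_after(W, v, steps):
    """Norm of W^steps v, computed by repeated matrix-vector products."""
    for _ in range(steps):
        v = W @ v
    return float(np.linalg.norm(v))

start = float(np.linalg.norm(v))
print("initial norm        :", start)
print("orthogonal Q        :", norm_after(Q, v, steps))
print("scaled down, 0.9 * Q:", norm_after(0.9 * Q, v, steps))
print("scaled up,   1.1 * Q:", norm_after(1.1 * Q, v, steps))
```

Only the exactly orthogonal matrix keeps the norm unchanged; a 10% deviation in scale compounds to a factor of about $0.9^{50}$ or $1.1^{50}$ over 50 time steps.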

LSUV (Layer-Sequential Unit-Variance)

Definition: LSUV Initialization

LSUV is a data-driven initialization method:

Initialize all weights with orthogonal matrices
For each layer (in order):
- Forward pass a mini-batch of data
- Compute the variance of the layer's output
- Scale weights: $\mathbf{W} \leftarrow \mathbf{W} / \sqrt{\text{Var}(\text{output})}$
- Repeat until the variance is within a small tolerance of 1.0 or an iteration limit is reached

LSUV works for any architecture and doesn't require knowing activation functions.
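The procedure can be sketched for a small fully connected network in NumPy. This is a simplified illustration, not a reference implementation: the function names, the tanh activation, the tolerance, and the layer sizes are all assumptions.

```python
import numpy as np

def orthogonal(rows, cols, rng):
    """Matrix of shape (rows, cols) with orthonormal rows or columns, via QR."""
    A = rng.standard_normal((max(rows, cols), min(rows, cols)))
    Q, _ = np.linalg.qr(A)               # Q has orthonormal columns
    return Q if rows >= cols else Q.T

def lsuv_init(layer_sizes, x, activation=np.tanh, tol=0.05, max_iter=10, seed=0):
    """Orthogonal init followed by layer-wise rescaling to unit output variance."""
    rng = np.random.default_rng(seed)
    weights = [orthogonal(n_out, n_in, rng)
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    h = x
    for i, W in enumerate(weights):
        for _ in range(max_iter):
            var = (h @ W.T).var()        # variance of this layer's output
            if abs(var - 1.0) < tol:
                break
            W = W / np.sqrt(var)         # rescale toward unit variance
        weights[i] = W
        h = activation(h @ W.T)          # feed the next layer
    return weights

def preactivation_variances(weights, x, activation=np.tanh):
    """Output variance of every layer for the batch x."""
    h, out = x, []
    for W in weights:
        z = h @ W.T
        out.append(float(z.var()))
        h = activation(z)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((256, 32)) * 3.0           # deliberately badly scaled data
weights = lsuv_init([32, 64, 64, 16], x)
print(preactivation_variances(weights, x))
```

Because each layer here is linear in its weights, a single rescaling step already brings the variance to 1; the loop matters for layers where that relationship is only approximate.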

Initialization for ResNets

Definition: ResNet Initialization Trick

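A common trick for training deep residual networks modifies the initialization of each residual block:

Initialize all residual block weights with He initialization
Set the scale parameter of the last batch normalization in each residual block to $\gamma = 0$

With the batch norm shift $\beta$ at its usual initial value of zero, each residual branch outputs zero at initialization, so every block reduces to the identity mapping and the network behaves like a much shallower one. As training progresses, $\gamma$ moves away from zero and the residual branches start to contribute.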

\mathbf{h}^{(l+1)} = \mathbf{h}^{(l)} + \text{BN}(\mathcal{F}(\mathbf{h}^{(l)})) \quad \text{with } \gamma_{\text{last}} = 0
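The identity behavior at initialization can be verified with a toy residual block. The sketch below assumes a fully connected branch (linear, ReLU, linear) followed by training-mode batch norm; the function names and sizes are illustrative, and the inner batch norm of a real block is omitted for brevity.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Training-mode batch norm: normalize over the batch axis, then scale and shift."""
    z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
    return gamma * z_hat + beta

def residual_block(h, W1, W2, gamma_last, beta_last):
    """h + BN(F(h)) with F = linear -> ReLU -> linear."""
    f = np.maximum(h @ W1.T, 0.0) @ W2.T
    return h + batch_norm(f, gamma_last, beta_last)

rng = np.random.default_rng(0)
n = 32
h = rng.standard_normal((16, n))
W1 = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)   # He initialization
W2 = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)

out_zero = residual_block(h, W1, W2, gamma_last=np.zeros(n), beta_last=np.zeros(n))
out_one = residual_block(h, W1, W2, gamma_last=np.ones(n), beta_last=np.zeros(n))

print("gamma = 0 -> block is identity:", np.allclose(out_zero, h))
print("gamma = 1 -> block is identity:", np.allclose(out_one, h))
```

With $\gamma = 0$ the block passes its input through unchanged, so stacking any number of such blocks still yields the identity at initialization.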

Practical Recommendations

PyTorch Default Initialization

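PyTorch initializes linear and convolutional layers with Kaiming uniform by default, called with $a = \sqrt{5}$, which works out to $\mathcal{U}(-1/\sqrt{n_{\text{in}}}, 1/\sqrt{n_{\text{in}}})$ rather than the ReLU-scaled He variance of $2/n_{\text{in}}$. For most deep learning tasks, this is sufficient. Override only when: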

Using non-ReLU activations (use Xavier for sigmoid/tanh)
Training very deep networks (consider zero-initializing the last batch norm $\gamma$ in residual blocks, or LSUV)
Working with RNNs (need orthogonal initialization)
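A possible way to apply these overrides in PyTorch is a per-module function passed to `Module.apply`. The sketch below is one option under stated assumptions: `init_weights` is a hypothetical helper, and the three small models exist only to exercise it. The `torch.nn.init` functions used are part of PyTorch.

```python
import torch
from torch import nn

def init_weights(module):
    """Hypothetical helper: choose an initializer based on the layer type."""
    if isinstance(module, nn.Linear):
        # Xavier for layers followed by tanh or sigmoid
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv2d):
        # He/Kaiming for layers followed by ReLU
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.RNN):
        # Orthogonal recurrent weights; input weights keep their defaults
        for name, param in module.named_parameters():
            if name.startswith("weight_hh"):
                nn.init.orthogonal_(param)

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(64, 128), nn.Tanh(), nn.Linear(128, 10))
cnn = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3), nn.ReLU())
rnn = nn.RNN(input_size=32, hidden_size=64)

for model in (mlp, cnn, rnn):
    model.apply(init_weights)            # visits every submodule and the model itself

W_hh = rnn.weight_hh_l0.detach()
print(torch.allclose(W_hh @ W_hh.T, torch.eye(64), atol=1e-4))
```

Dispatching on the layer type keeps the choice of initializer next to the activation it is meant for; a model that mixes activations would need a finer-grained rule.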

Summary

Xavier/Glorot for sigmoid/tanh: preserves variance with $\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$
He/Kaiming for ReLU: doubles the variance to compensate for ReLU, $\text{Var}(W) = \frac{2}{n_{\text{in}}}$
Orthogonal for RNNs: preserves gradient magnitude through time
Zero-init for ResNet last BN: ensures identity mapping at initialization
LSUV: data-driven, works for any architecture

