
Weight Initialization — Xavier, He, LSUV and Variance Preservation


Weight Initialization — The Hidden Key to Training Deep Networks

Poor weight initialization causes vanishing or exploding gradients, making deep networks untrainable before training even begins. Proper initialization preserves activation variance across layers.

  • Xavier for Sigmoid/Tanh — Preserves variance for activations that are approximately linear near zero
  • He/Kaiming for ReLU — Doubles variance to compensate for ReLU zeroing out half the activations
  • ResNet Zero-Init Trick — Initialize last batch norm to zero in residual blocks for stable deep training


Weight initialization determines how neural network parameters are set before training begins. Poor initialization leads to vanishing or exploding gradients, making deep networks untrainable.


Why Initialization Matters

Definition: The Initialization Problem

Consider a network with $L$ layers. The output variance after $L$ layers depends on the variance of each layer's weights. If the variance shrinks at each layer, activations vanish. If it grows, activations explode. Initialization must preserve the variance of activations and gradients across layers.

Theorem: Variance Preservation Requirement

For a linear layer $\mathbf{h}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)}$ with $n_{l-1}$ inputs, assume the weights are independent and zero-mean, and independent of the zero-mean inputs. The variance of each output unit is then:

$$\text{Var}(\mathbf{h}^{(l)}) = n_{l-1} \cdot \text{Var}(W) \cdot \text{Var}(\mathbf{h}^{(l-1)})$$

For variance to be preserved across layers, the factor $n_{l-1} \cdot \text{Var}(W)$ must equal 1:

$$\text{Var}(W) = \frac{1}{n_{l-1}}$$

This is the foundation for both Xavier and He initialization.
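
The variance rule above can be checked numerically. The following is a minimal NumPy sketch, not part of the original lesson: the helper name `propagate_variance`, the layer width of 512, and the depth of 10 are illustrative assumptions. It pushes a unit-variance batch through a stack of purely linear layers and records the activation variance for three choices of $n \cdot \text{Var}(W)$.

```python
import numpy as np

def propagate_variance(weight_var, n=512, depth=10, batch=2000, seed=0):
    """Return the activation variance after each of `depth` linear layers."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, n))          # inputs with variance close to 1
    variances = [h.var()]
    for _ in range(depth):
        # Zero-mean Gaussian weights with the requested variance
        W = rng.standard_normal((n, n)) * np.sqrt(weight_var)
        h = h @ W                                # linear layer, no activation
        variances.append(h.var())
    return variances

n = 512
preserved = propagate_variance(1.0 / n, n=n)     # n * Var(W) = 1
vanishing = propagate_variance(0.5 / n, n=n)     # n * Var(W) = 0.5
exploding = propagate_variance(2.0 / n, n=n)     # n * Var(W) = 2

for name, v in [("preserved", preserved), ("vanishing", vanishing), ("exploding", exploding)]:
    print(f"{name:>9}: input var = {v[0]:.3f}, var after 10 layers = {v[-1]:.3e}")
```

With $n \cdot \text{Var}(W) = 1$ the variance should stay near 1, while factors of 0.5 and 2 should shrink or grow it by roughly $0.5^{10}$ and $2^{10}$ respectively.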

Figure: Initialization impact on activation variance across layers 1 to 5, showing the exploding, preserved, and vanishing regimes. Good initialization preserves variance across layers.

Random Initialization

Definition: Naive Random Initialization

Initialize weights from a standard normal distribution:

$$W \sim \mathcal{N}(0, 1)$$

Problem: For a layer with 1000 inputs, the output variance is 1000 times the input variance. Activations explode after a few layers, causing numerical overflow and gradient explosion.
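
To make the factor of 1000 concrete, here is a small NumPy sketch under assumed settings (a batch of 256 samples, five square layers of width 1000, no activation function). It measures the variance gain of a single naively initialized layer and the variance after a few such layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch, depth = 1000, 256, 5

h = rng.standard_normal((batch, n))   # unit-variance inputs
input_var = h.var()
variances = []
for _ in range(depth):
    W = rng.standard_normal((n, n))   # naive N(0, 1) weights
    h = h @ W                         # linear layer only
    variances.append(h.var())

first_layer_ratio = variances[0] / input_var
print(f"variance gain of one layer: {first_layer_ratio:.0f}")
print(f"variance after {depth} layers: {variances[-1]:.2e}")
```

Each layer multiplies the variance by roughly $n = 1000$, so after five layers the variance is on the order of $1000^5 = 10^{15}$. Float32 activations would overflow after only a few more layers.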

Why Not Zero Initialization?

Initializing all weights to zero causes all neurons in a layer to compute the same output and receive the same gradient. They remain identical throughout training — the network cannot learn different features. Weights must be asymmetric at initialization.
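
The symmetry problem can be seen in a tiny experiment. The sketch below is illustrative only: the two-layer tanh network without biases, the toy regression target, and the helper name `train_constant_init` are all assumptions. Every weight starts at the same constant `c`, and plain gradient descent is run for a few steps.

```python
import numpy as np

def train_constant_init(c, steps=50, lr=0.1, seed=0):
    """Train a 3-4-1 tanh network whose weights all start at the constant c."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((64, 3))
    y = np.sin(X.sum(axis=1, keepdims=True))     # toy regression target
    W1 = np.full((3, 4), c, dtype=float)         # all hidden units identical
    W2 = np.full((4, 1), c, dtype=float)
    for _ in range(steps):
        h = np.tanh(X @ W1)
        pred = h @ W2
        err = (pred - y) / len(X)                # gradient of 0.5 * mean squared error
        grad_W2 = h.T @ err
        grad_h = err @ W2.T
        grad_W1 = X.T @ (grad_h * (1.0 - h**2))  # tanh'(z) = 1 - tanh(z)^2
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2

W1, W2 = train_constant_init(0.5)
print("columns of W1 still identical:", np.allclose(W1, W1[:, :1]))
print("W1 moved away from its initial value:", not np.allclose(W1, 0.5))

Z1, Z2 = train_constant_init(0.0)
print("zero init, weights unchanged:", bool(np.all(Z1 == 0) and np.all(Z2 == 0)))
```

With a non-zero constant the weights do change, but every hidden unit receives the same update, so the four columns of `W1` stay identical. With exactly zero weights in this bias-free network, all gradients are zero and nothing is learned at all.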


Xavier/Glorot Initialization

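Definition: Xavier (Glorot) Initialization

Xavier initialization targets linear activations and activations that are approximately linear near zero, such as tanh and sigmoid. Preserving variance in the forward pass requires $\text{Var}(W) = 1/n_{\text{in}}$, while preserving it in the backward pass requires $\text{Var}(W) = 1/n_{\text{out}}$; Xavier uses a compromise between the two. It draws weights from a normal distribution with variance $\frac{2}{n_{\text{in}} + n_{\text{out}}}$:

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$

or from a uniform distribution with the same variance:

$$W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$$

where $n_{\text{in}}$ is the number of inputs to the layer and $n_{\text{out}}$ is the number of outputs.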
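
Both Xavier variants can be sketched in a few lines of NumPy. The function names `xavier_normal` and `xavier_uniform` and the layer shape of 400 by 600 are illustrative assumptions, not an existing library API. The uniform limit follows from the fact that $\mathcal{U}(-a, a)$ has variance $a^2/3$, so $a = \sqrt{6/(n_{\text{in}} + n_{\text{out}})}$ gives the same variance as the normal version.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Gaussian weights with variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.standard_normal((n_in, n_out)) * std

def xavier_uniform(n_in, n_out, rng):
    """Uniform weights on [-limit, limit] with the same variance."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 400, 600
W_normal = xavier_normal(n_in, n_out, rng)
W_uniform = xavier_uniform(n_in, n_out, rng)
target_var = 2.0 / (n_in + n_out)

print(f"target variance:  {target_var:.5f}")
print(f"normal variant:   {W_normal.var():.5f}")
print(f"uniform variant:  {W_uniform.var():.5f}")
```

The two empirical variances should both land close to the target, which confirms that the normal and uniform forms are two samplings of the same variance.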

Figure: Xavier initialization, variance analysis. An input with $\text{Var}(\mathbf{h}^{(l-1)}) = 1$ and $n_{\text{in}}$ inputs passes through weights with $\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$, giving an output with $\text{Var}(\mathbf{h}^{(l)}) = n_{\text{in}} \cdot \text{Var}(W) \cdot \text{Var}(\mathbf{h}^{(l-1)}) \approx 1$.

He/Kaiming Initialization

Definition: He (Kaiming) Initialization

He initialization accounts for ReLU, which zeros out approximately half the activations:

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$$

For a zero-mean, symmetric pre-activation, ReLU keeps only the positive half and so halves the second moment of the signal; the factor of 2 compensates for this loss. For Leaky ReLU with negative slope $\alpha$:

$$W \sim \mathcal{N}\left(0, \frac{2}{(1 + \alpha^2) n_{\text{in}}}\right)$$
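
The effect of the factor of 2 shows up quickly in a stack of ReLU layers. The following NumPy sketch is a rough illustration under assumed settings (width 512, depth 10, hypothetical helper names `he_std` and `mean_square_after_relu_stack`). It compares the mean squared activation under He scaling with Xavier scaling for square layers, where Xavier's variance reduces to $1/n$.

```python
import numpy as np

def he_std(n_in, negative_slope=0.0):
    """Standard deviation for He init; negative_slope=0 is plain ReLU."""
    return np.sqrt(2.0 / ((1.0 + negative_slope**2) * n_in))

def mean_square_after_relu_stack(weight_std, n=512, depth=10, batch=1000, seed=0):
    """Mean squared activation after `depth` dense + ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, n))
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * weight_std
        h = np.maximum(h @ W, 0.0)               # ReLU
    return np.mean(h**2)

n = 512
xavier_std = np.sqrt(2.0 / (n + n))              # Xavier with n_in = n_out = n
with_he = mean_square_after_relu_stack(he_std(n), n=n)
with_xavier = mean_square_after_relu_stack(xavier_std, n=n)

print(f"He:     mean square after 10 ReLU layers = {with_he:.3f}")
print(f"Xavier: mean square after 10 ReLU layers = {with_xavier:.2e}")
```

Under He scaling the signal magnitude should stay on the order of 1, while under Xavier scaling it loses about a factor of 2 per layer, roughly $2^{-10}$ after ten layers.
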

Figure: He vs Xavier, ReLU compensation. With Xavier scaling, a signal with variance 1.0 drops to 0.5 after ReLU zeros half of it, so it vanishes with depth. With He scaling, the variance after ReLU stays at 1.0.

Definition: Initialization Comparison

| Method | Variance | Best For | Formula |
|---|---|---|---|
| Xavier/Glorot | $\frac{2}{n_{\text{in}} + n_{\text{out}}}$ | Sigmoid, Tanh | $\mathcal{N}(0, \frac{2}{n_{\text{in}}+n_{\text{out}}})$ |
| He/Kaiming | $\frac{2}{n_{\text{in}}}$ | ReLU, Leaky ReLU | $\mathcal{N}(0, \frac{2}{n_{\text{in}}})$ |
| Orthogonal | Gain of 1 (all singular values equal 1) | RNNs | Orthogonal matrix |
| LSUV | Data-driven | Deep nets | Layer-wise normalization |

Orthogonal Initialization

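Definition: Orthogonal Initialization

For RNNs, orthogonal initialization sets the recurrent weight matrix to an orthogonal matrix:

$$\mathbf{W}_{hh} = \mathbf{Q} \quad \text{where} \quad \mathbf{Q}^T \mathbf{Q} = \mathbf{I}$$

Because an orthogonal matrix preserves vector norms, $\|\mathbf{W}_{hh}^t\| = 1$ (spectral norm) for all $t$. Repeated multiplication by the recurrent matrix therefore neither shrinks nor amplifies the gradient through time, apart from the effect of the nonlinearity.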
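
One common way to build such a matrix is a QR decomposition of a random Gaussian matrix. The sketch below is a minimal NumPy version; the helper name `orthogonal`, the hidden size of 128, and the 100 time steps are assumptions for illustration. It checks that repeated multiplication by the transposed recurrent matrix, the linear part of backpropagation through time, leaves the norm of a vector unchanged.

```python
import numpy as np

def orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    # Flip column signs so the diagonal of R is positive (removes the QR sign ambiguity)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
n, steps = 128, 100
W_hh = orthogonal(n, rng)

g = rng.standard_normal(n)            # stand-in for a gradient vector
norm_before = np.linalg.norm(g)
for _ in range(steps):
    g = W_hh.T @ g                    # linear part of one backward step in time
norm_after = np.linalg.norm(g)

print(f"norm before: {norm_before:.6f}, after {steps} steps: {norm_after:.6f}")
```

The nonlinearity's derivative is ignored here, so this shows only the contribution of the recurrent matrix itself.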


LSUV (Layer-Sequential Unit-Variance)

Definition: LSUV Initialization

LSUV is a data-driven initialization method:

  1. Initialize all weights to orthogonal matrices
  2. For each layer (in order):
    • Forward pass a batch of data
    • Compute the variance of the layer's output
    • Scale weights: $\mathbf{W} \leftarrow \mathbf{W} / \sqrt{\text{Var}(\text{output})}$
    • Repeat until the output variance is close to 1.0 (within a small tolerance)

LSUV measures the variance directly on data, so it applies to a wide range of architectures and does not require an analytic variance derivation for each activation function.
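
The steps above can be sketched for a plain fully connected network. This is a simplified NumPy illustration, not a reference implementation: the helper names, the tanh activation, the tolerance, and the choice to measure variance on each layer's pre-activation output are all assumptions.

```python
import numpy as np

def orthonormal(n_in, n_out, rng):
    """Matrix of shape (n_in, n_out) with orthonormal columns (or rows if n_in < n_out)."""
    A = rng.standard_normal((max(n_in, n_out), min(n_in, n_out)))
    Q, _ = np.linalg.qr(A)                       # reduced QR: orthonormal columns
    return Q if n_in >= n_out else Q.T

def lsuv_init(layer_sizes, X, activation=np.tanh, tol=0.01, max_iter=10, seed=0):
    """Orthonormal pre-init, then rescale each layer until its output variance is ~1."""
    rng = np.random.default_rng(seed)
    weights = [orthonormal(a, b, rng) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
    h = X
    for i in range(len(weights)):
        for _ in range(max_iter):
            var = (h @ weights[i]).var()         # variance of this layer's output
            if abs(var - 1.0) < tol:
                break
            weights[i] = weights[i] / np.sqrt(var)
        h = activation(h @ weights[i])           # feed the next layer
    return weights

def layer_variances(weights, X, activation=np.tanh):
    """Pre-activation variance of every layer for the batch X."""
    h, out = X, []
    for W in weights:
        z = h @ W
        out.append(float(z.var()))
        h = activation(z)
    return out

rng = np.random.default_rng(1)
X = 3.0 * rng.standard_normal((256, 64))         # deliberately not unit variance
weights = lsuv_init([64, 128, 128, 32], X)
print([round(v, 3) for v in layer_variances(weights, X)])
```

For a dense layer the output scales linearly with the weights, so a single rescale already brings the variance to 1; the inner loop is kept to mirror the iterative description.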


Initialization for ResNets

Definition: ResNet Initialization Trick

A common initialization trick for ResNets treats residual blocks specially:

  1. Initialize all residual block weights with He initialization
  2. Set the scale of the last batch normalization in each residual block to $\gamma = 0$

With the batch norm shift $\beta$ also at its usual initial value of zero, the residual branch of each block outputs zero at initialization, so each block reduces to the identity mapping and the network behaves like a much shallower one. As training progresses, $\gamma$ learns to be non-zero.

$$\mathbf{h}^{(l+1)} = \mathbf{h}^{(l)} + \text{BN}(\mathcal{F}(\mathbf{h}^{(l)})) \quad \text{with } \gamma_{\text{last}} = 0$$

Figure: ResNet zero-init, residual block at initialization. The input $\mathbf{h}^{(l)}$ follows the identity (skip) path and the residual path $\mathcal{F}(\mathbf{h}^{(l)})$: Conv → BN → ReLU → Conv → BN ($\gamma = 0$); the two are added to give $\mathbf{h}^{(l+1)}$. At initialization the residual branch outputs 0, so $\mathbf{h}^{(l+1)} = \mathbf{h}^{(l)}$ (identity mapping).
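
The identity behaviour at initialization can be verified with a toy residual block. The NumPy sketch below makes several simplifying assumptions: dense layers stand in for convolutions, batch norm uses batch statistics only, there is no ReLU after the addition (matching the equation above), and the names `init_block` and `residual_block` are hypothetical.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis with learnable scale and shift."""
    mean, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def init_block(n, rng, zero_init_last_bn=True):
    """He-initialized weights; the last BN scale is zero when zero_init_last_bn is set."""
    he = np.sqrt(2.0 / n)
    return {
        "W1": rng.standard_normal((n, n)) * he,
        "gamma1": np.ones(n), "beta1": np.zeros(n),
        "W2": rng.standard_normal((n, n)) * he,
        "gamma2": np.zeros(n) if zero_init_last_bn else np.ones(n),
        "beta2": np.zeros(n),
    }

def residual_block(h, p):
    """h + BN(F(h)) with F = dense -> BN -> ReLU -> dense."""
    z = np.maximum(batch_norm(h @ p["W1"], p["gamma1"], p["beta1"]), 0.0)
    z = batch_norm(z @ p["W2"], p["gamma2"], p["beta2"])
    return h + z

def forward(h, blocks):
    for p in blocks:
        h = residual_block(h, p)
    return h

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16))
zero_init = [init_block(16, rng, zero_init_last_bn=True) for _ in range(20)]
standard = [init_block(16, rng, zero_init_last_bn=False) for _ in range(20)]

print("zero-init output equals input:", np.allclose(forward(x, zero_init), x))
print("standard-init output variance:", forward(x, standard).var())
```

With $\gamma = 0$ in the last batch norm, twenty stacked blocks return the input unchanged. With $\gamma = 1$, each block adds a unit-variance branch output, so the activation variance grows with depth.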

Practical Recommendations

Initialization Selection Guide

  • CNNs: He (Kaiming) + ReLU; ResNet: zero-init BN; EfficientNet: NAS-tuned
  • Transformers: Xavier for embeddings; $1/\sqrt{d}$ for projections; output: small init (0.02)
  • RNNs/LSTMs: orthogonal recurrent weights; Xavier input→hidden; identity forget gate

PyTorch Default Initialization

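PyTorch uses Kaiming uniform by default for linear and convolutional layers. The default call uses a non-zero negative-slope argument, so weights are drawn from $\mathcal{U}(-1/\sqrt{n_{\text{in}}}, 1/\sqrt{n_{\text{in}}})$, which is a smaller variance than the ReLU-specific $2/n_{\text{in}}$. For many deep learning tasks, this is sufficient. Override only when: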

  • Using non-ReLU activations (use Xavier for sigmoid/tanh)
  • Training very deep networks (need special tricks)
  • Working with RNNs (need orthogonal initialization)
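
As a worked example of what the default amounts to, the derivation below assumes that the linear layer calls Kaiming uniform with negative-slope argument $a = \sqrt{5}$ and fan-in mode, as recent PyTorch versions do. That detail is not stated in this lesson, so treat it as an assumption and check it against the version you use.

```latex
\begin{aligned}
\text{gain} &= \sqrt{\frac{2}{1 + a^2}} = \sqrt{\frac{2}{1 + 5}} = \frac{1}{\sqrt{3}} \\
\sigma &= \frac{\text{gain}}{\sqrt{n_{\text{in}}}} = \frac{1}{\sqrt{3\, n_{\text{in}}}} \\
\text{bound} &= \sqrt{3}\,\sigma = \frac{1}{\sqrt{n_{\text{in}}}} \\
W &\sim \mathcal{U}\left(-\frac{1}{\sqrt{n_{\text{in}}}}, \frac{1}{\sqrt{n_{\text{in}}}}\right),
\qquad \text{Var}(W) = \frac{\text{bound}^2}{3} = \frac{1}{3\, n_{\text{in}}}
\end{aligned}
```

Under this assumption the default variance is one sixth of He's $2/n_{\text{in}}$, which is one reason to set the initialization explicitly for deep ReLU stacks that have no normalization layers.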

Summary

  • Xavier/Glorot for sigmoid/tanh: preserves variance with $\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$
  • He/Kaiming for ReLU: doubles the variance to compensate for ReLU, $\text{Var}(W) = \frac{2}{n_{\text{in}}}$
  • Orthogonal for RNNs: preserves gradient magnitude through time
  • Zero-init for ResNet last BN: ensures identity mapping at initialization
  • LSUV: data-driven, works for any architecture

