RNN Deep Dive — Vanilla RNN, BPTT and Vanishing/Exploding Gradients

RNNs Deep Dive — Understanding Recurrent Neural Networks

Recurrent Neural Networks process sequential data by maintaining a hidden state that carries information across time steps. This tutorial covers vanilla RNNs and their fundamental limitations.

  • BPTT unrolls through time: backpropagation through time treats the RNN as a deep feedforward network with shared weights
  • Vanishing gradients limit memory: the product of Jacobians can shrink exponentially, limiting effective memory to roughly 10-20 steps
  • Orthogonal initialization helps: initializing the recurrent weights as an orthogonal matrix preserves gradient magnitude through the linear part of the recurrence


Why Recurrence?

Definition: Sequential Data

Many data types are inherently sequential:

  • Text: Words follow a specific order
  • Speech: Audio signals are time-series
  • Video: Frames have temporal relationships
  • Time series: Stock prices, sensor readings

Feedforward networks ignore temporal structure. RNNs explicitly model it by maintaining a hidden state that evolves over time.


Vanilla RNN

Definition: Vanilla RNN

The RNN recurrence relation:

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b})$$

$$\mathbf{y}_t = \mathbf{W}_{hy} \mathbf{h}_t$$

At each time step:

  1. Take the previous hidden state $\mathbf{h}_{t-1}$
  2. Combine it with the current input $\mathbf{x}_t$
  3. Apply the tanh activation
  4. Output the prediction $\mathbf{y}_t$

The same parameters ($\mathbf{W}_{hh}, \mathbf{W}_{xh}, \mathbf{b}$) are shared across all time steps. This weight sharing provides parameter efficiency and allows the model to generalize to variable-length sequences.
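The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `rnn_forward`, the layer sizes, and the random initialization scale are all assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(xs, h0, W_hh, W_xh, W_hy, b):
    """Run a vanilla RNN over one sequence.

    xs: array of shape (T, input_dim); h0: array of shape (hidden_dim,).
    Returns hidden states (T, hidden_dim) and outputs (T, output_dim).
    """
    h = h0
    hs, ys = [], []
    for x_t in xs:
        # Steps 1-3: combine previous state and current input, then apply tanh.
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        hs.append(h)
        # Step 4: linear readout of the prediction.
        ys.append(W_hy @ h)
    return np.stack(hs), np.stack(ys)

# Hypothetical sizes and weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 5, 2
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hy = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
b = np.zeros(hidden_dim)
h0 = np.zeros(hidden_dim)

# Weight sharing: the same parameters handle sequences of different lengths.
hs_short, ys_short = rnn_forward(rng.normal(size=(4, input_dim)), h0, W_hh, W_xh, W_hy, b)
hs_long, ys_long = rnn_forward(rng.normal(size=(9, input_dim)), h0, W_hh, W_xh, W_hy, b)
print(hs_short.shape, ys_short.shape)  # (4, 5) (4, 2)
print(hs_long.shape, ys_long.shape)    # (9, 5) (9, 2)
```

Note that no parameter depends on the sequence length: only the number of loop iterations changes.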

Figure: RNN unrolled through time (t = 1 to 4). The same cell, with parameters W_hh and W_xh shared across all time steps, maps each input x₁…x₄ and the previous hidden state to the hidden states h₁…h₄ and outputs y₁…y₄.

Backpropagation Through Time (BPTT)

Definition: BPTT

BPTT unrolls the RNN through time and applies standard backpropagation to the unrolled graph.

The total loss over a sequence of length $T$ is the sum of the per-step losses:

$$\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t(\mathbf{y}_t, \hat{\mathbf{y}}_t)$$

Gradients flow backward through the unrolled graph:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial \mathbf{W}_{hh}}$$

The key challenge: each $\mathcal{L}_t$ depends on $\mathbf{W}_{hh}$ through every earlier hidden state, so gradients through the recurrent connection involve repeated multiplication by $\mathbf{W}_{hh}$.
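To make the backward pass concrete, here is a small NumPy sketch of BPTT for the gradient with respect to the recurrent weights, checked against finite differences. The squared-error loss, the zero initial state, the function names, and the tiny sizes are assumptions made for the example; the tutorial does not fix a particular loss.

```python
import numpy as np

def forward(xs, W_hh, W_xh, b):
    """Return [h_0, h_1, ..., h_T] with h_0 = 0."""
    hs = [np.zeros(W_hh.shape[0])]
    for x_t in xs:
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x_t + b))
    return hs

def total_loss(xs, targets, W_hh, W_xh, W_hy, b):
    """Sum over t of 0.5 * ||y_t - target_t||^2 (assumed loss)."""
    hs = forward(xs, W_hh, W_xh, b)
    return sum(0.5 * np.sum((W_hy @ h - y) ** 2) for h, y in zip(hs[1:], targets))

def bptt_grad_W_hh(xs, targets, W_hh, W_xh, W_hy, b):
    """Gradient of the total loss w.r.t. W_hh via backpropagation through time."""
    hs = forward(xs, W_hh, W_xh, b)
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros(W_hh.shape[0])          # gradient arriving from step t+1
    for t in range(len(xs), 0, -1):            # walk backward: t = T, ..., 1
        dy = W_hy @ hs[t] - targets[t - 1]     # dL_t / dy_t
        dh = W_hy.T @ dy + dh_next             # local loss + future losses
        dz = (1.0 - hs[t] ** 2) * dh           # through tanh: tanh'(z) = 1 - h^2
        dW_hh += np.outer(dz, hs[t - 1])       # contribution of step t
        dh_next = W_hh.T @ dz                  # pass gradient to h_{t-1}
    return dW_hh

def numerical_grad_W_hh(xs, targets, W_hh, W_xh, W_hy, b, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = np.zeros_like(W_hh)
    for idx in np.ndindex(*W_hh.shape):
        W_plus, W_minus = W_hh.copy(), W_hh.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (total_loss(xs, targets, W_plus, W_xh, W_hy, b)
                     - total_loss(xs, targets, W_minus, W_xh, W_hy, b)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 6, 3, 4, 2
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hy = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
b = np.zeros(hidden_dim)
xs = rng.normal(size=(T, input_dim))
targets = rng.normal(size=(T, output_dim))

analytic = bptt_grad_W_hh(xs, targets, W_hh, W_xh, W_hy, b)
numeric = numerical_grad_W_hh(xs, targets, W_hh, W_xh, W_hy, b)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

The line `dh_next = W_hh.T @ dz` is where the repeated multiplication by the recurrent weights enters: it is applied once for every step the gradient travels back.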

Figure: BPTT gradient flow through time. The forward pass runs left to right through h₁…h₄ via W_hh; the backward pass runs right to left from ∂L₄ to ∂L₁, multiplying by W_hhᵀ at each step, so a gradient that crosses T steps accumulates repeated factors of W_hhᵀ.

Vanishing and Exploding Gradients

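Definition: Vanishing Gradients

The gradient of $\mathbf{h}_T$ with respect to $\mathbf{h}_1$ is a product of $T-1$ Jacobians:

$$\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1} = \prod_{t=2}^{T} \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \prod_{t=2}^{T} \text{diag}(\sigma'(\mathbf{z}_t)) \cdot \mathbf{W}_{hh}$$

where $\mathbf{z}_t = \mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}$ is the pre-activation and $\sigma = \tanh$, so $|\sigma'(\mathbf{z}_t)| \leq 1$.

If the largest singular value of $\mathbf{W}_{hh}$ is $< 1$, every factor has norm below 1 and the product shrinks exponentially → vanishing gradients.

If the largest singular value is $> 1$, the product can grow exponentially → exploding gradients. This condition is necessary but not sufficient, because saturated tanh units shrink the Jacobian.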
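The product of Jacobians can be computed directly for a small tanh RNN. The sketch below is illustrative only: it assumes zero bias, a zero initial state, and recurrent weights built as a scaled orthogonal matrix so that the largest singular value is known exactly. The function name and sizes are hypothetical.

```python
import numpy as np

def jacobian_product_norm(W_hh, W_xh, xs):
    """Spectral norm of dh_T/dh_1 for a tanh RNN (zero bias, h_0 = 0)."""
    n = W_hh.shape[0]
    h = np.zeros(n)
    J = np.eye(n)
    for t, x_t in enumerate(xs, start=1):
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        if t >= 2:
            # dh_t/dh_{t-1} = diag(tanh'(z_t)) @ W_hh, with tanh'(z_t) = 1 - h_t^2.
            J = (np.diag(1.0 - h ** 2) @ W_hh) @ J
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
n, input_dim, T = 6, 3, 25
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthogonal: all singular values are 1
W_xh = rng.normal(scale=0.1, size=(n, input_dim))
xs = rng.normal(size=(T, input_dim))

norm_small = jacobian_product_norm(0.9 * Q, W_xh, xs)   # largest singular value 0.9
norm_large = jacobian_product_norm(1.1 * Q, W_xh, xs)   # largest singular value 1.1
print(f"scale 0.9: {norm_small:.3e} (upper bound {0.9 ** (T - 1):.3e})")
print(f"scale 1.1: {norm_large:.3e} (upper bound {1.1 ** (T - 1):.3e})")
```

In the 0.9 case the norm is guaranteed to sit at or below 0.9^(T-1). In the 1.1 case the bound only says growth is possible; how much actually occurs depends on how saturated the tanh units are along the trajectory.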

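Theorem: Gradient Magnitude

For a linear RNN ($\sigma = \text{identity}$), the product above has $T-1$ identical factors $\mathbf{W}_{hh}$, so:

$$\left\|\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1}\right\| = \left\|\mathbf{W}_{hh}^{T-1}\right\| \leq \|\mathbf{W}_{hh}\|^{T-1}$$

If $\|\mathbf{W}_{hh}\| < 1$, the gradient vanishes as $O(\|\mathbf{W}_{hh}\|^T)$. If $\|\mathbf{W}_{hh}\| > 1$, the bound grows as $O(\|\mathbf{W}_{hh}\|^T)$, so the gradient can explode; it does explode when the spectral radius of $\mathbf{W}_{hh}$ exceeds 1.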
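A quick numerical check of the linear case, where the gradient from step T back to step 1 is a product of T-1 copies of the recurrent matrix. The sketch assumes scaled orthogonal matrices (for which the bound is tight) plus one generic random matrix (for which only the inequality holds); sizes and names are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 8, 20

# Random orthogonal matrix Q (all singular values equal to 1), via QR.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

def linear_gradient_norm(W, T):
    """Spectral norm of dh_T/dh_1 = W^(T-1) for a linear RNN."""
    return np.linalg.norm(np.linalg.matrix_power(W, T - 1), 2)

results = {}
for scale in (0.9, 1.0, 1.1):
    results[scale] = linear_gradient_norm(scale * Q, T)
    print(f"scale={scale}: norm={results[scale]:.3f}, bound={scale ** (T - 1):.3f}")

# A generic (non-orthogonal) matrix satisfies the inequality, usually not tightly.
W_generic = rng.normal(scale=0.3, size=(n, n))
lhs = linear_gradient_norm(W_generic, T)
rhs = np.linalg.norm(W_generic, 2) ** (T - 1)
print(f"generic W: norm={lhs:.3e} <= bound={rhs:.3e}")
```

With T = 20 the scaled orthogonal cases give about 0.135 for scale 0.9 and about 6.116 for scale 1.1, which is the exponential shrinkage and growth described above after only 19 multiplications.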

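Figure: Vanishing vs. exploding gradients. Gradient magnitude plotted against the number of time steps T for ‖W‖ = 1 (stable), ‖W‖ = 0.9 (vanishing), and ‖W‖ = 1.1 (exploding). Around 20 steps the vanishing gradient becomes too small for effective learning, which marks the practical limit.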

Solutions to Vanishing Gradients

Definition: Mitigating Vanishing Gradients

  1. LSTM/GRU: Gating mechanisms preserve gradient flow
  2. Orthogonal initialization: An orthogonal $\mathbf{W}_{hh}$ has all singular values equal to 1 (so $\|\mathbf{W}_{hh}\| = 1$), which preserves gradient magnitude through the linear part of the recurrence
  3. Gradient clipping: Prevents exploding gradients
  4. Skip connections: Additive shortcuts (like ResNet)
  5. Teacher forcing: Feed ground-truth previous outputs as inputs during training
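Orthogonal initialization (item 2) can be sketched with a QR decomposition. This is one common construction, shown here as an assumption rather than the method any particular library uses; the helper name `orthogonal_init` is hypothetical.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Return a random n x n orthogonal matrix built from a QR decomposition."""
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    # Flip column signs using the diagonal of R so the result does not
    # depend on the sign convention of the QR routine.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
n = 16
W_hh = orthogonal_init(n, rng)

# All singular values are 1, so multiplying by W_hh (or its transpose)
# leaves vector norms unchanged.
singular_values = np.linalg.svd(W_hh, compute_uv=False)

# Push a gradient back through 50 linear steps: its norm is preserved.
g = rng.normal(size=n)
g_back = g.copy()
for _ in range(50):
    g_back = W_hh.T @ g_back
print(np.linalg.norm(g), np.linalg.norm(g_back))
```

The preservation is exact only for the linear part of the recurrence. With tanh in the loop, the derivative factors are still at most 1, so gradients can shrink during training even from an orthogonal start.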

Definition: Gradient Clipping

$$\mathbf{g} \leftarrow \begin{cases} \mathbf{g} & \text{if } \|\mathbf{g}\| \leq \theta \\ \frac{\theta}{\|\mathbf{g}\|} \mathbf{g} & \text{if } \|\mathbf{g}\| > \theta \end{cases}$$

Clipping by norm (typically $\theta = 1.0$ or $5.0$) prevents exploding gradients without affecting gradient direction.
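The clipping rule translates almost line for line into code. The sketch below operates on a single NumPy array; the function name `clip_by_norm` is an assumption, and deep learning frameworks provide their own utilities that clip across all parameters at once.

```python
import numpy as np

def clip_by_norm(g, theta):
    """Rescale g to have norm theta if its norm exceeds theta; otherwise return it unchanged."""
    norm = np.linalg.norm(g)
    if norm > theta:
        return g * (theta / norm)
    return g

g_large = np.array([3.0, 4.0])      # norm 5.0, above the threshold
g_small = np.array([0.3, 0.4])      # norm 0.5, below the threshold

clipped_large = clip_by_norm(g_large, theta=1.0)
clipped_small = clip_by_norm(g_small, theta=1.0)

print(clipped_large, np.linalg.norm(clipped_large))
print(clipped_small, np.linalg.norm(clipped_small))
```

The large gradient is scaled down to norm 1.0 while still pointing along (3, 4); the small gradient passes through untouched.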


Applications

Definition: RNN Applications

| Task | Architecture | Description |
| --- | --- | --- |
| Language modeling | RNN/LSTM | Predict next word |
| Machine translation | Seq2Seq | Encoder-decoder with attention |
| Sentiment analysis | RNN | Classify text sentiment |
| Speech recognition | LSTM | Audio to text |
| Music generation | RNN | Generate musical sequences |
| Time series forecasting | LSTM | Predict future values |

Practical Considerations

RNN Training Tips

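  • Gradient clipping: Use by default for RNNs (norm clipping, e.g. $\theta = 1.0$)
  • Teacher forcing: Use during training; scheduled sampling, which is also a training-time technique, gradually mixes in the model's own predictions to reduce the mismatch with inference, where no ground truth is available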
  • Hidden state initialization: Usually zeros, or learned
  • Truncated BPTT: For very long sequences, truncate gradient flow (e.g., 100 steps)
  • Pack padded sequences: Handle variable-length sequences efficiently
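For the variable-length tip, the sketch below shows the masking idea that underlies packed sequences: a padded batch is processed step by step, and each sequence's hidden state is frozen once its real data ends. It is a plain NumPy illustration with hypothetical names and sizes. Library utilities such as PyTorch's `pack_padded_sequence` go further and avoid computing on the padding at all.

```python
import numpy as np

def masked_rnn_last_hidden(batch, lengths, W_hh, W_xh, b):
    """batch: (B, T_max, input_dim), zero-padded; lengths: (B,) true lengths.

    Returns the hidden state at each sequence's own last real step, shape (B, hidden_dim).
    """
    B, T_max, _ = batch.shape
    h = np.zeros((B, W_hh.shape[0]))
    for t in range(T_max):
        h_new = np.tanh(h @ W_hh.T + batch[:, t] @ W_xh.T + b)
        active = (t < lengths)[:, None]   # True where step t is real data
        h = np.where(active, h_new, h)    # finished sequences keep their state
    return h

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

# Three sequences of different lengths, padded with zeros to T_max = 5.
lengths = np.array([5, 2, 3])
batch = np.zeros((3, 5, input_dim))
for i, length in enumerate(lengths):
    batch[i, :length] = rng.normal(size=(length, input_dim))

h_batch = masked_rnn_last_hidden(batch, lengths, W_hh, W_xh, b)

# Reference: run each sequence on its own, without any padding.
max_diff = 0.0
for i, length in enumerate(lengths):
    h = np.zeros(hidden_dim)
    for x_t in batch[i, :length]:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
    max_diff = max(max_diff, np.max(np.abs(h - h_batch[i])))
print("max difference vs. unpadded runs:", max_diff)
```

Without the mask, the padded zeros would keep updating the hidden state of the shorter sequences and change their final representation.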

Limitations of Vanilla RNNs

  • Vanishing gradients limit memory to ~10-20 time steps
  • Struggle to capture long-range dependencies in practice
  • Sequential computation prevents parallelization across time steps
  • tanh saturation pushes derivatives toward zero, which worsens vanishing gradients
  • For long sequences, use LSTM/GRU or Transformers

Summary

  • Vanilla RNN maintains a hidden state that evolves over time through recurrence
  • BPTT unrolls the RNN and applies backpropagation through the unrolled graph
  • Vanishing gradients limit vanilla RNNs to ~10-20 time steps
  • Exploding gradients can be mitigated with gradient clipping
  • Solutions: LSTM/GRU (gating), orthogonal initialization, skip connections
  • Transformers have largely replaced RNNs for most sequence tasks due to parallelization

Next: LSTM Networks
