Sequence Models

RNNs Deep Dive — Understanding Recurrent Neural Networks

Recurrent Neural Networks process sequential data by maintaining a hidden state that carries information across time steps. This tutorial covers vanilla RNNs and their fundamental limitations.

BPTT Unrolls Through Time: Backpropagation through time treats the RNN as a deep feedforward network with one layer per time step
Vanishing Gradients Limit Memory: The product of Jacobians can shrink exponentially, limiting effective memory to roughly 10-20 steps
Orthogonal Initialization Helps: Setting the recurrent weights to an orthogonal matrix helps preserve gradient magnitude

RNN Deep Dive — Vanilla RNN, BPTT and Vanishing/Exploding Gradients

Why Recurrence?

Definition: Sequential Data

Many data types are inherently sequential:

Text: Words follow a specific order
Speech: Audio signals are time-series
Video: Frames have temporal relationships
Time series: Stock prices, sensor readings

Feedforward networks ignore temporal structure. RNNs explicitly model it by maintaining a hidden state that evolves over time.

Vanilla RNN

Definition: Vanilla RNN

The RNN recurrence relation:

\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b})

\mathbf{y}_t = \mathbf{W}_{hy} \mathbf{h}_t

At each time step:

Take previous hidden state $\mathbf{h}_{t-1}$
Combine with current input $\mathbf{x}_t$
Apply tanh activation
Output prediction $\mathbf{y}_t$

The same parameters ($\mathbf{W}_{hh}, \mathbf{W}_{xh}, \mathbf{W}_{hy}, \mathbf{b}$) are shared across all time steps. This weight sharing keeps the parameter count independent of sequence length and lets the model handle variable-length sequences.
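The recurrence can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `rnn_forward`, the layer sizes, and the random initialization scale are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, h0, W_hh, W_xh, W_hy, b):
    """Run a vanilla RNN over one sequence.

    xs: array of shape (T, input_dim); h0: array of shape (hidden_dim,).
    Returns hidden states (T, hidden_dim) and outputs (T, output_dim).
    """
    h = h0
    hs, ys = [], []
    for x_t in xs:
        # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        hs.append(h)
        # y_t = W_hy h_t
        ys.append(W_hy @ h)
    return np.stack(hs), np.stack(ys)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 5, 2
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b = np.zeros(hidden_dim)

# The same weights process sequences of different lengths.
for T in (4, 9):
    xs = rng.normal(size=(T, input_dim))
    hs, ys = rnn_forward(xs, np.zeros(hidden_dim), W_hh, W_xh, W_hy, b)
    print(T, hs.shape, ys.shape)
```

Because the loop reuses one set of weight matrices at every step, nothing in the function depends on the sequence length.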

Backpropagation Through Time (BPTT)

Definition: BPTT

BPTT unrolls the RNN through time and applies standard backpropagation to the unrolled graph.

The total loss sums the per-step losses over a sequence of length $T$:

\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t(\mathbf{y}_t, \hat{\mathbf{y}}_t)

Gradients flow backward through the unrolled graph:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial \mathbf{W}_{hh}}

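The key challenge: each term $\partial \mathcal{L}_t / \partial \mathbf{W}_{hh}$ depends on every earlier hidden state, so it contains factors $\partial \mathbf{h}_t / \partial \mathbf{h}_k$ for $k < t$, and each such factor involves repeated multiplication by $\mathbf{W}_{hh}$.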
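A minimal NumPy sketch of BPTT for the gradient with respect to $\mathbf{W}_{hh}$ is given here, checked against a finite-difference estimate. The squared-error per-step loss, the `targets` array, and all function names are assumptions made for illustration; the tutorial does not fix a particular loss.

```python
import numpy as np

def forward(xs, W_hh, W_xh, W_hy, b):
    """Return hidden states [h_0, ..., h_T] and outputs [y_1, ..., y_T]."""
    h = np.zeros(W_hh.shape[0])
    hs, ys = [h], []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b)
        hs.append(h)
        ys.append(W_hy @ h)
    return hs, ys

def total_loss(ys, targets):
    # Assumed per-step loss: 0.5 * squared error, summed over all steps.
    return 0.5 * sum(np.sum((y - t) ** 2) for y, t in zip(ys, targets))

def bptt_grad_W_hh(xs, targets, W_hh, W_xh, W_hy, b):
    """Gradient of the total loss with respect to W_hh via BPTT."""
    hs, ys = forward(xs, W_hh, W_xh, W_hy, b)
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros(W_hh.shape[0])      # gradient flowing back from step t + 1
    for t in reversed(range(len(xs))):
        dh = W_hy.T @ (ys[t] - targets[t]) + dh_next
        dz = (1.0 - hs[t + 1] ** 2) * dh   # tanh'(z) = 1 - tanh(z)^2
        dW_hh += np.outer(dz, hs[t])       # hs[t] is the previous hidden state
        dh_next = W_hh.T @ dz              # repeated multiplication by W_hh
    return dW_hh

rng = np.random.default_rng(0)
T, n_in, n_h, n_out = 6, 3, 4, 2
xs = rng.normal(size=(T, n_in))
targets = rng.normal(size=(T, n_out))
W_hh = rng.normal(scale=0.5, size=(n_h, n_h))
W_xh = rng.normal(scale=0.5, size=(n_h, n_in))
W_hy = rng.normal(scale=0.5, size=(n_out, n_h))
b = np.zeros(n_h)

analytic = bptt_grad_W_hh(xs, targets, W_hh, W_xh, W_hy, b)

# Central finite difference on a single entry of W_hh as a sanity check.
eps = 1e-6
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (total_loss(forward(xs, W_plus, W_xh, W_hy, b)[1], targets)
           - total_loss(forward(xs, W_minus, W_xh, W_hy, b)[1], targets)) / (2 * eps)
print(abs(analytic[1, 2] - numeric))
```

The line `dh_next = W_hh.T @ dz` is where the repeated multiplication by $\mathbf{W}_{hh}$ enters: the backward signal passes through it once per time step.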

Vanishing and Exploding Gradients

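Definition: Vanishing Gradients

The gradient through $T$ time steps involves a product of $T-1$ Jacobians, where $\mathbf{z}_t = \mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}$ is the pre-activation at step $t$ and $\sigma$ is the activation function ($\tanh$ for the vanilla RNN):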

\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1} = \prod_{t=2}^{T} \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \prod_{t=2}^{T} \text{diag}(\sigma'(\mathbf{z}_t)) \cdot \mathbf{W}_{hh}

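Since $|\tanh'(z)| \leq 1$, if the largest singular value of $\mathbf{W}_{hh}$ is $< 1$, every factor has norm below 1 and the product shrinks exponentially → vanishing gradients.

If the largest eigenvalue magnitude (spectral radius) of $\mathbf{W}_{hh}$ is $> 1$, the product can grow exponentially → exploding gradients, although saturated $\tanh$ units may damp this growth.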
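The shrinking product can be observed numerically. The sketch below builds $\mathbf{W}_{hh}$ as a scaled orthogonal matrix, so every singular value equals the chosen scale, and multiplies the per-step Jacobians of a tanh RNN. The random vector standing in for the input term and the helper names are assumptions made for this example.

```python
import numpy as np

def random_orthogonal(n, rng):
    # QR of a Gaussian matrix yields an orthogonal Q (all singular values are 1).
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def jacobian_product_norm(W_hh, T, rng):
    """Spectral norm of d h_T / d h_1 for a tanh RNN driven by random inputs."""
    n = W_hh.shape[0]
    h = np.zeros(n)
    J = np.eye(n)
    for t in range(1, T + 1):
        # A random vector stands in for W_xh x_t + b (an assumption of this sketch).
        h = np.tanh(W_hh @ h + rng.normal(size=n))
        if t >= 2:
            # d h_t / d h_{t-1} = diag(tanh'(z_t)) W_hh, with tanh'(z) = 1 - tanh(z)^2
            J = np.diag(1.0 - h ** 2) @ W_hh @ J
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
Q = random_orthogonal(8, rng)
for scale in (0.5, 0.9, 1.5):
    norms = [jacobian_product_norm(scale * Q, T, rng) for T in (2, 10, 20, 40)]
    print(scale, ["%.2e" % v for v in norms])
```

With a scale below 1 the norm is guaranteed to fall at least as fast as `scale ** (T - 1)`. With a scale above 1 the outcome depends on how strongly the tanh units saturate, so growth is possible but not guaranteed.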

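Theorem: Gradient Magnitude

For a linear RNN ($\sigma = \text{identity}$), each Jacobian equals $\mathbf{W}_{hh}$, so the product has $T-1$ identical factors:

\left\|\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1}\right\| = \left\|\mathbf{W}_{hh}^{T-1}\right\| \leq \|\mathbf{W}_{hh}\|^{T-1}

If $\|\mathbf{W}_{hh}\| < 1$, the gradient vanishes at least as fast as $O(\|\mathbf{W}_{hh}\|^{T-1})$. If $\|\mathbf{W}_{hh}\| > 1$, the bound allows growth up to $O(\|\mathbf{W}_{hh}\|^{T-1})$; the gradient actually explodes when the spectral radius of $\mathbf{W}_{hh}$ exceeds 1.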
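A quick numerical check of the linear case, under stated assumptions: the product from $t = 2$ to $T$ has $T - 1$ factors, so the sketch compares the spectral norm of $\mathbf{W}_{hh}^{T-1}$ with the bound $\|\mathbf{W}_{hh}\|^{T-1}$. The matrix size, the sequence length, and the helper names are hypothetical.

```python
import numpy as np

def linear_gradient_norm(W_hh, T):
    # For a linear RNN, d h_T / d h_1 is W_hh multiplied by itself T - 1 times.
    return np.linalg.norm(np.linalg.matrix_power(W_hh, T - 1), 2)

def rescale_to_norm(A, s):
    # Scale A so that its spectral norm (largest singular value) equals s.
    return A * (s / np.linalg.norm(A, 2))

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
T = 51
for s in (0.9, 1.1):
    W_hh = rescale_to_norm(A, s)
    bound = s ** (T - 1)
    rho = np.max(np.abs(np.linalg.eigvals(W_hh)))   # spectral radius
    actual = linear_gradient_norm(W_hh, T)
    print(f"norm={s}  radius={rho:.3f}  actual={actual:.3e}  bound={bound:.3e}")
```

The norm only bounds the gradient from above. For a random Gaussian matrix the spectral radius is often well below the spectral norm, so the actual value can sit far under the bound even when the norm is slightly above 1.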

Solutions to Vanishing Gradients

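Definition: Mitigating Vanishing Gradients

LSTM/GRU: Gating mechanisms preserve gradient flow
Orthogonal initialization: All singular values of $\mathbf{W}_{hh}$ equal 1, so $\|\mathbf{W}_{hh}\| = 1$ and gradient magnitude is preserved through the linear part of the recurrence
Gradient clipping: Prevents exploding gradients (it does not help with vanishing gradients)
Skip connections: Additive shortcuts (like ResNet)
Teacher forcing: Feed ground-truth previous outputs as inputs during training; this stabilizes training rather than fixing vanishing gradients directly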
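Orthogonal initialization can be sketched with a QR decomposition. The function name `orthogonal_init`, the matrix size, and the sign correction step are choices made for this example. The check at the end covers only the linear part of the recurrence; the tanh derivative still shrinks gradients in a real RNN.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Return an n x n orthogonal matrix drawn via QR decomposition."""
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    # Fix column signs so the result does not depend on the QR sign convention.
    q = q * np.sign(np.diag(r))
    return q

rng = np.random.default_rng(0)
W_hh = orthogonal_init(16, rng)

g = rng.normal(size=16)
v = g.copy()
for _ in range(100):
    v = W_hh.T @ v   # backward pass through 100 linear recurrent steps

print(np.linalg.norm(g), np.linalg.norm(v))
```

The two printed norms agree up to floating-point error, because multiplying by an orthogonal matrix leaves the Euclidean norm unchanged.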

Definition: Gradient Clipping

\mathbf{g} \leftarrow \begin{cases} \mathbf{g} & \text{if } \|\mathbf{g}\| \leq \theta \\ \frac{\theta}{\|\mathbf{g}\|} \mathbf{g} & \text{if } \|\mathbf{g}\| > \theta \end{cases}

Clipping by norm (typically $\theta = 1.0$ or $5.0$ ) prevents exploding gradients without affecting gradient direction.
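The clipping rule translates directly into code. The sketch below treats the gradients of all parameters as one concatenated vector (a common convention, assumed here rather than taken from the text); the function name `clip_by_global_norm` is hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, theta):
    """Rescale a list of gradient arrays so their joint L2 norm is at most theta."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= theta:
        return grads, total_norm          # small gradients pass through unchanged
    scale = theta / total_norm
    return [g * scale for g in grads], total_norm

# Joint norm is sqrt(3^2 + 4^2 + 12^2) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, theta=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Every array is multiplied by the same scalar, so the direction of the overall gradient is unchanged and only its length is reduced.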

Applications

Definition: RNN Applications

Task	Architecture	Description
Language modeling	RNN/LSTM	Predict next word
Machine translation	Seq2Seq	Encoder-decoder with attention
Sentiment analysis	RNN	Classify text sentiment
Speech recognition	LSTM	Audio to text
Music generation	RNN	Generate musical sequences
Time series forecasting	LSTM	Predict future values

Practical Considerations

RNN Training Tips

Gradient clipping: Always use for RNNs (norm clipping, $\theta = 1.0$ )
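Teacher forcing: Use during training; scheduled sampling, which gradually replaces ground-truth inputs with the model's own predictions during training, reduces the mismatch with inference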
Hidden state initialization: Usually zeros, or learned
Truncated BPTT: For very long sequences, truncate gradient flow (e.g., 100 steps)
Pack padded sequences: Handle variable-length sequences efficiently
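The idea behind handling padded variable-length sequences can be illustrated with a mask: padded steps must not change the hidden state. The sketch below is a NumPy illustration under that assumption, with hypothetical helper names; it is not the packing mechanism of any particular framework.

```python
import numpy as np

def pad_sequences(seqs, input_dim):
    """Stack variable-length sequences into (batch, T_max, input_dim) plus lengths."""
    lengths = np.array([len(s) for s in seqs])
    batch = np.zeros((len(seqs), lengths.max(), input_dim))
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
    return batch, lengths

def masked_rnn_final_state(batch, lengths, W_hh, W_xh, b):
    """Final hidden state per sequence; padded steps leave the state untouched."""
    B, T_max, _ = batch.shape
    h = np.zeros((B, W_hh.shape[0]))
    for t in range(T_max):
        h_new = np.tanh(h @ W_hh.T + batch[:, t] @ W_xh.T + b)
        mask = (t < lengths)[:, None]     # True while the sequence is still running
        h = np.where(mask, h_new, h)
    return h

rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.3, size=(4, 4))
W_xh = rng.normal(scale=0.3, size=(4, 3))
b = np.zeros(4)

seqs = [rng.normal(size=(length, 3)) for length in (2, 5, 3)]
batch, lengths = pad_sequences(seqs, 3)
h_final = masked_rnn_final_state(batch, lengths, W_hh, W_xh, b)
print(batch.shape, lengths, h_final.shape)
```

Each row of `h_final` matches what the RNN would produce if that sequence were processed on its own, without padding.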

Limitations of Vanilla RNNs

Vanishing gradients limit effective memory to roughly 10-20 time steps in practice
Long-range dependencies beyond that horizon are hard to learn
Sequential computation prevents parallelization
tanh saturation drives $\tanh'(z)$ toward 0, which shrinks gradients further
For long sequences, use LSTM/GRU or Transformers

Summary

Vanilla RNN maintains a hidden state that evolves over time through recurrence
BPTT unrolls the RNN and applies backpropagation through the unrolled graph
Vanishing gradients limit vanilla RNNs to ~10-20 time steps
Exploding gradients can be mitigated with gradient clipping
Solutions: LSTM/GRU (gating), orthogonal initialization, skip connections
Transformers have largely replaced RNNs for most sequence tasks due to parallelization

Next: LSTM Networks
