RNNs Deep Dive — Understanding Recurrent Neural Networks
Recurrent Neural Networks process sequential data by maintaining a hidden state that carries information across time steps. This tutorial covers vanilla RNNs and their fundamental limitations.
- BPTT Unrolls Through Time: Backpropagation through time treats the unrolled RNN as a deep feedforward network with shared weights
- Vanishing Gradients Limit Memory: The product of Jacobians shrinks exponentially, limiting effective memory to roughly 10-20 steps
- Orthogonal Initialization Helps: Setting the recurrent weights to an orthogonal matrix helps preserve gradient magnitude
RNN Deep Dive — Vanilla RNN, BPTT and Vanishing/Exploding Gradients
Why Recurrence?
Definition: Sequential Data
Many data types are inherently sequential:
- Text: Words follow a specific order
- Speech: Audio signals are time-series
- Video: Frames have temporal relationships
- Time series: Stock prices, sensor readings
Feedforward networks ignore temporal structure. RNNs explicitly model it by maintaining a hidden state that evolves over time.
Vanilla RNN
Definition: Vanilla RNN
The RNN recurrence relation:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \qquad y_t = W_{hy} h_t + b_y$$
At each time step:
- Take previous hidden state $h_{t-1}$
- Combine with current input $x_t$
- Apply tanh activation to get $h_t$
- Output prediction $y_t$
The same parameters ($W_{xh}$, $W_{hh}$, $W_{hy}$) are shared across all time steps. This weight sharing provides parameter efficiency and lets the model generalize to variable-length sequences.
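The recurrence can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the names `rnn_forward`, `W_xh`, `W_hh`, `W_hy`, `b_h`, `b_y` and the layer sizes are assumptions chosen for the example, and the output is left as raw scores.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence.

    xs: array of shape (T, input_dim); h0: array of shape (hidden_dim,).
    Returns hidden states (T, hidden_dim) and outputs (T, output_dim).
    """
    h = h0
    hs, ys = [], []
    for x in xs:
        # Combine the previous hidden state with the current input, then apply tanh.
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        # Output prediction (raw scores; no softmax applied here).
        y = W_hy @ h + b_y
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2  # illustrative sizes
params = dict(
    W_xh=rng.normal(scale=0.1, size=(hidden_dim, input_dim)),
    W_hh=rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
    W_hy=rng.normal(scale=0.1, size=(output_dim, hidden_dim)),
    b_h=np.zeros(hidden_dim),
    b_y=np.zeros(output_dim),
)

# The same parameters handle sequences of different lengths.
for T in (5, 12):
    xs = rng.normal(size=(T, input_dim))
    hs, ys = rnn_forward(xs, np.zeros(hidden_dim), **params)
    print(T, hs.shape, ys.shape)
```

Because the loop reuses one set of weights at every step, the parameter count does not depend on the sequence length.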
Backpropagation Through Time (BPTT)
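Definition: BPTT
BPTT unrolls the RNN through time and applies standard backpropagation to the total loss, summed over time steps:
$$L = \sum_{t=1}^{T} L_t$$
The loss at time $t$, $L_t$, compares the prediction at that step with its target.
Gradients flow backward through the unrolled graph:
$$\frac{\partial L_t}{\partial W_{hh}} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}\,\frac{\partial^{+} h_k}{\partial W_{hh}}$$
Here $W_{hh}$ is the recurrent weight matrix and $\partial^{+} h_k / \partial W_{hh}$ is the immediate derivative, treating $h_{k-1}$ as constant.
The key challenge: gradients through the recurrent connection involve repeated multiplication by $W_{hh}$, one factor per step inside $\partial h_t / \partial h_k$.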
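To make the backward pass concrete, here is a hedged sketch of BPTT for a deliberately simplified RNN, checked against finite differences. The simplifications are assumptions made for brevity and do not come from the text above: no biases, no output layer, and a squared-error loss placed directly on the hidden states. The function names are hypothetical.

```python
import numpy as np

def forward(xs, W_xh, W_hh):
    """Return hidden states h_0..h_T (h_0 is zeros) for a bias-free tanh RNN."""
    hs = [np.zeros(W_hh.shape[0])]
    for x in xs:
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x))
    return hs

def loss_fn(xs, targets, W_xh, W_hh):
    """Summed squared error between each hidden state and its target (assumed loss)."""
    hs = forward(xs, W_xh, W_hh)
    return 0.5 * sum(np.sum((h - t) ** 2) for h, t in zip(hs[1:], targets))

def bptt_grad_W_hh(xs, targets, W_xh, W_hh):
    """Gradient of the summed loss with respect to W_hh via BPTT."""
    hs = forward(xs, W_xh, W_hh)
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros(W_hh.shape[0])  # gradient arriving from step t+1
    for t in range(len(xs), 0, -1):
        dh = (hs[t] - targets[t - 1]) + dh_next  # local loss term plus future steps
        da = dh * (1.0 - hs[t] ** 2)             # back through tanh
        dW_hh += np.outer(da, hs[t - 1])         # shared weights accumulate over steps
        dh_next = W_hh.T @ da                    # one more multiplication by W_hh^T
    return dW_hh

rng = np.random.default_rng(0)
T, n_in, n_h = 6, 3, 4
xs = rng.normal(size=(T, n_in))
targets = rng.normal(size=(T, n_h))
W_xh = rng.normal(scale=0.5, size=(n_h, n_in))
W_hh = rng.normal(scale=0.5, size=(n_h, n_h))

analytic = bptt_grad_W_hh(xs, targets, W_xh, W_hh)

# Central finite differences as an independent check.
numeric = np.zeros_like(W_hh)
eps = 1e-6
for i in range(n_h):
    for j in range(n_h):
        W_plus, W_minus = W_hh.copy(), W_hh.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss_fn(xs, targets, W_xh, W_plus)
                         - loss_fn(xs, targets, W_xh, W_minus)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print("max abs difference:", max_abs_diff)
```

The line `dh_next = W_hh.T @ da` is where the repeated multiplication by the recurrent matrix enters: it runs once per time step on the way back.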
Vanishing and Exploding Gradients
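Definition: Vanishing Gradients
The gradient through $k$ time steps involves the product of one-step Jacobians (latest step on the left):
$$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{i=t-k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=t-k+1}^{t} \operatorname{diag}\!\left(1 - h_i^2\right) W_{hh}$$
where $W_{hh}$ is the recurrent weight matrix and $h_i^2$ is taken elementwise.
If the largest eigenvalue magnitude of $W_{hh}$ is $< 1$, the product tends to shrink exponentially in $k$ → vanishing gradients.
If the largest eigenvalue magnitude is $> 1$, the product can grow exponentially in $k$ → exploding gradients.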
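The shrinking product can be observed numerically. The sketch below is an illustration under stated assumptions: the recurrent matrix is a scaled orthogonal matrix (so every singular value equals the scale), there are no inputs, and the names `jacobian_product_norms` and `results` are hypothetical. It only demonstrates the vanishing case. With tanh in the loop, the exploding case depends on how saturated the units are.

```python
import numpy as np

def jacobian_product_norms(W_hh, steps, rng):
    """Spectral norm of d h_t / d h_0 after each step of a tanh RNN with no inputs."""
    n = W_hh.shape[0]
    h = rng.normal(size=n) * 0.1
    J = np.eye(n)
    norms = []
    for _ in range(steps):
        h = np.tanh(W_hh @ h)
        # One-step Jacobian: diag(1 - h_t^2) @ W_hh, multiplied on the left.
        J = (np.diag(1.0 - h ** 2) @ W_hh) @ J
        norms.append(np.linalg.norm(J, 2))
    return norms

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # orthogonal: all singular values are 1
steps = 20
results = {}
for scale in (0.5, 0.9):
    results[scale] = jacobian_product_norms(scale * Q, steps, rng)
    print(f"scale={scale}: norm after {steps} steps = {results[scale][-1]:.3e}")
```

Since the tanh derivative never exceeds 1, each factor has norm at most `scale`, so the product after `steps` steps is bounded by `scale ** steps`.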
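Theorem: Gradient Magnitude
For a linear RNN ($h_t = W_{hh} h_{t-1} + W_{xh} x_t$):
$$\frac{\partial h_t}{\partial h_{t-k}} = W_{hh}^{k}, \qquad \left\| W_{hh}^{k} \right\| \sim |\lambda_{\max}|^{k} \text{ for large } k$$
where $\lambda_{\max}$ is the eigenvalue of $W_{hh}$ with the largest magnitude.
If $|\lambda_{\max}| < 1$, the gradient vanishes as $k \to \infty$. If $|\lambda_{\max}| > 1$, the gradient explodes as $k \to \infty$.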
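A quick worked example for the scalar special case, where the recurrent weight is a single number `lam` and the gradient over `k` steps is exactly `lam ** k`. The function name and the chosen values are illustrative assumptions.

```python
def linear_rnn_gradient_norm(lam, k):
    """|d h_t / d h_{t-k}| for the scalar linear RNN h_t = lam * h_{t-1} + input."""
    return abs(lam) ** k

for lam in (0.9, 1.0, 1.1):
    values = [f"{linear_rnn_gradient_norm(lam, k):.3g}" for k in (10, 50, 100)]
    print(f"lam={lam}: k=10, 50, 100 ->", values)
```

Even a weight close to 1 compounds quickly: over 100 steps, 0.9 gives a factor of about 2.7e-5, while 1.1 gives about 1.4e4.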
Solutions to Vanishing Gradients
Definition: Mitigating Vanishing Gradients
- LSTM/GRU: Gating mechanisms preserve gradient flow
- Orthogonal initialization: A recurrent matrix with $W_{hh}^\top W_{hh} = I$ preserves gradient magnitude through the linear part of each step
- Gradient clipping: Prevents exploding gradients
- Skip connections: Additive shortcuts (like ResNet)
- Teacher forcing: Feed ground-truth previous outputs as inputs during training
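As one concrete case from the list above, orthogonal initialization can be sketched with a QR decomposition. This is one common way to build an orthogonal matrix, assumed here for illustration. The helper name `orthogonal_init` is hypothetical, and the check ignores the tanh derivative, which still shrinks gradients in a real RNN.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Return an n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    # Fix column signs so the result does not depend on the QR sign convention.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
W_hh = orthogonal_init(16, rng)

# Push a gradient vector back through 100 linear steps.
g = rng.normal(size=16)
g_back = g.copy()
for _ in range(100):
    g_back = W_hh.T @ g_back  # linear part of one backward step

print("norm before:", np.linalg.norm(g))
print("norm after 100 steps:", np.linalg.norm(g_back))
```

Because an orthogonal matrix preserves Euclidean length, the two norms agree up to floating-point error.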
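Definition: Gradient Clipping
Clipping by norm rescales the gradient $g$ whenever its norm exceeds a threshold $\theta$:
$$g \leftarrow g \cdot \min\!\left(1, \frac{\theta}{\|g\|}\right)$$
This prevents exploding gradients without changing the gradient direction. The threshold $\theta$ is a hyperparameter.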
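Norm clipping is short enough to write out directly. The sketch below clips the joint norm of a list of gradient arrays. The function name `clip_by_global_norm` and the threshold `max_norm=1.0` are assumptions for the example, not recommended settings.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1 when the norm is already within the threshold.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]  # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Every array is multiplied by the same scalar, so the direction of the overall gradient is unchanged and only its length is reduced.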
Applications
Definition: RNN Applications
| Task | Architecture | Description |
|---|---|---|
| Language modeling | RNN/LSTM | Predict next word |
| Machine translation | Seq2Seq | Encoder-decoder with attention |
| Sentiment analysis | RNN | Classify text sentiment |
| Speech recognition | LSTM | Audio to text |
| Music generation | RNN | Generate musical sequences |
| Time series forecasting | LSTM | Predict future values |
Practical Considerations
RNN Training Tips
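- Gradient clipping: Always use for RNNs (clipping by norm with a fixed threshold)
- Teacher forcing: Use during training. Scheduled sampling, also a training-time technique, gradually replaces ground-truth inputs with the model's own predictions to reduce the mismatch with inference, where only the model's predictions are available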
- Hidden state initialization: Usually zeros, or learned
- Truncated BPTT: For very long sequences, truncate gradient flow (e.g., 100 steps)
- Pack padded sequences: Handle variable-length sequences efficiently
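The windowing step of truncated BPTT from the tips above can be sketched as follows. This shows only how a long sequence is split into consecutive chunks. The helper name `tbptt_windows` is hypothetical, and the training step is described in comments rather than implemented.

```python
def tbptt_windows(seq_len, window):
    """Yield (start, end) index pairs covering range(seq_len) in consecutive windows."""
    for start in range(0, seq_len, window):
        yield start, min(start + window, seq_len)

windows = list(tbptt_windows(seq_len=250, window=100))
print(windows)  # [(0, 100), (100, 200), (200, 250)]

# In a training loop, each window would:
#   1. start from the hidden state carried over from the previous window,
#      treated as a constant so no gradient flows into earlier windows;
#   2. run the forward pass over xs[start:end];
#   3. backpropagate only within that window and update the weights.
```

The hidden state still carries information forward across windows. Only the backward pass is cut off at the window boundary.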
Limitations of Vanilla RNNs
- Vanishing gradients limit memory to ~10-20 time steps
- Struggle to capture long-range dependencies in practice
- Sequential computation prevents parallelization
- tanh saturation causes gradient issues
- For long sequences, use LSTM/GRU or Transformers
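The tanh saturation point in the list above can be checked numerically. The derivative of tanh at pre-activation `a` is `1 - tanh(a)**2`, which is the per-unit factor that enters every backward step. The function name and sample values are illustrative.

```python
import math

def tanh_derivative(a):
    """Derivative of tanh at pre-activation a: 1 - tanh(a)^2."""
    return 1.0 - math.tanh(a) ** 2

for a in (0.0, 1.0, 2.0, 5.0):
    print(f"a={a}: tanh'(a) = {tanh_derivative(a):.2e}")
```

The derivative is 1 at zero and falls off rapidly as the magnitude of the pre-activation grows, so a saturated unit passes almost no gradient back.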
Summary
- Vanilla RNN maintains a hidden state that evolves over time through recurrence
- BPTT unrolls the RNN and applies backpropagation through the unrolled graph
- Vanishing gradients limit vanilla RNNs to ~10-20 time steps
- Exploding gradients can be mitigated with gradient clipping
- Solutions: LSTM/GRU (gating), orthogonal initialization, skip connections
- Transformers have largely replaced RNNs for most sequence tasks due to parallelization