RNNs and LSTMs — Neural Networks That Remember
Explore recurrent neural networks designed to process sequential data with memory of past inputs.
- Sequential processing — handle time series and text data
- LSTM gates — solve the vanishing gradient problem
- GRU simplification — efficient recurrent architectures
Memory is the diary we all carry about with us.
RNN, LSTM and GRU — Complete Guide
Recurrent networks process sequential data by maintaining a hidden state that carries information across time steps. Unlike Transformers, they process one token at a time, which needs only $O(1)$ memory per step but $O(T)$ sequential operations for a sequence of length $T$.
Vanilla RNN
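At each time step $t$, the RNN computes:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \qquad y_t = W_{hy} h_t + b_y$$
Here $W_{hh}$ and $W_{xh}$ are the hidden-to-hidden and input-to-hidden blocks of a single weight matrix $W$ applied to the concatenation $[h_{t-1}, x_t]$.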
How the RNN processes sequences: This diagram shows an RNN "unrolled" across time steps. At each step t, the RNN cell takes two inputs: the current input x_t (e.g., a word in a sentence) and the previous hidden state h_{t-1} (a summary of all past inputs). It combines them through W·[h,x]+b and applies tanh to produce the new hidden state h_t, from which the output y_t is computed.
The crucial detail is that the same weights W are used at every time step. This weight sharing across time means the model learns a single function that works regardless of sequence length. The hidden state h_t acts as a compressed memory of everything seen so far, which also exposes the information bottleneck: all past information must fit into the fixed-size vector h_t.
The red text at the bottom of the diagram highlights the critical flaw: during backpropagation, gradients are multiplied through the chain of time steps, so they vanish (or explode) exponentially, which makes it very hard in practice to learn dependencies longer than ~10-20 steps.
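The unrolled recurrence can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`matvec`, `rnn_step`, `rnn_forward`) and the weight values are invented for the example, W is split into an input block `W_xh` and a hidden block `W_hh` (one common way of writing W·[h,x]), and the output projection to y_t is left out.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_state, b_h)]

def rnn_forward(inputs, h0, W_xh, W_hh, b_h):
    """Unroll over a sequence, reusing the same weights at every step."""
    h, states = h0, []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# Illustrative weights (assumed values): 1-dimensional input, 2-dimensional hidden state.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.1]
h0 = [0.0, 0.0]

# The same function and weights handle sequences of different lengths.
short = rnn_forward([[1.0], [0.5]], h0, W_xh, W_hh, b_h)
longer = rnn_forward([[1.0], [0.5], [-1.0], [2.0]], h0, W_xh, W_hh, b_h)
```

Because each hidden state depends only on past inputs, the first two states of `longer` are identical to `short`, and every state stays a fixed-size vector of length 2 no matter how long the sequence gets.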
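Vanishing and Exploding Gradients
During backpropagation through time (BPTT), the gradient of a later hidden state with respect to an earlier one is a product of per-step Jacobians:
$$\frac{\partial h_T}{\partial h_k} = \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \prod_{t=k+1}^{T} \operatorname{diag}\big(\tanh'(z_t)\big)\, W_{hh}$$
where $z_t$ is the pre-activation at step $t$ and $W_{hh}$ is the hidden-to-hidden weight matrix.
Since $|\tanh'(z)| \le 1$ (for tanh, max is 1), the product either:
- Vanishes when the largest singular value of $W_{hh}$ is below 1 → gradients → 0 exponentially
- Can explode when the largest singular value of $W_{hh}$ is above 1 (a necessary, not sufficient, condition) → gradients → ∞
The spectral radius $\rho(W_{hh})$ determines long-term gradient flow. For stable gradients, need $\rho(W_{hh}) \approx 1$.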
Effective memory: Vanilla RNNs can learn dependencies up to ~10-20 steps (Pascanu et al., 2013).
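A scalar toy model makes the exponential decay concrete. The sketch below assumes a one-unit RNN with no input, h_t = tanh(w_hh · h_{t-1}), and multiplies the per-step factors tanh'(z_t) · w_hh; the function name and the weight values are illustrative only. The exploding case is shown for the linear bound |w_hh|^T, since a saturating tanh can mask the growth in this toy setting.

```python
import math

def gradient_through_time(w_hh, h0, steps):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w_hh * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w_hh * h)
        # One Jacobian factor: tanh'(z_t) * w_hh, with tanh'(z_t) = 1 - h_t**2.
        grad *= (1.0 - h * h) * w_hh
    return grad

# |w_hh| < 1: the gradient shrinks at least as fast as 0.5 ** steps.
vanishing = gradient_through_time(0.5, h0=0.1, steps=30)
vanishing_bound = 0.5 ** 30

# |w_hh| > 1: without saturation, the per-step factor compounds to a huge value.
exploding_bound = 1.5 ** 30
```

After 30 steps the first gradient is below one billionth, while the second bound exceeds one hundred thousand, which is why long sequences are so hard to train on with a plain RNN.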
LSTM (Long Short-Term Memory)
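LSTM (Hochreiter and Schmidhuber, 1997) introduces a cell state $c_t$ as an information highway, with three gates controlling information flow:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)}$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)}$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \quad \text{(candidate cell update)}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$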
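Why LSTM Solves Vanishing Gradients
The cell state update is linear in $c_{t-1}$:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
The gradient through the cell state (along the direct path, treating the gate values as constants) is:
$$\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)$$
When $f_t \approx 1$ (forget gate open), the gradient flows unchanged: $\partial c_t / \partial c_{t-1} \approx I$. This creates a gradient highway that largely avoids vanishing.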
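The highway effect can be checked numerically on a scalar cell. The sketch below assumes the gate values are fixed numbers (in a real LSTM they depend on the input and the previous hidden state), so the gradient along the cell-state path is simply the product of the forget-gate values; the function names and the gate values 0.99 and 0.5 are chosen only for illustration.

```python
def cell_gradient(forget_gates):
    """Product of forget-gate values: d c_T / d c_0 along the direct cell path."""
    grad = 1.0
    for f_t in forget_gates:
        grad *= f_t
    return grad

def roll_cell(c0, forget_gates, writes):
    """Apply c_t = f_t * c_{t-1} + w_t, where w_t stands in for i_t * c_tilde_t."""
    c = c0
    for f_t, w_t in zip(forget_gates, writes):
        c = f_t * c + w_t
    return c

steps = 100
open_gates = cell_gradient([0.99] * steps)    # forget gate nearly open
half_closed = cell_gradient([0.5] * steps)    # forget gate half closed

# Finite-difference check that the product really is the derivative of the rollout.
eps = 1e-3
writes = [0.1] * steps
finite_diff = (roll_cell(1.0 + eps, [0.99] * steps, writes)
               - roll_cell(1.0, [0.99] * steps, writes)) / eps
```

With gates at 0.99 roughly a third of the gradient survives 100 steps, whereas gates at 0.5 leave essentially nothing, so the benefit depends on the network learning to keep the forget gate open where long memory is needed.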
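Parameters per LSTM cell: $4\,\big(d_h (d_x + d_h) + d_h\big)$, where $d_x$ is the input size and $d_h$ the hidden size (4 weight blocks: three gates plus the candidate cell update, each with input weights, hidden weights, and a bias).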
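As a worked example of the count, the sketch below evaluates the formula for an assumed input size of 128 and hidden size of 256. The function name and the `biases_per_block` argument are illustrative; the second call reflects the fact that PyTorch's `nn.LSTM` stores two bias vectors per block (`bias_ih` and `bias_hh`), so its reported total is slightly higher than the textbook figure.

```python
def lstm_param_count(d_x, d_h, biases_per_block=1):
    """Parameters in one LSTM layer: 4 blocks of (input + hidden weights + bias)."""
    block = d_h * (d_x + d_h) + biases_per_block * d_h
    return 4 * block

textbook = lstm_param_count(128, 256)                           # one bias per block
pytorch_style = lstm_param_count(128, 256, biases_per_block=2)  # two biases per block

print(textbook)       # 394240
print(pytorch_style)  # 395264
```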
GRU (Gated Recurrent Unit)
GRU (Cho et al., 2014) simplifies LSTM by merging the cell and hidden state and using only two gates:
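GRU Equations
$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z) \quad \text{(update gate)}$$
$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) \quad \text{(reset gate)}$$
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h) \quad \text{(candidate state)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
Some references and libraries swap the roles of $z_t$ and $1 - z_t$ in the last line; the two conventions are equivalent.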
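Comparison: GRU has about 25% fewer parameters than an LSTM of the same size (three weight blocks instead of four: two gates plus the candidate state, versus three gates plus the candidate cell update, and no separate cell state). Empirically, performance is often comparable, and GRUs tend to train somewhat faster because each step does less computation.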
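A scalar sketch shows how the two GRU gates act and where the 25% figure comes from. It assumes the convention h_t = (1 - z_t) · h_{t-1} + z_t · h̃_t (some references swap z_t and 1 - z_t); the parameter names in the dictionary, their values, and the sizes 128 and 256 are all made up for the example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x_t, h_prev, p):
    """One scalar GRU step under the (1 - z) * h_prev + z * h_tilde convention."""
    z = sigmoid(p["w_z"] * x_t + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x_t + p["u_r"] * h_prev + p["b_r"])  # reset gate
    h_tilde = math.tanh(p["w_h"] * x_t + p["u_h"] * (r * h_prev) + p["b_h"])
    return (1.0 - z) * h_prev + z * h_tilde

def gated_param_count(d_x, d_h, blocks):
    """Parameters for `blocks` weight blocks, each with input, hidden, and bias terms."""
    return blocks * (d_h * (d_x + d_h) + d_h)

params = {"w_z": 0.5, "u_z": 0.5, "b_z": 0.0,
          "w_r": 0.5, "u_r": 0.5, "b_r": 0.0,
          "w_h": 1.0, "u_h": 1.0, "b_h": 0.0}

# A strongly negative update-gate bias closes the gate: the old state is copied.
h_keep = gru_step(1.0, 0.7, dict(params, b_z=-20.0))
# A strongly positive bias opens it: the state is overwritten by the candidate.
h_write = gru_step(1.0, 0.7, dict(params, b_z=20.0))

# Three blocks (GRU) versus four (LSTM) at the same sizes.
ratio = gated_param_count(128, 256, blocks=3) / gated_param_count(128, 256, blocks=4)
```

Here `h_keep` stays at approximately 0.7, `h_write` lands on the candidate value, and `ratio` comes out at 0.75, matching the roughly 25% saving.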
Bidirectional RNN
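Processes the sequence in both directions and concatenates hidden states:
$$h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]$$
Use cases: NER and sentiment analysis, where the full sequence is available up front. Cannot be used for autoregressive generation, because the backward pass needs tokens that have not been produced yet.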
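The two-pass idea can be sketched with a scalar tanh recurrence standing in for the RNN cell. The function names, the weights, and the input sequence are illustrative assumptions; the point is only the mechanics: run once left to right, once right to left, realign, and pair up the states per position.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Scalar recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1}), starting from 0."""
    h, states = 0.0, []
    for x_t in inputs:
        h = math.tanh(w_x * x_t + w_h * h)
        states.append(h)
    return states

def bidirectional(inputs, fwd_weights, bwd_weights):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(inputs, *fwd_weights)
    # Run on the reversed sequence, then flip back so index t lines up again.
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    return [[f, b] for f, b in zip(forward, backward)]

fwd_weights = (0.8, 0.3)   # separate (assumed) weights per direction
bwd_weights = (0.6, 0.2)

states = bidirectional([0.5, -1.0, 2.0], fwd_weights, bwd_weights)
# Changing only the last input leaves the first forward state untouched
# but alters the first backward state, which has seen the whole sequence.
changed = bidirectional([0.5, -1.0, 3.0], fwd_weights, bwd_weights)
```

The backward state at position 0 depends on every later input, which is exactly the full-context property that tagging and classification tasks benefit from and that generation cannot use.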
Sequence-to-Sequence Architecture
PyTorch Implementation
Example: LSTM Model
import torch
import torch.nn as nn


class LSTMClassifier(nn.Module):
    """Bidirectional multi-layer LSTM followed by a linear classifier."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.3,  # applied between stacked layers; no effect if num_layers == 1
            bidirectional=True,
        )
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        embedded = self.embedding(x)                    # (B, T, E)
        output, (hidden, cell) = self.lstm(embedded)    # output: (B, T, 2H)
        # hidden is (num_layers * 2, B, H); its last two entries are the top
        # layer's final forward and backward states.
        hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (B, 2H)
        return self.fc(hidden_cat)                      # (B, C)
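The indexing `hidden[-2]` and `hidden[-1]` is easy to get wrong, so a small shape check may help. The sketch below uses `nn.LSTM` directly with arbitrary, assumed sizes (B=4, T=7, E=16, H=32) and no dropout, and compares the final hidden states with the matching slices of `output`: the forward direction finishes at the last time step, the backward direction at the first.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, E, H = 4, 7, 16, 32  # batch, sequence length, embedding size, hidden size

lstm = nn.LSTM(E, H, num_layers=2, batch_first=True, bidirectional=True)
x = torch.randn(B, T, E)
output, (hidden, cell) = lstm(x)

# hidden has shape (num_layers * 2, B, H); the last two entries are the top layer.
forward_final = hidden[-2]    # top layer, forward direction
backward_final = hidden[-1]   # top layer, backward direction
features = torch.cat([forward_final, backward_final], dim=1)  # (B, 2H)

# The same vectors appear in `output` at the ends of the sequence.
same_forward = torch.allclose(output[:, -1, :H], forward_final, atol=1e-6)
same_backward = torch.allclose(output[:, 0, H:], backward_final, atol=1e-6)
```

Note that with padded batches the forward state at the last time step would include padding positions; handling that (for example with packed sequences) is outside this sketch.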
Key Takeaways
Summary: RNNs and LSTMs
- RNNs process sequential data with hidden state, but suffer from vanishing gradients
- LSTM solves this with a cell state highway and three gates (forget, input, output)
- GRU is a simplified LSTM with 2 gates, ~25% fewer parameters, comparable performance
- Bidirectional RNNs process both directions for full context
- Seq2Seq (encoder-decoder) for translation, summarization — bottleneck is the context vector
- Teacher forcing accelerates training but causes train/test mismatch
- LSTMs have largely been replaced by Transformers for most NLP tasks
- LSTMs still useful for time series, edge devices, and streaming data