
RNNs and LSTMs — Neural Networks That Remember

Explore recurrent neural networks designed to process sequential data with memory of past inputs.

Sequential processing — handle time series and text data
LSTM gates — solve the vanishing gradient problem
GRU simplification — efficient recurrent architectures

Memory is the diary we all carry about with us.

RNN, LSTM and GRU — Complete Guide

Recurrent networks process sequential data by maintaining a hidden state $\mathbf{h}_t$ that carries information across time steps. Unlike Transformers, they process one token at a time: the state has a fixed size, so memory per step is $O(1)$ in the sequence length, but a sequence of length $n$ requires $O(n)$ sequential operations that cannot be parallelized across time.

Vanilla RNN

At each time step $t$, the RNN computes:

\mathbf{h}_t = \tanh(W_{hh} \mathbf{h}_{t-1} + W_{xh} \mathbf{x}_t + \mathbf{b})

\hat{\mathbf{y}}_t = W_{hy} \mathbf{h}_t

How the RNN processes sequences: Unrolled across time steps, the RNN cell takes two inputs at each step $t$: the current input $\mathbf{x}_t$ (e.g., a word in a sentence) and the previous hidden state $\mathbf{h}_{t-1}$ (a summary of all past inputs). It combines them as $W_{hh} \mathbf{h}_{t-1} + W_{xh} \mathbf{x}_t + \mathbf{b}$ and applies tanh to produce the new hidden state $\mathbf{h}_t$, from which the output $\hat{\mathbf{y}}_t$ is read.

Weight sharing: the same weights are used at every time step, so the model learns a single function that applies regardless of sequence length.

Information bottleneck: the hidden state $\mathbf{h}_t$ acts as a compressed memory of everything seen so far, so all past information must fit into one fixed-size vector.

Critical flaw: during backpropagation, gradients are multiplied through the chain of time steps, so they tend to vanish (or explode) exponentially. In practice this makes long-range dependencies beyond roughly 10-20 steps very hard to learn.
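The unrolled computation can be sketched in a few lines of NumPy. This is a minimal illustration of the two equations above, not a trainable implementation; the function name `rnn_forward`, the dimensions, and the random weights are assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, W_hy, b, h0=None):
    """Run a vanilla RNN over xs of shape (T, d_x); return all h_t and y_t."""
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    hs, ys = [], []
    for x in xs:                                  # one time step per input
        h = np.tanh(W_hh @ h + W_xh @ x + b)      # same weights at every step
        hs.append(h)
        ys.append(W_hy @ h)                       # linear readout of the state
    return np.stack(hs), np.stack(ys)

# Hypothetical sizes and weights, for illustration only
rng = np.random.default_rng(0)
d_x, d_h, d_y = 3, 5, 2
W_hh = 0.1 * rng.standard_normal((d_h, d_h))
W_xh = 0.1 * rng.standard_normal((d_h, d_x))
W_hy = 0.1 * rng.standard_normal((d_y, d_h))
b = np.zeros(d_h)

# The same parameters handle sequences of different lengths
hs_short, ys_short = rnn_forward(rng.standard_normal((4, d_x)), W_hh, W_xh, W_hy, b)
hs_long, ys_long = rnn_forward(rng.standard_normal((9, d_x)), W_hh, W_xh, W_hy, b)
print(hs_short.shape, ys_short.shape, hs_long.shape, ys_long.shape)
```

Because the weights are shared, nothing in `rnn_forward` depends on the sequence length; only the loop runs longer.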

Vanishing and Exploding Gradients

During backpropagation through time (BPTT), gradients are computed as:

\frac{\partial \mathcal{L}_T}{\partial \mathbf{h}_t} = \frac{\partial \mathcal{L}_T}{\partial \mathbf{h}_T} \prod_{k=t+1}^{T} \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}} = \frac{\partial \mathcal{L}_T}{\partial \mathbf{h}_T} \prod_{k=t+1}^{T} W_{hh}^\top \cdot \text{diag}(\tanh'(\mathbf{z}_k))

Here $\mathbf{z}_k = W_{hh} \mathbf{h}_{k-1} + W_{xh} \mathbf{x}_k + \mathbf{b}$ is the pre-activation at step $k$. Since $\tanh'(\mathbf{z}_k) \in (0, 1]$, the norm of each factor is at most $\|W_{hh}\| \cdot \max|\tanh'|$, and the product:

Vanishes when $\|W_{hh}\| \cdot \max|\tanh'| < 1$ → gradients → 0 exponentially (a sufficient condition)
Can explode only when $\|W_{hh}\| \cdot \max|\tanh'| > 1$ → gradients → ∞ (a necessary condition, not a sufficient one, because saturated units shrink the factors)

The spectral radius $\rho(W_{hh})$ largely determines long-term gradient flow. For stable gradients, we need $\rho(W_{hh}) \approx 1$.

Effective memory: Vanilla RNNs can learn dependencies up to ~10-20 steps (Pascanu et al., 2013).
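The effect of the repeated Jacobian product can be checked numerically. The sketch below is an illustration under stated assumptions: `bptt_grad_norms` is a hypothetical helper, the bias is omitted, and $W_{hh}$ is a scaled orthogonal matrix so that all of its singular values equal the chosen scale. The exploding case uses zero inputs, which keeps the state at zero and $\tanh' = 1$; with saturating units, explosion is not guaranteed.

```python
import numpy as np

def bptt_grad_norms(W_hh, W_xh, xs):
    """Return ||dL/dh_t|| for t = T, T-1, ..., 0, starting from a unit-norm dL/dh_T."""
    d_h = W_hh.shape[0]
    h = np.zeros(d_h)
    tanh_primes = []
    for x in xs:                                   # forward pass (bias omitted)
        h = np.tanh(W_hh @ h + W_xh @ x)
        tanh_primes.append(1.0 - h ** 2)           # tanh'(z) = 1 - tanh(z)^2
    grad = np.ones(d_h) / np.sqrt(d_h)             # assumed unit-norm gradient at step T
    norms = [np.linalg.norm(grad)]
    for d in reversed(tanh_primes):                # backward pass through time
        grad = W_hh.T @ (d * grad)                 # W_hh^T diag(tanh'(z_k)) grad
        norms.append(np.linalg.norm(grad))
    return norms

rng = np.random.default_rng(0)
d_h, d_x, T = 16, 4, 50
Q, _ = np.linalg.qr(rng.standard_normal((d_h, d_h)))   # orthogonal matrix
W_xh = rng.standard_normal((d_h, d_x))

vanishing = bptt_grad_norms(0.5 * Q, W_xh, rng.standard_normal((T, d_x)))
exploding = bptt_grad_norms(1.2 * Q, W_xh, np.zeros((T, d_x)))
print(f"scale 0.5: {vanishing[-1]:.2e}   scale 1.2: {exploding[-1]:.2e}")
```

With a scale of 0.5 the gradient norm after 50 steps is bounded by $0.5^{50}$, while in the linear regime a scale of 1.2 grows it by a factor of $1.2^{50}$.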

LSTM (Long Short-Term Memory)

LSTM (Hochreiter and Schmidhuber, 1997) introduces a cell state $\mathbf{c}_t$ as an information highway, with three gates (forget, input, and output) controlling information flow.
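For reference, the standard LSTM formulation can be written in the same concatenation notation this guide uses for the GRU. The weight and bias names ($W_f$, $W_i$, $W_o$, $W_c$ and the matching $\mathbf{b}$ terms) are labels assumed here for illustration; variants such as peephole connections are not covered.

```latex
\mathbf{f}_t = \sigma(W_f \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) \quad \text{(forget gate)}

\mathbf{i}_t = \sigma(W_i \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i) \quad \text{(input gate)}

\mathbf{o}_t = \sigma(W_o \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) \quad \text{(output gate)}

\tilde{\mathbf{c}}_t = \tanh(W_c \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c) \quad \text{(candidate)}

\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t

\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
```

The forget gate decides how much of the old cell state to keep, the input gate how much of the candidate to write, and the output gate how much of the cell state to expose as $\mathbf{h}_t$.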

Why LSTM Solves Vanishing Gradients

The cell state update is linear:

\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t

The gradient through the cell state is:

\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t) + \text{other terms}

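When $\mathbf{f}_t \approx \mathbf{1}$ (forget gate open), the direct path passes the gradient almost unchanged: $\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} \approx I$. The other terms arise because the gates and the candidate depend on $\mathbf{h}_{t-1}$, which itself depends on $\mathbf{c}_{t-1}$. The additive path acts as a gradient highway that mitigates vanishing, although gradients can still decay when the forget gate stays well below 1.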
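A quick arithmetic check makes the highway argument concrete. Along the direct path, the gradient for one cell unit is scaled by the product of its forget-gate activations. The sketch below ignores the "other terms" and uses constant gate values, both simplifying assumptions; `cell_gradient_scale` is a hypothetical helper name.

```python
import math

def cell_gradient_scale(forget_values):
    """Product of forget-gate activations along the direct c_{t-1} -> c_t path."""
    return math.prod(forget_values)

steps = 100
open_gate = cell_gradient_scale([0.99] * steps)   # forget gate nearly open
half_gate = cell_gradient_scale([0.5] * steps)    # forget gate half closed

print(f"f = 0.99: {open_gate:.3f}")
print(f"f = 0.50: {half_gate:.3e}")
```

With the gate near 1, roughly a third of the gradient survives 100 steps; at 0.5 it is effectively zero, which is why the forget-gate bias is often initialized to a positive value.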

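Parameters per LSTM cell: $4 \times (d_h \times (d_h + d_x) + d_h)$ (four weight blocks, one each for the three gates and the candidate, each with input weights, hidden weights, and a bias). Some libraries, including PyTorch, keep separate input and hidden bias vectors, which adds another $4 d_h$ parameters.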

GRU (Gated Recurrent Unit)

GRU (Cho et al., 2014) simplifies LSTM by merging the cell and hidden state and using only two gates:

GRU Equations

\mathbf{r}_t = \sigma(W_r \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r) \quad \text{(reset gate)}

\mathbf{z}_t = \sigma(W_z \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z) \quad \text{(update gate)}

\tilde{\mathbf{h}}_t = \tanh(W \cdot [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(candidate)}

\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t
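The four equations translate almost line for line into NumPy. This is a single-step sketch rather than a full layer; `gru_step` and `sigmoid` are names chosen here, the candidate weight matrix is called `W_h`, and, following the equations as written, the candidate has no bias.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, W_r, W_z, W_h, b_r, b_z):
    """One GRU step following the equations above."""
    hx = np.concatenate([h_prev, x])                           # [h_{t-1}, x_t]
    r = sigmoid(W_r @ hx + b_r)                                # reset gate
    z = sigmoid(W_z @ hx + b_z)                                # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))   # candidate
    return (1.0 - z) * h_prev + z * h_tilde                    # interpolate old and new

# Hypothetical sizes and weights, for illustration only
rng = np.random.default_rng(0)
d_h, d_x = 4, 3
W_r, W_z, W_h = (0.1 * rng.standard_normal((d_h, d_h + d_x)) for _ in range(3))
b_r = np.zeros(d_h)
h_prev = rng.uniform(-1.0, 1.0, d_h)
x = rng.standard_normal(d_x)

# A very negative update-gate bias closes the gate, so the old state is copied
h_keep = gru_step(h_prev, x, W_r, W_z, W_h, b_r, np.full(d_h, -50.0))
# A very positive bias opens it, so the state is replaced by the candidate
h_new = gru_step(h_prev, x, W_r, W_z, W_h, b_r, np.full(d_h, 50.0))
```

In this convention $\mathbf{z}_t \approx 0$ keeps the previous state. Some libraries use the opposite convention, with $\mathbf{z}_t$ weighting the old state.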

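Comparison: GRU has ~25% fewer parameters than LSTM (three weight blocks for two gates and a candidate, versus four for three gates and a candidate, and no separate cell state). Empirically, performance is comparable and depends on the task; GRU often trains somewhat faster because it has fewer parameters.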
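The ~25% figure follows directly from counting weight blocks. The sketch below assumes one bias vector per block, as in the LSTM parameter formula above; the helper name `gated_rnn_params` and the layer sizes are illustrative.

```python
def gated_rnn_params(d_x, d_h, num_blocks):
    """Parameters for num_blocks blocks, each with a (d_h, d_h + d_x) matrix and a bias."""
    return num_blocks * (d_h * (d_h + d_x) + d_h)

d_x, d_h = 128, 256                                # hypothetical layer sizes
lstm = gated_rnn_params(d_x, d_h, num_blocks=4)    # 3 gates + candidate
gru = gated_rnn_params(d_x, d_h, num_blocks=3)     # 2 gates + candidate
saving = 1 - gru / lstm

print(lstm, gru, f"{saving:.0%}")
```

The ratio is exactly 3/4 whatever the layer sizes, so the saving does not depend on $d_x$ or $d_h$.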

Bidirectional RNN

A bidirectional RNN processes the sequence in both directions and concatenates the hidden states:

\overrightarrow{\mathbf{h}_t} = \text{RNN}_{\text{fwd}}(\mathbf{x}_t, \overrightarrow{\mathbf{h}_{t-1}})

\overleftarrow{\mathbf{h}_t} = \text{RNN}_{\text{bwd}}(\mathbf{x}_t, \overleftarrow{\mathbf{h}_{t+1}})

\mathbf{h}_t = [\overrightarrow{\mathbf{h}_t} \| \overleftarrow{\mathbf{h}_t}]

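Use case: tasks where the full sequence is available up front, such as NER and sentiment analysis. It cannot be used for autoregressive generation, because the backward pass needs tokens that have not been generated yet.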
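A bidirectional layer can be sketched as two independent RNNs, one of which reads the reversed sequence. The helper names `run_rnn` and `bidirectional_rnn`, the omitted bias, and the random weights are assumptions for this example.

```python
import numpy as np

def run_rnn(xs, W_hh, W_xh):
    """Vanilla tanh RNN without bias; returns hidden states of shape (T, d_h)."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
        hs.append(h)
    return np.stack(hs)

def bidirectional_rnn(xs, fwd_params, bwd_params):
    h_fwd = run_rnn(xs, *fwd_params)                 # reads x_1 ... x_T
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]     # reads x_T ... x_1, then re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)    # (T, 2 * d_h)

rng = np.random.default_rng(0)
T, d_x, d_h = 4, 3, 4
fwd = (0.5 * rng.standard_normal((d_h, d_h)), 0.5 * rng.standard_normal((d_h, d_x)))
bwd = (0.5 * rng.standard_normal((d_h, d_h)), 0.5 * rng.standard_normal((d_h, d_x)))
xs = rng.standard_normal((T, d_x))

out = bidirectional_rnn(xs, fwd, bwd)

# Changing the last input affects position 0 only through the backward half
xs_changed = xs.copy()
xs_changed[-1] += 1.0
out_changed = bidirectional_rnn(xs_changed, fwd, bwd)
```

The comparison at the end shows why generation is ruled out: the representation at position 0 already depends on the final token.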

Sequence-to-Sequence Architecture

PyTorch Implementation

Example: LSTM Model

import torch            # needed for torch.cat in forward()
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.3,
            bidirectional=True
        )
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        embedded = self.embedding(x)                     # (B, T, E)
        output, (hidden, cell) = self.lstm(embedded)     # output: (B, T, 2H)
        hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (B, 2H)
        return self.fc(hidden_cat)                       # (B, C)
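The indexing `hidden[-2]` and `hidden[-1]` relies on how `nn.LSTM` lays out its final hidden states. The standalone check below, with arbitrary sizes chosen for illustration, suggests that for a bidirectional LSTM these two slices are the last layer's forward state (after the final time step) and backward state (after the first time step).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, E, H, num_layers = 4, 5, 8, 16, 2      # illustrative sizes
lstm = nn.LSTM(E, H, num_layers=num_layers, batch_first=True, bidirectional=True)
x = torch.randn(B, T, E)

output, (h_n, c_n) = lstm(x)

# h_n has shape (num_layers * 2, B, H): layer by layer, forward then backward.
# Note that h_n is not batch-first even when batch_first=True.
last_fwd, last_bwd = h_n[-2], h_n[-1]

print(output.shape, h_n.shape)
print(torch.allclose(last_fwd, output[:, -1, :H]))   # forward state ends at t = T
print(torch.allclose(last_bwd, output[:, 0, H:]))    # backward state ends at t = 1
```

With padded batches the forward final state would include padding steps; packing the sequences (for example with `torch.nn.utils.rnn.pack_padded_sequence`) is the usual way to avoid that.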

Key Takeaways

Summary: RNNs and LSTMs

RNNs process sequential data with a hidden state, but suffer from vanishing gradients
LSTM mitigates this with a cell state highway and three gates (forget, input, output)
GRU is a simplified LSTM with 2 gates, ~25% fewer parameters, and comparable performance
Bidirectional RNNs process both directions for full context
Seq2Seq (encoder-decoder) models handle translation and summarization; the fixed-size context vector is the bottleneck
Teacher forcing accelerates training but causes a train/test mismatch
LSTMs have largely been replaced by Transformers for most NLP tasks
LSTMs remain useful for time series, edge devices, and streaming data

What to Learn Next

-> Transformers Learn the architecture replacing RNNs.

-> NLP Fundamentals Master natural language processing basics.

-> Time Series Analysis Apply RNNs to time-dependent data.

-> Attention Deep Dive Understand how attention solves the bottleneck.

-> Neural Networks Understand the foundation of deep learning.

-> Sequence-to-Sequence Build models for translation and summarization.
