
RNN, LSTM and GRU — Sequential Data Complete Guide


RNNs and LSTMs — Neural Networks That Remember

Explore recurrent neural networks designed to process sequential data with memory of past inputs.

  • Sequential processing — handle time series and text data
  • LSTM gates — solve the vanishing gradient problem
  • GRU simplification — efficient recurrent architectures

Memory is the diary we all carry about with us.

RNN, LSTM and GRU — Complete Guide

Recurrent networks process sequential data by maintaining a hidden state $\mathbf{h}_t$ that carries information across time steps. Unlike Transformers, they process one token at a time, using $O(1)$ memory per step but requiring $O(n)$ sequential operations for a sequence of length $n$.


Vanilla RNN

At each time step $t$, the RNN computes:

$$\mathbf{h}_t = \tanh(W_{hh} \mathbf{h}_{t-1} + W_{xh} \mathbf{x}_t + \mathbf{b})$$
$$\hat{\mathbf{y}}_t = W_{hy} \mathbf{h}_t$$

[Figure: Unrolled RNN over time. The same RNN cell, $\tanh(W \cdot [\mathbf{h}, \mathbf{x}] + \mathbf{b})$, appears at steps $t-1$, $t$, $t+1$, ..., $T$. Each copy takes the input $\mathbf{x}_t$, emits the output $\hat{\mathbf{y}}_t$, and passes $\mathbf{h}_t$ to the next step. The same weights are used at every step.]

  • The RNN cell is the same function applied at every time step (weight sharing across time)
  • Hidden state $\mathbf{h}_t \in \mathbb{R}^d$ encodes all information about the past (fixed-size bottleneck)
  • Sequence length $n$ determines the number of unrolled steps, which cannot be parallelized across time
  • Problem: the gradient $\partial \mathcal{L} / \partial \mathbf{h}_t$ contains the product $\prod_k \partial \mathbf{h}_k / \partial \mathbf{h}_{k-1}$, which can shrink toward 0 exponentially (vanishing gradient)

How the RNN processes sequences: The diagram shows an RNN "unrolled" across time steps. At each step $t$, the RNN cell takes two inputs: the current input $\mathbf{x}_t$ (e.g., a word in a sentence) and the previous hidden state $\mathbf{h}_{t-1}$ (a summary of all past inputs). It combines them through $W \cdot [\mathbf{h}, \mathbf{x}] + \mathbf{b}$, which is the concatenated form of $W_{hh} \mathbf{h}_{t-1} + W_{xh} \mathbf{x}_t + \mathbf{b}$, and applies tanh to produce the new hidden state $\mathbf{h}_t$, from which the output $\hat{\mathbf{y}}_t$ is computed.

The crucial detail is that the same weights are used at every time step. This weight sharing across time means the model learns a single function that applies regardless of sequence length. The hidden state acts as a compressed memory of everything seen so far, which is also an information bottleneck: all past information must fit into the fixed-size vector $\mathbf{h}_t$.

The critical flaw appears during backpropagation: gradients are multiplied through the chain of time steps, so they tend to vanish (or explode) exponentially. In practice this makes dependencies longer than roughly 10-20 steps very hard to learn.
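The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training-ready implementation: the function name `rnn_forward`, the toy dimensions, and the random weight scale are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, w_xh, w_hh, w_hy, b, h0=None):
    """Run a vanilla RNN over xs of shape (T, d_x); return hidden states and outputs."""
    h = np.zeros(w_hh.shape[0]) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:                              # strictly sequential: step t needs h_{t-1}
        h = np.tanh(w_hh @ h + w_xh @ x_t + b)  # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
        hs.append(h)
        ys.append(w_hy @ h)                     # y_t = W_hy h_t
    return np.stack(hs), np.stack(ys)

# Toy sizes (assumed for illustration only).
rng = np.random.default_rng(0)
T, d_x, d_h, d_y = 5, 3, 4, 2
xs = rng.standard_normal((T, d_x))
w_xh = rng.standard_normal((d_h, d_x)) * 0.1
w_hh = rng.standard_normal((d_h, d_h)) * 0.1
w_hy = rng.standard_normal((d_y, d_h)) * 0.1
b = np.zeros(d_h)

hs, ys = rnn_forward(xs, w_xh, w_hh, w_hy, b)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

The same four arrays are reused at every iteration of the loop, which is the weight sharing the text describes, and the loop itself is the reason the computation cannot be parallelized across time.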

Vanishing and Exploding Gradients

During backpropagation through time (BPTT), gradients are computed as:

$$\frac{\partial \mathcal{L}_T}{\partial \mathbf{h}_t} = \frac{\partial \mathcal{L}_T}{\partial \mathbf{h}_T} \prod_{k=t+1}^{T} \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}} = \frac{\partial \mathcal{L}_T}{\partial \mathbf{h}_T} \prod_{k=t+1}^{T} W_{hh}^\top \cdot \text{diag}(\sigma'(\mathbf{z}_k))$$

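Here $\sigma$ is the activation function (tanh in the RNN above) and $\mathbf{z}_k$ is its pre-activation input. Since $\sigma'(\mathbf{z}_k) \in (0, 1]$ for tanh (the maximum is 1), the product either:

  • Vanishes when $\|W_{hh}\| \cdot \max|\sigma'| < 1$: every factor shrinks the gradient, so it decays to 0 exponentially
  • Can explode when $\|W_{hh}\| \cdot \max|\sigma'| > 1$: this condition is necessary but not sufficient, because saturated units can still damp the product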

The spectral radius $\rho(W_{hh})$ governs long-term gradient flow. Stable gradients require $\rho(W_{hh}) \approx 1$.
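A small numerical experiment makes the two regimes concrete. The sketch below multiplies the per-step Jacobians $\text{diag}(\tanh'(\mathbf{z}_k)) W_{hh}$ and reports the spectral norm of the product (the transposed product in the formula above has the same norm). Everything here is an assumption chosen for illustration: the helper name `jacobian_product_norm`, the use of a scaled orthogonal matrix so that every singular value of $W_{hh}$ equals `scale`, the random vector standing in for $W_{xh} \mathbf{x}_t + \mathbf{b}$, and the `linear` flag that fixes the activation derivative at its maximum of 1.

```python
import numpy as np

def jacobian_product_norm(scale, steps, hidden=16, linear=False, seed=0):
    """Spectral norm of the product of dh_k/dh_{k-1} over `steps` time steps."""
    rng = np.random.default_rng(seed)
    # Scaled orthogonal matrix: every singular value of W_hh equals `scale`.
    q, _ = np.linalg.qr(rng.standard_normal((hidden, hidden)))
    w_hh = scale * q
    h = np.zeros(hidden)
    prod = np.eye(hidden)
    for _ in range(steps):
        drive = rng.standard_normal(hidden)   # stands in for W_xh x_t + b
        z = w_hh @ h + drive
        h = z if linear else np.tanh(z)
        # tanh'(z) = 1 - tanh(z)^2; in the linear regime the derivative is 1.
        d = np.ones(hidden) if linear else 1.0 - h ** 2
        jac = d[:, None] * w_hh               # diag(d) @ W_hh
        prod = jac @ prod                     # newest Jacobian multiplies on the left
    return np.linalg.norm(prod, 2)

print("scale 0.5, tanh,   50 steps:", jacobian_product_norm(0.5, 50))
print("scale 1.5, linear, 20 steps:", jacobian_product_norm(1.5, 20, linear=True))
print("scale 1.5, tanh,   20 steps:", jacobian_product_norm(1.5, 20))
```

With `scale=0.5` the norm is bounded by $0.5^{50}$, so the gradient signal is effectively gone. With `scale=1.5` and derivative 1 it grows as $1.5^{20}$. The last line shows why the explosion condition is only necessary: with tanh, saturation can pull the product back down.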

Effective memory: In practice, vanilla RNNs struggle to learn dependencies that span more than roughly 10-20 steps (Pascanu et al., 2013).


LSTM (Long Short-Term Memory)

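LSTM (Hochreiter and Schmidhuber, 1997) introduces a cell state $\mathbf{c}_t$ as an information highway, with three gates controlling information flow:

[Figure: LSTM cell architecture. The cell state $\mathbf{c}_t$ runs across the top as the information highway. The forget gate scales $\mathbf{c}_{t-1}$, the input gate scales the candidate $\tilde{\mathbf{c}}_t$, the two are added to form $\mathbf{c}_t$, and the output gate scales $\tanh(\mathbf{c}_t)$ to produce $\mathbf{h}_t$.]

LSTM Equations

$$\mathbf{f}_t = \sigma(W_f \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) \quad \text{(forget gate)}$$
$$\mathbf{i}_t = \sigma(W_i \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i) \quad \text{(input gate)}$$
$$\tilde{\mathbf{c}}_t = \tanh(W_c \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c) \quad \text{(candidate)}$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \quad \text{(cell state)}$$
$$\mathbf{o}_t = \sigma(W_o \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) \quad \text{(output gate)}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$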
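The LSTM equations can be written out as a single step function. The following is a hedged NumPy sketch for reading alongside the equations, not how a framework implements it: `lstm_step`, `init_params`, the dictionary layout keyed by `"f"`, `"i"`, `"c"`, `"o"`, and the toy dimensions are all hypothetical choices for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(d_x, d_h, seed=0):
    """One (W, b) pair per block: forget, input, candidate, output."""
    rng = np.random.default_rng(seed)
    return {k: (rng.standard_normal((d_h, d_h + d_x)) * 0.1, np.zeros(d_h))
            for k in "fico"}

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    hx = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f = sigmoid(params["f"][0] @ hx + params["f"][1])        # forget gate
    i = sigmoid(params["i"][0] @ hx + params["i"][1])        # input gate
    c_tilde = np.tanh(params["c"][0] @ hx + params["c"][1])  # candidate
    o = sigmoid(params["o"][0] @ hx + params["o"][1])        # output gate
    c = f * c_prev + i * c_tilde                             # additive cell update
    h = o * np.tanh(c)
    return h, c

d_x, d_h = 3, 4
params = init_params(d_x, d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in np.random.default_rng(1).standard_normal((6, d_x)):
    h, c = lstm_step(x_t, h, c, params)

n_params = sum(w.size + b.size for w, b in params.values())
print(h.shape, c.shape, n_params)  # (4,) (4,) 128
```

Note that only `c` is updated additively. The hidden state `h` is recomputed from `c` at every step, which is why the cell state, not the hidden state, carries long-range information.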

Why LSTM Solves Vanishing Gradients

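The cell state update is additive, and linear in $\mathbf{c}_{t-1}$ once the gate values are fixed:

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$

The gradient through the cell state is:

$$\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t) + \text{other terms}$$

The other terms arise because the gates and the candidate depend on $\mathbf{h}_{t-1}$, which in turn depends on $\mathbf{c}_{t-1}$. When $\mathbf{f}_t \approx 1$ (forget gate open) and those terms are small, the gradient flows nearly unchanged: $\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} \approx I$. This creates a gradient highway that largely prevents vanishing, although gradients can still decay when the forget gate stays well below 1.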
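A quick back-of-the-envelope calculation shows how much the highway helps. The sketch below keeps only the $\text{diag}(\mathbf{f}_t)$ term and assumes the same scalar factor at every step, which is a simplification made for illustration; the helper name `gradient_scale` and the values 0.99 and 0.5 are assumptions, not measurements.

```python
def gradient_scale(per_step_factor, steps):
    """Magnitude left after multiplying the same per-step factor `steps` times."""
    return per_step_factor ** steps

steps = 100
open_gate = gradient_scale(0.99, steps)    # forget gate almost fully open
contracting = gradient_scale(0.5, steps)   # a contracting vanilla-RNN factor

print(f"forget gate 0.99 over {steps} steps: {open_gate:.3f}")
print(f"factor 0.5 over {steps} steps:       {contracting:.3e}")
```

After 100 steps the near-open forget gate still passes roughly a third of the gradient, while the contracting factor leaves a value around $10^{-30}$, far too small to drive learning.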

Parameters per LSTM cell: $4 \times (d_h \times (d_h + d_x) + d_h)$ (four weight blocks, one each for the forget, input, and output gates plus the candidate; each block has input weights, hidden weights, and a bias).
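The formula is easy to check with concrete sizes. The helper below is a hypothetical convenience function; the `biases_per_block` argument is included because PyTorch's `nn.LSTM` stores two bias vectors per block (`bias_ih` and `bias_hh`), so its reported count is slightly higher than the textbook formula.

```python
def lstm_param_count(d_x, d_h, biases_per_block=1):
    """Parameters in one unidirectional LSTM layer with four weight blocks."""
    return 4 * (d_h * (d_h + d_x) + biases_per_block * d_h)

d_x, d_h = 128, 256
print(lstm_param_count(d_x, d_h))                      # 394240 (formula above)
print(lstm_param_count(d_x, d_h, biases_per_block=2))  # 395264 (two biases per block)
```

The hidden-to-hidden term $d_h^2$ dominates once $d_h$ is larger than $d_x$, so doubling the hidden size roughly quadruples the parameter count.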


GRU (Gated Recurrent Unit)

GRU (Cho et al., 2014) simplifies LSTM by merging the cell and hidden state and using only two gates:

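[Figure: GRU cell architecture. The inputs $\mathbf{x}_t$ and $\mathbf{h}_{t-1}$ feed the reset gate ($\mathbf{r}$) and the update gate ($\mathbf{z}$). The candidate is $\tilde{\mathbf{h}} = \tanh(W \cdot [\mathbf{r} \odot \mathbf{h}, \mathbf{x}])$, and the output at time $t$ interpolates between old state and candidate: $\mathbf{h}_t = (1 - \mathbf{z}) \odot \mathbf{h}_{t-1} + \mathbf{z} \odot \tilde{\mathbf{h}}_t$.]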

GRU Equations

$$\mathbf{r}_t = \sigma(W_r \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r) \quad \text{(reset gate)}$$
$$\mathbf{z}_t = \sigma(W_z \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z) \quad \text{(update gate)}$$
$$\tilde{\mathbf{h}}_t = \tanh(W \cdot [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(candidate)}$$
$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$
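The four GRU equations translate directly into code. This is a minimal NumPy sketch under stated assumptions: the function name `gru_step`, the argument order, and the toy sizes are hypothetical, and the candidate has no bias term because the equation above does not include one.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, w_r, b_r, w_z, b_z, w_cand):
    """One GRU time step following the equations above."""
    hx = np.concatenate([h_prev, x_t])                           # [h_{t-1}, x_t]
    r = sigmoid(w_r @ hx + b_r)                                  # reset gate
    z = sigmoid(w_z @ hx + b_z)                                  # update gate
    h_tilde = np.tanh(w_cand @ np.concatenate([r * h_prev, x_t]))  # candidate
    return (1.0 - z) * h_prev + z * h_tilde                      # interpolate

d_x, d_h = 3, 4
rng = np.random.default_rng(0)
w_r, w_z, w_cand = (rng.standard_normal((d_h, d_h + d_x)) * 0.1 for _ in range(3))
b_r, b_z = np.zeros(d_h), np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.standard_normal((6, d_x)):
    h = gru_step(x_t, h, w_r, b_r, w_z, b_z, w_cand)
print(h.shape)  # (4,)
```

The update gate plays the role of both the LSTM forget and input gates: when `z` is near 0 the old state is copied through unchanged, and when it is near 1 the state is replaced by the candidate.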

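Comparison: GRU has ~25% fewer parameters than LSTM, because it has three weight blocks (two gates plus the candidate) where LSTM has four (three gates plus the candidate), and it keeps no separate cell state. Empirically, performance is comparable on many tasks, and GRU often trains somewhat faster because there is less to compute per step.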
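The 25% figure follows from counting weight blocks. The sketch below assumes each block has the same shape, $d_h \times (d_h + d_x)$ weights plus one bias of size $d_h$; the helper name `gated_rnn_param_count` and the sizes are hypothetical.

```python
def gated_rnn_param_count(d_x, d_h, num_blocks):
    """Parameters for a gated RNN layer with `num_blocks` equally shaped blocks."""
    return num_blocks * (d_h * (d_h + d_x) + d_h)

d_x, d_h = 128, 256
lstm_count = gated_rnn_param_count(d_x, d_h, num_blocks=4)  # 3 gates + candidate
gru_count = gated_rnn_param_count(d_x, d_h, num_blocks=3)   # 2 gates + candidate
print(lstm_count, gru_count, gru_count / lstm_count)        # ratio is 0.75
```

Because every block has the same shape, the ratio is exactly 3/4 regardless of the chosen dimensions.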


Bidirectional RNN

A bidirectional RNN processes the sequence in both directions and concatenates the two hidden states at each step:

$$\overrightarrow{\mathbf{h}_t} = \text{RNN}_{\text{fwd}}(\mathbf{x}_t, \overrightarrow{\mathbf{h}_{t-1}})$$
$$\overleftarrow{\mathbf{h}_t} = \text{RNN}_{\text{bwd}}(\mathbf{x}_t, \overleftarrow{\mathbf{h}_{t+1}})$$
$$\mathbf{h}_t = [\overrightarrow{\mathbf{h}_t} \| \overleftarrow{\mathbf{h}_t}]$$

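Use case: NER, sentiment analysis, and other tasks where the full sequence is available up front. A bidirectional RNN cannot be used for autoregressive generation, because the backward pass needs tokens that have not been generated yet.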
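The two passes and the concatenation can be sketched with a plain tanh RNN standing in for $\text{RNN}_{\text{fwd}}$ and $\text{RNN}_{\text{bwd}}$. The helper names `run_rnn`, `make_params`, and `bidirectional_rnn`, along with the toy sizes, are assumptions for this example.

```python
import numpy as np

def run_rnn(xs, w_xh, w_hh, b):
    """Hidden states of a tanh RNN over xs of shape (T, d_x)."""
    h = np.zeros(w_hh.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(w_hh @ h + w_xh @ x_t + b)
        out.append(h)
    return np.stack(out)

def make_params(d_x, d_h, rng):
    return (rng.standard_normal((d_h, d_x)) * 0.5,
            rng.standard_normal((d_h, d_h)) * 0.5,
            np.zeros(d_h))

def bidirectional_rnn(xs, fwd, bwd):
    h_fwd = run_rnn(xs, *fwd)
    # Run on the reversed sequence, then flip back so row t lines up with x_t.
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)   # (T, 2 * d_h)

rng = np.random.default_rng(0)
T, d_x, d_h = 6, 3, 4
xs = rng.standard_normal((T, d_x))
fwd, bwd = make_params(d_x, d_h, rng), make_params(d_x, d_h, rng)

out = bidirectional_rnn(xs, fwd, bwd)
print(out.shape)  # (6, 8)
```

The forward and backward passes use separate weights. Each row of `out` sees the whole sequence: the first half summarizes everything up to step $t$, and the second half summarizes everything from step $t$ onward.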


Sequence-to-Sequence Architecture

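[Figure: Encoder-Decoder (Seq2Seq). The encoder reads "The cat sat" into hidden states $\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3$. The final state $\mathbf{h}_3$ is the context vector passed to the decoder, which starts from <S> and steps through states $\mathbf{s}_0, \ldots, \mathbf{s}_3$ to produce the output "Il gatto sedeva".]

  • Bottleneck problem: the entire input sequence is compressed into the fixed-size vector $\mathbf{h}_n$, so long sequences lose information; the attention mechanism addresses this
  • Teacher forcing: feed the ground-truth previous token as decoder input during training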
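Teacher forcing comes down to how decoder inputs and targets are aligned: the input at each step is the ground-truth token from the previous step, not the model's own prediction. The sketch below shows only that alignment, using the translation from the figure. The function name `teacher_forcing_pairs` and the end-of-sequence token `</S>` are assumptions; the figure only shows the start token `<S>`.

```python
def teacher_forcing_pairs(target_tokens, sos="<S>", eos="</S>"):
    """Shift the target by one position to build decoder inputs and targets."""
    decoder_inputs = [sos] + list(target_tokens)    # what the decoder is fed
    decoder_targets = list(target_tokens) + [eos]   # what it should predict
    return decoder_inputs, decoder_targets

inputs, targets = teacher_forcing_pairs(["Il", "gatto", "sedeva"])
for step, (fed, expected) in enumerate(zip(inputs, targets)):
    print(f"step {step}: input={fed!r} -> target={expected!r}")
```

At inference time the ground truth is not available, so the decoder is fed its own previous prediction instead. That difference between training and inference inputs is the train/test mismatch associated with teacher forcing.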

PyTorch Implementation

Example: LSTM Model

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.3,
            bidirectional=True
        )
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        embedded = self.embedding(x)                     # (B, T, E)
        output, (hidden, cell) = self.lstm(embedded)     # output: (B, T, 2H)
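        # hidden: (num_layers * 2, B, H); the last two entries are the top layer's
        # final forward and backward states
        hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (B, 2H)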
        return self.fc(hidden_cat)                       # (B, C)

Key Takeaways

Summary: RNNs and LSTMs

  • RNNs process sequential data with hidden state, but suffer from vanishing gradients
  • LSTM mitigates this with a cell state highway and three gates (forget, input, output)
  • GRU is a simplified LSTM with 2 gates, ~25% fewer parameters, comparable performance
  • Bidirectional RNNs process both directions for full context
  • Seq2Seq (encoder-decoder) for translation, summarization — bottleneck is the context vector
  • Teacher forcing accelerates training but causes train/test mismatch
  • LSTMs have largely been replaced by Transformers for most NLP tasks
  • LSTMs still useful for time series, edge devices, and streaming data
