🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

RNN & LSTM: Sequential Data, Gate Mechanisms, Attention — Asked at Google & Amazon

Deep Learning Premium InterviewsSequential Models⭐ Premium

Advertisement

Google & Amazon

RNN & LSTM: Sequential Data, Gate Mechanisms & Attention

Premium Interview Preparation — Sequence Modeling Mastery

🎯 The Interview Question

"Explain how RNNs process sequential data and why they suffer from vanishing gradients. Describe the LSTM cell architecture in detail, explaining each gate's mathematical function and purpose. How do LSTMs solve the long-term dependency problem, and what are their limitations compared to Transformers?"

This question is fundamental for roles involving time series, NLP, and recommendation systems at Google and Amazon.


📚 Detailed Answer

Recurrent Neural Networks: Processing Sequences

RNNs maintain a hidden state ht\mathbf{h}_t that captures information from previous time steps:

ht=tanh(Whhht1+Wxhxt+bh)\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)
yt=Whyht+by\mathbf{y}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y

The hidden state acts as a "memory" that summarizes the sequence seen so far. At each time step, the same parameters are used — parameter sharing across time.

Backpropagation Through Time (BPTT):

To compute gradients, the RNN is unrolled through time:

LWhh=t=1TLtWhh\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial \mathbf{W}_{hh}}

Each gradient depends on the chain rule through all previous time steps:

hTh1=t=2Ththt1=t=2Tdiag(1ht2)Whh\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1} = \prod_{t=2}^{T} \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \prod_{t=2}^{T} \text{diag}(1-\mathbf{h}_t^2) \mathbf{W}_{hh}

⚠️

The vanishing gradient problem is severe in RNNs. With tanh activation (derivative max = 1) and weight matrix Whh\mathbf{W}_{hh}, the gradient magnitude is bounded by WhhT1\|\mathbf{W}_{hh}\|^{T-1}. If Whh<1\|\mathbf{W}_{hh}\| < 1, gradients vanish exponentially with sequence length.

LSTM: Long Short-Term Memory

LSTMs introduce a cell state ct\mathbf{c}_t and three gates to control information flow:

Forget Gate

ft=σ(Wf[ht1,xt]+bf)\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)

Decides what to remove from the cell state. Output is a vector of values in [0,1][0, 1]:

  • 1 = keep everything
  • 0 = forget everything

Input Gate

it=σ(Wi[ht1,xt]+bi)\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)

Controls what new information to store in the cell state.

Candidate Cell State

c~t=tanh(Wc[ht1,xt]+bc)\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)

Creates candidate values that could be added to the cell state.

Cell State Update

ct=ftct1+itc~t\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t

This is the key innovation:

  1. Multiply old cell state by forget gate (what to forget)
  2. Add input gate × candidate (what to add)

Output Gate

ot=σ(Wo[ht1,xt]+bo)\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)

Controls what parts of the cell state to output.

Hidden State Update

ht=ottanh(ct)\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)

Why LSTMs Solve Vanishing Gradients

The cell state update creates a "gradient highway":

ctct1=diag(ft)+other terms\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t) + \text{other terms}

When ft1\mathbf{f}_t \approx 1 (forget gate open), the gradient flows through unchanged:

cTc1t=2Tdiag(ft)\frac{\partial \mathbf{c}_T}{\partial \mathbf{c}_1} \approx \prod_{t=2}^{T} \text{diag}(\mathbf{f}_t)

If forget gates are mostly 1, gradients can flow for hundreds of time steps without vanishing.

LSTM Variants

Peephole Connections

Adds direct connections from cell state to gates:

ft=σ(Wf[ht1,xt]+Wpfct1+bf\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{W}_{pf} \odot \mathbf{c}_{t-1} + \mathbf{b}_f

Allows gates to "peek" at the cell state for better timing.

GRU (Gated Recurrent Unit)

Simplified version combining forget and input gates:

zt=σ(Wz[ht1,xt])(update gate)\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(update gate)}
rt=σ(Wr[ht1,xt])(reset gate)\mathbf{r}_t = \sigma(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(reset gate)}
h~t=tanh(Wh[rtht1,xt])\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])
ht=(1zt)ht1+zth~t\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t

Fewer parameters than LSTM, often comparable performance.

Bidirectional RNN

Processes sequence in both directions:

ht=RNN(xt,ht1)\overrightarrow{\mathbf{h}}_t = \text{RNN}(\mathbf{x}_t, \overrightarrow{\mathbf{h}}_{t-1})
ht=RNN(xt,ht+1)\overleftarrow{\mathbf{h}}_t = \text{RNN}(\mathbf{x}_t, \overleftarrow{\mathbf{h}}_{t+1})
ht=[ht;ht]\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t]

Useful when the entire sequence is available (not streaming).

Limitations Compared to Transformers

AspectLSTMTransformer
ParallelizationSequential (slow)Parallel (fast)
Long-range dependenciesLimited by memoryDirect attention
Training speedO(T) per stepO(1) per step (parallel)
MemoryO(1) per stepO(T²) for attention
InterpretabilityHardAttention weights visible

Real-World Applications

  1. Speech Recognition: Amazon Alexa, Google Assistant use LSTM/GRU for acoustic modeling
  2. Machine Translation: Encoder-decoder LSTMs (predecessor to Transformers)
  3. Time Series: Stock prediction, anomaly detection, forecasting
  4. Music Generation: Sequential generation of notes
  5. Video Analysis: Frame-by-frame understanding

Follow-Up Questions

Q: When would you choose LSTM over Transformer? A: For streaming/online prediction with low latency requirements, or when the sequence is very long and memory is constrained. LSTMs have O(1) memory per step vs O(T²) for Transformers.

Q: How do you handle very long sequences with LSTMs? A: Use truncated BPTT (limit backpropagation history), gradient clipping, and architectural tricks like attention mechanisms over LSTM outputs.

Q: What is the relationship between LSTM gates and attention? A: Both control information flow. Gates are fixed per time step; attention weights are input-dependent and can attend to any position. Modern architectures often combine both.

Related Topics

Advertisement