RNN & LSTM: Sequential Data, Gate Mechanisms, Attention — Asked at Google & Amazon

🎯 The Interview Question

"Explain how RNNs process sequential data and why they suffer from vanishing gradients. Describe the LSTM cell architecture in detail, explaining each gate's mathematical function and purpose. How do LSTMs solve the long-term dependency problem, and what are their limitations compared to Transformers?"

This question is fundamental for roles involving time series, NLP, and recommendation systems at Google and Amazon.

📚 Detailed Answer

Recurrent Neural Networks: Processing Sequences

RNNs maintain a hidden state $\mathbf{h}_t$ that captures information from previous time steps:

\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)

\mathbf{y}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y

The hidden state acts as a "memory" that summarizes the sequence seen so far. The same parameters are reused at every time step (parameter sharing across time).
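The two equations above can be sketched as a short NumPy loop. This is a minimal illustration, not a reference implementation: the function name `rnn_forward`, the zero initial state, and the random weights and dimensions are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over xs of shape (T, input_dim)."""
    h = np.zeros(W_hh.shape[0])        # h_0 = 0 is an assumption of this sketch
    hs, ys = [], []
    for x in xs:                       # the same weights are reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        ys.append(W_hy @ h + b_y)
        hs.append(h)
    return np.stack(hs), np.stack(ys)

# Hypothetical sizes and random weights, only to exercise the function.
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
xs = rng.normal(size=(T, input_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy,
                     np.zeros(hidden_dim), np.zeros(output_dim))
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

Because every step applies the same `W_hh`, `W_xh`, and `W_hy`, the parameter count does not grow with the sequence length.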

Backpropagation Through Time (BPTT):

To compute gradients, the RNN is unrolled through time:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial \mathbf{W}_{hh}}

Each term depends, via the chain rule, on the Jacobians of all intermediate time steps:

\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1} = \prod_{t=2}^{T} \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \prod_{t=2}^{T} \text{diag}(1-\mathbf{h}_t^2) \mathbf{W}_{hh}

⚠️

The vanishing gradient problem is severe in RNNs. The tanh derivative is at most 1, so each Jacobian factor has norm at most $\|\mathbf{W}_{hh}\|$ and the gradient magnitude is bounded by $\|\mathbf{W}_{hh}\|^{T-1}$. If $\|\mathbf{W}_{hh}\| < 1$, gradients vanish exponentially with sequence length; if $\|\mathbf{W}_{hh}\| > 1$, the bound no longer rules out exploding gradients.
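The bound can be checked numerically. The sketch below is an illustration under stated assumptions: the helper `jacobian_norms`, the random recurrent matrix rescaled to spectral norm 0.9, and the random input drive standing in for $\mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h$ are all invented for the example.

```python
import numpy as np

def jacobian_norms(W_hh, T, seed=0):
    """Spectral norm of d h_t / d h_1 for t = 2..T along one random trajectory."""
    rng = np.random.default_rng(seed)
    n = W_hh.shape[0]
    h = np.zeros(n)                    # h_1 (assumed zero)
    J = np.eye(n)                      # running product of Jacobians
    norms = []
    for _ in range(2, T + 1):
        # Random vector stands in for W_xh x_t + b_h (assumption of this sketch).
        h = np.tanh(W_hh @ h + rng.normal(size=n))
        # Chain rule: the newest factor diag(1 - h_t^2) W_hh multiplies on the left.
        J = np.diag(1.0 - h**2) @ W_hh @ J
        norms.append(np.linalg.norm(J, 2))
    return norms

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))
W *= 0.9 / np.linalg.norm(W, 2)        # rescale so that ||W_hh|| = 0.9
T = 50
norms = jacobian_norms(W, T)
bound = 0.9 ** (T - 1)
print(f"||dh_T/dh_1|| = {norms[-1]:.3e}, bound = {bound:.3e}")
```

In practice the measured norm tends to sit well below the bound, because the factor `1 - h**2` is strictly less than 1 whenever the hidden units are not exactly zero.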

LSTM: Long Short-Term Memory

LSTMs introduce a cell state $\mathbf{c}_t$ and three gates to control information flow:

Forget Gate

\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)

Decides what to remove from the cell state. Because the sigmoid is applied elementwise, the output is a vector of values in $(0, 1)$:

Close to 1 = keep that component of the cell state
Close to 0 = forget that component

Input Gate

\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)

Controls what new information to store in the cell state.

Candidate Cell State

\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)

Creates candidate values that could be added to the cell state.

Cell State Update

\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t

This is the key innovation:

Multiply old cell state by forget gate (what to forget)
Add input gate × candidate (what to add)

Output Gate

\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)

Controls what parts of the cell state to output.

Hidden State Update

\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
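Putting the six equations together, one LSTM step might be sketched as follows. The function name `lstm_step`, the parameter dictionary layout, and the random weights are assumptions of this example; real libraries usually fuse the four weight matrices into one for speed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p[k] = (W_k, b_k) for k in 'f', 'i', 'c', 'o'."""
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f_t = sigmoid(p["f"][0] @ z + p["f"][1])       # forget gate
    i_t = sigmoid(p["i"][0] @ z + p["i"][1])       # input gate
    c_tilde = np.tanh(p["c"][0] @ z + p["c"][1])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde             # cell state update
    o_t = sigmoid(p["o"][0] @ z + p["o"][1])       # output gate
    h_t = o_t * np.tanh(c_t)                       # hidden state update
    return h_t, c_t

# Hypothetical sizes and random weights, only to exercise the function.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
params = {k: (rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim)),
              np.zeros(hidden_dim)) for k in "fico"}
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

Note that the cell state `c_t` is updated additively, while the hidden state `h_t` is a gated, squashed view of it.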

Why LSTMs Solve Vanishing Gradients

The cell state update creates a "gradient highway":

\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t) + \text{other terms}

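When $\mathbf{f}_t \approx 1$ (forget gate open) and the other terms, which come from the gates' dependence on $\mathbf{c}_{t-1}$ through $\mathbf{h}_{t-1}$, are neglected, the gradient passes along the cell state almost unchanged: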

\frac{\partial \mathbf{c}_T}{\partial \mathbf{c}_1} \approx \prod_{t=2}^{T} \text{diag}(\mathbf{f}_t)

If forget gates are mostly 1, gradients can flow for hundreds of time steps without vanishing.
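A quick piece of arithmetic makes the claim concrete. The sketch assumes, purely for illustration, that every forget gate component takes the same constant value at every step, so the product of $\text{diag}(\mathbf{f}_t)$ factors reduces to a power; the helper `highway_gain` is a hypothetical name.

```python
def highway_gain(forget_value, steps):
    """Gradient scale along the cell state if every forget gate equals forget_value."""
    return forget_value ** steps

steps = 200
gains = {f: highway_gain(f, steps) for f in (0.5, 0.9, 0.99, 0.999)}
for f, g in gains.items():
    print(f"f = {f}: gain after {steps} steps = {g:.3e}")
```

With a forget gate of 0.99, roughly 13% of the gradient survives 200 steps, whereas 0.9 leaves less than one part in a billion. This is why the gradient highway depends on forget gates staying close to 1.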

LSTM Variants

Peephole Connections

Adds direct connections from cell state to gates:

\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{W}_{pf} \odot \mathbf{c}_{t-1} + \mathbf{b}_f)

Allows gates to "peek" at the cell state for better timing.

GRU (Gated Recurrent Unit)

A simplified variant that merges the forget and input gates into a single update gate and drops the separate cell state:

\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(update gate)}

\mathbf{r}_t = \sigma(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(reset gate)}

\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])

\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t

Fewer parameters than LSTM, often comparable performance.
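The GRU equations and the parameter comparison can be sketched as below. The names `gru_step` and `n_params` and the sizes (hidden 128, input 64) are assumptions for the example. The counts follow the equations exactly as written in this article, meaning four weight blocks with biases for the LSTM and three bias-free blocks for the GRU; library implementations may count differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    hx = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ hx)                            # update gate
    r_t = sigmoid(W_r @ hx)                            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    return (1.0 - z_t) * h_prev + z_t * h_tilde        # interpolate old and new

def n_params(n_blocks, hidden_dim, input_dim, with_bias):
    """Parameter count for n_blocks matrices of shape (hidden, hidden + input)."""
    per_block = hidden_dim * (hidden_dim + input_dim)
    if with_bias:
        per_block += hidden_dim
    return n_blocks * per_block

hidden_dim, input_dim = 128, 64                        # hypothetical sizes
lstm_params = n_params(4, hidden_dim, input_dim, with_bias=True)
gru_params = n_params(3, hidden_dim, input_dim, with_bias=False)
print(lstm_params, gru_params)  # 98816 73728

rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
                 for _ in range(3))
h = gru_step(rng.normal(size=input_dim), np.zeros(hidden_dim), W_z, W_r, W_h)
```

Under these assumptions the GRU uses about three quarters of the LSTM's parameters for the same hidden size.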

Bidirectional RNN

Processes sequence in both directions:

\overrightarrow{\mathbf{h}}_t = \text{RNN}(\mathbf{x}_t, \overrightarrow{\mathbf{h}}_{t-1})

\overleftarrow{\mathbf{h}}_t = \text{RNN}(\mathbf{x}_t, \overleftarrow{\mathbf{h}}_{t+1})

\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t]

Useful when the entire sequence is available (not streaming).
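A minimal sketch of the three equations above, using a plain tanh RNN for each direction. The helpers `run_rnn` and `bidirectional`, the zero initial states, and the random weights are assumptions of this example.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh):
    """Plain tanh RNN over xs of shape (T, input_dim); returns states of shape (T, hidden)."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
        out.append(h)
    return np.stack(out)

def bidirectional(xs, fwd, bwd):
    h_fwd = run_rnn(xs, *fwd)
    # Run over the reversed sequence, then flip back so row t aligns with x_t.
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)      # [h_fwd_t ; h_bwd_t]

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 6, 3, 4                     # hypothetical sizes
fwd = (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
       rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)))
bwd = (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
       rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)))
xs = rng.normal(size=(T, input_dim))
H = bidirectional(xs, fwd, bwd)
print(H.shape)  # (6, 8)
```

The backward pass cannot start until the last input has arrived, which is why this layout does not suit streaming inference.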

Limitations Compared to Transformers

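Aspect	LSTM	Transformer
Parallelization	Sequential across time steps (slow to train)	Parallel across positions (fast to train)
Long-range dependencies	Compressed into a fixed-size state	Direct attention between any two positions
Sequential operations per sequence	O(T)	O(1)
Memory in sequence length	O(1) recurrent state at inference	O(T²) for the attention matrix
Interpretability	Hard	Attention weights can be inspected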
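The O(T²) entry comes from the attention weight matrix, which holds one score per pair of positions. The sketch below is a deliberately stripped-down single-head self-attention with Q = K = V = X and no learned projections, which is an assumption made to keep the example short; the function name `self_attention` is hypothetical.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over X of shape (T, d), with Q = K = V = X."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                   # (T, T): every position vs every other
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
T, d = 10, 16                                       # hypothetical sizes
out, weights = self_attention(rng.normal(size=(T, d)))
print(out.shape, weights.shape)  # (10, 16) (10, 10)
```

All rows of `weights` are computed in one matrix product, which is the source of the parallelism, and the (T, T) shape is the source of the quadratic memory cost.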

Real-World Applications

Speech Recognition: Amazon Alexa, Google Assistant use LSTM/GRU for acoustic modeling
Machine Translation: Encoder-decoder LSTMs (predecessor to Transformers)
Time Series: Stock prediction, anomaly detection, forecasting
Music Generation: Sequential generation of notes
Video Analysis: Frame-by-frame understanding

Follow-Up Questions

Q: When would you choose LSTM over Transformer? A: For streaming/online prediction with low latency requirements, or when the sequence is very long and memory is constrained. At inference an LSTM carries a fixed-size state, so its memory is O(1) in sequence length, whereas full self-attention over T positions needs O(T²).

Q: How do you handle very long sequences with LSTMs? A: Use truncated BPTT (limit backpropagation history), gradient clipping, and architectural tricks like attention mechanisms over LSTM outputs.

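Q: What is the relationship between LSTM gates and attention? A: Both control information flow with learned, input-dependent weights. Gates act elementwise on the recurrent state at the current time step, so past information is reachable only through that state; attention weights are computed across positions and can read from any position directly. Some architectures combine both, for example attention over LSTM outputs.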
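For the long-sequence answer above, gradient clipping and truncated BPTT windows can be sketched in a few lines. Clipping by global norm is one common variant, chosen here as an assumption; the helper names `clip_by_global_norm` and `tbptt_chunks` are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))    # leave small gradients untouched
    return [g * scale for g in grads], total

def tbptt_chunks(T, k):
    """(start, end) windows of length <= k covering range(T) for truncated BPTT."""
    return [(s, min(s + k, T)) for s in range(0, T, k)]

grads = [np.array([3.0]), np.array([4.0])]          # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, tbptt_chunks(10, 4))  # 5.0 [(0, 4), (4, 8), (8, 10)]
```

In a training loop, the hidden state would typically be carried forward from one window to the next while gradients are only propagated within each window.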