🎯 The Interview Question
"Explain the Transformer architecture in detail, starting from the self-attention mechanism. What is multi-head attention and why is it necessary? How does positional encoding work, and why can't Transformers process sequences without it? Discuss the computational complexity and how it limits application to long sequences."
This is THE question for modern AI interviews. Transformers power GPT, BERT, PaLM, and most current state-of-the-art language models.
📚 Detailed Answer
Self-Attention: The Core Innovation
Self-attention allows each position in a sequence to attend to all other positions, computing a weighted sum of values based on relevance.
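Given input $X \in \mathbb{R}^{n \times d}$ (sequence length $n$, dimension $d$), we compute:
Query, Key, Value projections:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
where $W^Q, W^K \in \mathbb{R}^{d \times d_k}$ and $W^V \in \mathbb{R}^{d \times d_v}$.
Scaled Dot-Product Attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
The scaling factor $\frac{1}{\sqrt{d_k}}$ prevents dot products from growing too large, which would push softmax into regions with extremely small gradients.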
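The projection and attention steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation: the sizes, the random weights, and the helper names `softmax` and `self_attention` are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; the result is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # (n, d_k), (n, d_k), (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) query-key similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # (n, d_v), (n, n)

rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8                         # illustrative sizes only
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) / np.sqrt(d) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)              # (5, 8) (5, 5)
```

Row `i` of `weights` is the distribution over positions that token `i` attends to, and row `i` of `out` is the corresponding weighted sum of value vectors.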
Why Q, K, V?
- Query: What am I looking for?
- Key: What do I contain?
- Value: What do I provide?
The attention score $q_i \cdot k_j$ measures how well query $i$ matches key $j$.
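💡 Self-attention has $O(n^2 d)$ time complexity and $O(n^2)$ memory for the attention matrix. This quadratic scaling in sequence length $n$ is the main limitation for long sequences. Recent innovations target it from different angles: Flash Attention cuts the memory cost while staying exact, and linear attention variants approximate attention in time linear in $n$.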
Multi-Head Attention
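Instead of a single attention function, we use $h$ parallel attention heads:
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$
where each head attends with different learned projections:
$$\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
Typically $d_k = d_v = d/h$, so total computation is similar to single-head attention with full dimension $d$.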
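A minimal NumPy sketch of multi-head self-attention follows, assuming $d_k = d_v = d/h$ as in the text. Stacking the per-head projections into single $(d, d)$ matrices and then splitting the result into heads is a common implementation trick rather than something the text prescribes. The function name and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d). W_q, W_k, W_v, W_o: (d, d). h must divide d."""
    n, d = X.shape
    d_k = d // h

    def split(M):
        # (n, d) -> (h, n, d_k): one slice of the features per head.
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores, axis=-1) @ V               # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # (n, d)
    return concat @ W_o                                # output projection

rng = np.random.default_rng(0)
n, d, h = 6, 32, 4                                     # illustrative sizes only
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(Y.shape)                                         # (6, 32)
```

Each head computes its own $(n, n)$ attention pattern over a $d/h$-dimensional slice, which is why the total cost stays close to that of one full-width head.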
Why Multi-Head?
- Different representation subspaces: Each head can learn different types of relationships (syntactic, semantic, positional)
- Parallel processing: Multiple attention patterns computed simultaneously
- Richer representations: Concatenation combines different perspectives
Positional Encoding
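Self-attention on its own has no notion of token order. Strictly speaking it is permutation-equivariant: shuffling the input tokens just shuffles the outputs in the same way. Positional encoding injects the missing position information.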
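The order-blindness of plain self-attention can be checked numerically. The sketch below (random weights and sizes are assumptions for illustration) shuffles the input rows and confirms that the output rows are shuffled in exactly the same way, so nothing in the result depends on where a token sits.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)                       # a random reordering of tokens
out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# Attention on the shuffled input equals the shuffled attention output.
print(np.allclose(out_perm, out[perm]))         # True
```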
Sinusoidal Encoding (Original Transformer)
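$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
where $pos$ is position and $i$ is the dimension index.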
Properties:
- Each dimension represents a different frequency
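- For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$
- Defined for every position, so it can be evaluated at sequence lengths unseen in training (model quality at those lengths is not guaranteed)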
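The sinusoidal encoding and its linearity property can be sketched as follows. The function name, `max_len`, and `d_model` are illustrative assumptions. The check at the end uses the angle-addition identities to confirm that shifting by a fixed offset `k` is a rotation of each sin/cos pair.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix; d_model is assumed even."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))      # one frequency per pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

d_model, k = 16, 7
pe = sinusoidal_positional_encoding(max_len=50, d_model=d_model)

# PE[pos + k] is a fixed linear map (a rotation per pair) of PE[pos].
omega = 1.0 / (10000 ** (2 * np.arange(d_model // 2) / d_model))
sin_k, cos_k = np.sin(k * omega), np.cos(k * omega)
shifted_sin = pe[:-k, 0::2] * cos_k + pe[:-k, 1::2] * sin_k
shifted_cos = pe[:-k, 1::2] * cos_k - pe[:-k, 0::2] * sin_k
print(np.allclose(shifted_sin, pe[k:, 0::2]),
      np.allclose(shifted_cos, pe[k:, 1::2]))        # True True
```

The encoding is added to the token embeddings before the first layer. Because the rotation depends only on `k`, a model can in principle learn to attend by relative offset.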
Learned Positional Encoding
$$PE_{pos} = E_{pos}$$
where $E_{pos} \in \mathbb{R}^d$ is a learned embedding for position $pos$.
Used in BERT and GPT; simpler but doesn't generalize beyond training length.
Rotary Position Embedding (RoPE)
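Encodes relative positions directly into the attention computation by rotating queries and keys through position-dependent angles:
$$(R_i q_i)^\top (R_j k_j) = q_i^\top R_{j-i} k_j$$
where $R_i$ is a block-diagonal rotation matrix for position $i$, so the attention score depends only on the offset $j - i$.
Used in LLaMA, PaLM; naturally handles relative positions.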
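A minimal sketch of the rotary idea, assuming the common formulation in which consecutive feature pairs are rotated by an angle proportional to the position. The function name `rope`, the `base` value, and the pairing of adjacent dimensions are assumptions for illustration, and real implementations differ in layout details. The check shows that the query-key score depends only on the relative offset.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (n, d) by position-dependent angles; d even."""
    n, d = x.shape
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # (d/2,) per-pair frequencies
    angles = positions[:, None] * theta[None, :]     # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Same relative offset (3) at two very different absolute positions.
s_near = rope(q, np.array([5])) @ rope(k, np.array([2])).T
s_far = rope(q, np.array([105])) @ rope(k, np.array([102])).T
print(np.allclose(s_near, s_far))                    # True
```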
Complete Transformer Architecture
Input Embedding + Positional Encoding
↓
┌───────────────────────┐
│ Encoder Block × N │
│ ┌─────────────────┐ │
│ │ Multi-Head Self │ │
│ │ Attention + Add │ │
│ │ & Layer Norm │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Feed-Forward + │ │
│ │ Add & Layer Norm│ │
│ └─────────────────┘ │
└───────────────────────┘
↓
┌───────────────────────┐
│ Decoder Block × N │
│ ┌─────────────────┐ │
│ │ Masked Multi- │ │
│ │ Head Self-Attn │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Cross-Attention │ │
│ │ (Encoder-Decoder)│ │
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Feed-Forward │ │
│ └─────────────────┘ │
└───────────────────────┘
↓
Output Projection + Softmax
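One encoder block from the diagram can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: a single attention head, post-layer-norm ordering as drawn in the diagram, a ReLU feed-forward network, no learned layer-norm scale or shift, and no dropout. All parameter names in `p` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's features (learned scale/shift omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(X, W_q, W_k, W_v, W_o):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V @ W_o

def feed_forward(X, W1, b1, W2, b2):
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2    # two-layer ReLU MLP

def encoder_block(X, p):
    # Sub-layer 1: self-attention, residual add, layer norm.
    X = layer_norm(X + attention(X, p["W_q"], p["W_k"], p["W_v"], p["W_o"]))
    # Sub-layer 2: position-wise feed-forward, residual add, layer norm.
    return layer_norm(X + feed_forward(X, p["W1"], p["b1"], p["W2"], p["b2"]))

rng = np.random.default_rng(0)
n, d, d_ff = 6, 16, 64                               # illustrative sizes only
p = {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in ("W_q", "W_k", "W_v", "W_o")}
p.update(W1=rng.normal(size=(d, d_ff)) / np.sqrt(d), b1=np.zeros(d_ff),
         W2=rng.normal(size=(d_ff, d)) / np.sqrt(d_ff), b2=np.zeros(d))
X = rng.normal(size=(n, d))
Y = encoder_block(X, p)
print(Y.shape)                                       # (6, 16)
```

Stacking `N` such blocks gives the encoder. A decoder block adds a causal mask to its self-attention and a cross-attention sub-layer whose keys and values come from the encoder output.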
Computational Complexity Analysis
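| Operation | Time Complexity | Memory (activations) |
|---|---|---|
| Self-Attention | $O(n^2 d)$ | $O(n^2 + nd)$ |
| Multi-Head Attention | $O(n^2 d + n d^2)$ | $O(h n^2 + nd)$ |
| Feed-Forward | $O(n d^2)$ | $O(nd)$ |
| Total per layer | $O(n^2 d + n d^2)$ | $O(h n^2 + nd)$ |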
For $n > d$ (typical in long-context settings): the $O(n^2 d)$ attention term is dominant. For $n < d$: the $O(n d^2)$ feed-forward term is dominant.
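To make the crossover concrete, the rough FLOP counter below tallies the main matrix multiplications in one layer. The constants are assumptions (a multiply-add counted as 2 FLOPs, $d_{ff} = 4d$, softmax and layer norm ignored), so the numbers are order-of-magnitude estimates rather than measurements.

```python
def layer_flops(n, d, d_ff=None):
    """Approximate forward-pass FLOPs for one Transformer layer."""
    d_ff = 4 * d if d_ff is None else d_ff
    return {
        "projections": 4 * 2 * n * d * d,      # Q, K, V, and output projections
        "attention": 2 * 2 * n * n * d,        # Q K^T and weights @ V
        "feed_forward": 2 * 2 * n * d * d_ff,  # two linear layers
    }

for n in (512, 4096, 32768):
    f = layer_flops(n, d=1024)
    print(n, f["attention"] / f["feed_forward"])
# 512 0.125
# 4096 1.0
# 32768 8.0
```

Under these assumptions the attention matmuls only overtake the feed-forward network once $n$ reaches about $4d$, which is why the quadratic term matters mostly in long-context settings.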
Optimizations for Long Sequences
- Sparse Attention: Attend only to local windows plus a small set of global tokens (Longformer, BigBird)
- Linear Attention: Approximate softmax with kernel functions
- Flash Attention: IO-aware exact attention with $O(n^2 d)$ compute but $O(n)$ extra memory
- Grouped Query Attention: Share keys/values across groups of query heads, which shrinks the KV cache at inference time rather than the quadratic compute
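The core numerical trick behind exact memory-efficient attention is an online softmax: keys and values are visited in blocks while a running maximum and a running denominator are maintained, so the full $(n, n)$ score matrix is never stored. The NumPy sketch below illustrates only that idea. It is not the Flash Attention kernel, which also tiles the queries and is organized around GPU memory hierarchy. The function names and block size are assumptions.

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])               # materializes (n, n)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Exact attention that processes K/V in blocks of `block` rows."""
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))                 # unnormalized output
    row_max = np.full((n, 1), -np.inf)               # running max score per query
    row_sum = np.zeros((n, 1))                       # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d_k)                  # only (n, block) at a time
        new_max = np.maximum(row_max, S.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)            # rescale earlier accumulators
        P = np.exp(S - new_max)
        row_sum = row_sum * scale + P.sum(axis=-1, keepdims=True)
        out = out * scale + P @ Vb
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 10, 8))                # three (10, 8) matrices
print(np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V)))  # True
```

The result matches the naive computation, which is the sense in which this family of methods is exact rather than approximate.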
Follow-Up Questions
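Q: Why is $\sqrt{d_k}$ used in scaled attention? A: Without scaling, dot products grow with $d_k$, pushing softmax into regions with tiny gradients. If the components of $q$ and $k$ are independent with zero mean and unit variance, $q \cdot k$ has variance $d_k$, so dividing by $\sqrt{d_k}$ keeps the variance at 1 regardless of $d_k$.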
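The variance argument can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization of real activations. The function name and sample count are illustrative.

```python
import numpy as np

def dot_product_variance(d_k, samples=2000, seed=0):
    """Return (variance of q.k, variance of q.k / sqrt(d_k)) for random q, k."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(samples, d_k))
    k = rng.normal(size=(samples, d_k))
    dots = np.einsum("ij,ij->i", q, k)               # one dot product per row
    return dots.var(), (dots / np.sqrt(d_k)).var()

for d_k in (16, 64, 256, 1024):
    raw, scaled = dot_product_variance(d_k)
    # raw grows roughly like d_k; scaled stays close to 1.
    print(f"d_k={d_k:5d}  raw={raw:8.1f}  scaled={scaled:.2f}")
```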
Q: What is the difference between encoder-only, decoder-only, and encoder-decoder Transformers? A: Encoder-only (BERT): bidirectional, for understanding. Decoder-only (GPT): autoregressive, for generation. Encoder-decoder (T5): for seq2seq tasks.
Q: How does masked attention work in decoders? A: Causal masking prevents position $i$ from attending to positions $j > i$, ensuring autoregressive generation. It is implemented by setting the scores of future positions to $-\infty$ before softmax, so their attention weights become exactly zero.
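A minimal NumPy sketch of causal masking, with the function name and sizes as illustrative assumptions. The final check changes the last key and value and confirms that the outputs at earlier positions do not move.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention where position i sees only positions j <= i."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i
    scores = np.where(future, -np.inf, scores)           # mask before softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                             # exp(-inf) = 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 8))                     # three (5, 8) matrices
out, weights = causal_attention(Q, K, V)
print(np.round(weights, 2))                              # lower-triangular rows

# Perturbing the last token leaves all earlier outputs unchanged.
K2, V2 = K.copy(), V.copy()
K2[-1] += 10.0
V2[-1] += 10.0
out2, _ = causal_attention(Q, K2, V2)
print(np.allclose(out[:-1], out2[:-1]))                  # True
```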