Transformers: Self-Attention, Multi-Head Attention, Positional Encoding — Asked at OpenAI & Google

🎯 The Interview Question

"Explain the Transformer architecture in detail, starting from the self-attention mechanism. What is multi-head attention and why is it necessary? How does positional encoding work, and why can't Transformers process sequences without it? Discuss the computational complexity and how it limits application to long sequences."

This is THE question for modern AI interviews. Transformers power GPT, BERT, PaLM, and most current state-of-the-art language models.

📚 Detailed Answer

Self-Attention: The Core Innovation

Self-attention allows each position in a sequence to attend to all other positions, computing a weighted sum of values based on relevance.

Given input $\mathbf{X} \in \mathbb{R}^{n \times d}$ (sequence length $n$, dimension $d$), we compute:

Query, Key, Value projections:

$$\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V$$

where $\mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d \times d_k}$ and $\mathbf{W}^V \in \mathbb{R}^{d \times d_v}$.

Scaled Dot-Product Attention:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with $d_k$. Large scores would push softmax into saturated regions where gradients are extremely small.
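The effect of the scaling can be checked numerically. The sketch below is illustrative only: it assumes query and key components are independent standard normal samples, and the helper name `dot_product_variance` is hypothetical.

```python
import numpy as np

def dot_product_variance(d_k, num_samples=5_000, seed=0):
    """Empirical variance of q.k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    # Assumption: components are i.i.d. with zero mean and unit variance.
    q = rng.normal(size=(num_samples, d_k))
    k = rng.normal(size=(num_samples, d_k))
    dots = (q * k).sum(axis=-1)
    return dots.var(), (dots / np.sqrt(d_k)).var()

for d_k in (16, 64, 256):
    raw, scaled = dot_product_variance(d_k)
    print(f"d_k = {d_k:>3}: var(q.k) = {raw:7.1f}, var(q.k / sqrt(d_k)) = {scaled:.2f}")
```

Under these assumptions the unscaled variance tracks $d_k$, while the scaled variance stays close to 1.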

Why Q, K, V?

Query: What am I looking for?
Key: What do I contain?
Value: What do I provide?

The attention score $q_i^T k_j$ measures how well query $i$ matches key $j$.
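The projections and the scaled dot-product formula can be sketched in a few lines of NumPy. This is a minimal single-head illustration with random weights and arbitrary toy sizes, not a production implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (n, d_k), V: (n, d_v) -> output (n, d_v), weights (n, n)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # entry (i, j) is q_i . k_j / sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d, d_k, d_v = 5, 16, 8, 8             # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, m)) for m in (d_k, d_k, d_v))
out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, weights.shape)
```

Row $i$ of `weights` is the distribution that position $i$ places over all positions, and row $i$ of `out` is the corresponding weighted sum of value vectors.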

💡

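Self-attention has $O(n^2 d)$ time complexity and $O(n^2)$ memory for the attention matrix. This quadratic scaling in sequence length is the main limitation for long sequences. Flash Attention avoids materializing the $O(n^2)$ matrix while still computing exact attention, and linear attention variants approximate attention to reduce the compute as well.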
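A back-of-the-envelope calculation shows how quickly the $n^2$ term grows. The numbers below assume a naive implementation that stores the full score matrix for every head in fp16 (2 bytes per element), for one layer and one sequence. The head count of 32 is an arbitrary illustrative choice.

```python
def attention_matrix_bytes(n, num_heads=1, bytes_per_element=2):
    """Bytes needed to store one n x n score matrix per head (fp16 by default)."""
    return n * n * num_heads * bytes_per_element

for n in (1_024, 8_192, 65_536):
    gib = attention_matrix_bytes(n, num_heads=32) / 2**30
    print(f"n = {n:>6}: {gib:8.2f} GiB")
```

Under these assumptions the matrix takes 4 GiB at $n = 8192$ and 256 GiB at $n = 65536$: an 8x longer sequence costs 64x the memory.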

Multi-Head Attention

Instead of a single attention function, we use $h$ parallel attention heads:

$$\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$$

where each head attends with different learned projections:

$$\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}_i^Q, \mathbf{X}\mathbf{W}_i^K, \mathbf{X}\mathbf{W}_i^V)$$

Typically $d_k = d_v = d/h$, so total computation is similar to single-head attention with full dimension $d$. The output projection $\mathbf{W}^O \in \mathbb{R}^{h d_v \times d}$ mixes the concatenated heads.
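The following NumPy sketch shows one common way to compute all heads at once. It assumes $d_k = d_v = d/h$ and stacks the per-head matrices $\mathbf{W}_i^Q$ side by side into a single $d \times d$ matrix (and likewise for keys and values). Weights are random and sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d); W_Q, W_K, W_V, W_O: (d, d); h must divide d."""
    n, d = X.shape
    d_k = d // h

    def split(M):
        # Split the feature axis into h heads: (n, d) -> (h, n, d_k).
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores, axis=-1) @ V               # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # Concat(head_1, ..., head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d, h = 6, 32, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(Y.shape)
```

Column block $i$ of `W_Q` plays the role of $\mathbf{W}_i^Q$, so the batched computation matches running $h$ separate attention functions and concatenating their outputs.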

Why Multi-Head?

Different representation subspaces: Each head can learn different types of relationships (syntactic, semantic, positional)
Parallel processing: Multiple attention patterns computed simultaneously
Richer representations: Concatenation combines different perspectives

Positional Encoding

Self-attention is permutation-equivariant: permuting the input tokens simply permutes the outputs in the same way, so the mechanism by itself has no notion of token order. Positional encoding injects position information:

Sinusoidal Encoding (Original Transformer)

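$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the token position and $i \in \{0, \ldots, d/2 - 1\}$ indexes a sine/cosine pair, so $2i$ and $2i+1$ are the embedding dimensions.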

Properties:

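Each sine/cosine pair uses a different frequency, from short wavelengths in the first dimensions to very long ones in the last
For a fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$ (a rotation within each sine/cosine pair)
Defined for any position, so it can be evaluated at sequence lengths not seen in training, although models often extrapolate poorly in practice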
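The encoding and its linearity property can be verified directly. In the sketch below, `shift_matrix` is a hypothetical helper that builds the block-diagonal rotation for an offset $k$; it assumes an even dimension $d$.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d):
    """Return a (num_positions, d) matrix of encodings; d is assumed even."""
    pos = np.arange(num_positions)[:, None]          # (num_positions, 1)
    i = np.arange(d // 2)[None, :]                   # (1, d/2) pair index
    angles = pos / (10000 ** (2 * i / d))            # (num_positions, d/2)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions 2i+1
    return pe

def shift_matrix(k, d):
    """Block-diagonal M_k such that PE[pos + k] = M_k @ PE[pos] for every pos."""
    M = np.zeros((d, d))
    for i in range(d // 2):
        w = 1.0 / (10000 ** (2 * i / d))             # frequency of pair i
        c, s = np.cos(k * w), np.sin(k * w)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return M

pe = sinusoidal_encoding(50, 8)
M = shift_matrix(3, 8)
print(np.allclose(pe[3:], pe[:-3] @ M.T))            # same M works at every position
```

The matrix depends only on the offset $k$, not on $pos$, which follows from the angle-addition identities for sine and cosine.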

Learned Positional Encoding

$$\mathbf{h}_i = \mathbf{x}_i + \mathbf{p}_i$$

where $\mathbf{p}_i \in \mathbb{R}^d$ is a learned embedding for position $i$.

Used in BERT and GPT; simpler, but positions beyond the maximum training length have no trained embedding, so it does not generalize beyond that length.

Rotary Position Embedding (RoPE)

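Rotates each query and key by an angle proportional to its position, so that relative positions enter the attention computation directly:

$$\text{RoPE}(q, m) = \begin{pmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \end{pmatrix} \otimes \begin{pmatrix} \cos(m\theta_0) \\ \cos(m\theta_0) \\ \cos(m\theta_1) \\ \cos(m\theta_1) \end{pmatrix} + \begin{pmatrix} -q_1 \\ q_0 \\ -q_3 \\ q_2 \end{pmatrix} \otimes \begin{pmatrix} \sin(m\theta_0) \\ \sin(m\theta_0) \\ \sin(m\theta_1) \\ \sin(m\theta_1) \end{pmatrix}$$

where $m$ is the token position, $\otimes$ is element-wise multiplication, and $\theta_i$ is the rotation frequency of the $i$-th pair of dimensions (shown here for $d = 4$).

Used in LLaMA and PaLM. Because queries and keys are both rotated, the score between a query at position $m$ and a key at position $n$ depends only on the offset $m - n$.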
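A small NumPy sketch of the rotation for a single vector follows. The frequency schedule $\theta_i = 10000^{-2i/d}$ is an assumption borrowed from the sinusoidal encoding, since the text above does not specify $\theta_i$. The function name `rope` is illustrative.

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate vector x (even length d) to position m, pairing dims (0,1), (2,3), ..."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # assumed schedule for theta_i
    cos = np.repeat(np.cos(m * theta), 2)            # (cos m*theta_0, cos m*theta_0, ...)
    sin = np.repeat(np.sin(m * theta), 2)
    rotated = np.empty_like(x)
    rotated[0::2] = -x[1::2]                         # (-x1, x0, -x3, x2, ...)
    rotated[1::2] = x[0::2]
    return x * cos + rotated * sin

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
score_a = rope(q, 5) @ rope(k, 2)        # positions 5 and 2, offset 3
score_b = rope(q, 103) @ rope(k, 100)    # positions 103 and 100, same offset
print(np.isclose(score_a, score_b))
```

The two scores agree because only the offset between the query and key positions survives in the dot product. The rotation also leaves vector norms unchanged.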

Complete Transformer Architecture

Architecture Diagram

Input Embedding + Positional Encoding
        ↓
┌───────────────────────┐
│  Encoder Block × N    │
│  ┌─────────────────┐  │
│  │ Multi-Head Self │  │
│  │ Attention + Add │  │
│  │ & Layer Norm    │  │
│  └─────────────────┘  │
│  ┌─────────────────┐  │
│  │ Feed-Forward +  │  │
│  │ Add & Layer Norm│  │
│  └─────────────────┘  │
└───────────────────────┘
        ↓
┌───────────────────────┐
│  Decoder Block × N    │
│  ┌─────────────────┐  │
│  │ Masked Multi-   │  │
│  │ Head Self-Attn  │  │
│  └─────────────────┘  │
│  ┌─────────────────┐  │
│  │ Cross-Attention │  │
│  │(Encoder-Decoder)│  │
│  └─────────────────┘  │
│  ┌─────────────────┐  │
│  │ Feed-Forward    │  │
│  └─────────────────┘  │
└───────────────────────┘
        ↓
Output Projection + Softmax
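One encoder block from the diagram can be sketched as follows. This is a simplified illustration under several assumptions: single-head attention, post-layer-norm ordering as drawn above, a ReLU feed-forward network, no learned gain or bias in layer norm, and random weights. The parameter dictionary `p` and its key names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's features; learned gain and bias omitted for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, W_Q, W_K, W_V, W_O):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V @ W_O

def feed_forward(x, W_1, b_1, W_2, b_2):
    return np.maximum(0.0, x @ W_1 + b_1) @ W_2 + b_2   # two-layer ReLU MLP

def encoder_block(x, p):
    # Sub-layer 1: self-attention, residual add, layer norm.
    x = layer_norm(x + self_attention(x, p["W_Q"], p["W_K"], p["W_V"], p["W_O"]))
    # Sub-layer 2: position-wise feed-forward, residual add, layer norm.
    return layer_norm(x + feed_forward(x, p["W_1"], p["b_1"], p["W_2"], p["b_2"]))

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 32
shapes = {"W_Q": (d, d), "W_K": (d, d), "W_V": (d, d), "W_O": (d, d),
          "W_1": (d, d_ff), "b_1": (d_ff,), "W_2": (d_ff, d), "b_2": (d,)}
p = {name: 0.1 * rng.normal(size=shape) for name, shape in shapes.items()}
y = encoder_block(rng.normal(size=(n, d)), p)
print(y.shape)
```

The block maps an $n \times d$ input to an $n \times d$ output, which is what allows $N$ such blocks to be stacked.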

Computational Complexity Analysis

Operation	Time Complexity	Memory
Self-Attention	$O(n^2 d)$	$O(n^2)$
Multi-Head Attention	$O(n^2 d)$	$O(n^2)$
Feed-Forward	$O(n d^2)$	$O(nd)$
Total per layer	$O(n^2 d + n d^2)$	$O(n^2 + nd)$

For $n \ll d$ (sequences short relative to the model width), the feed-forward term $O(n d^2)$ dominates. For $n \gg d$ (long sequences), the attention term $O(n^2 d)$ dominates.
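A rough FLOP count makes the comparison between the two terms concrete. The sketch below counts only the two attention matrix products ($\mathbf{Q}\mathbf{K}^T$ and weights times $\mathbf{V}$) and the two feed-forward layers. It ignores the Q/K/V/output projections and assumes a hidden width of $d_{ff} = 4d$, a common but not universal choice.

```python
def attention_flops(n, d):
    # QK^T and weights @ V: two (n x n x d) products, 2 FLOPs per multiply-add.
    return 2 * (2 * n * n * d)

def feed_forward_flops(n, d, d_ff=None):
    # Two linear layers d -> d_ff -> d; d_ff = 4d is assumed when not given.
    d_ff = 4 * d if d_ff is None else d_ff
    return 2 * (2 * n * d * d_ff)

d = 1024
for n in (512, 4096, 32768):
    ratio = attention_flops(n, d) / feed_forward_flops(n, d)
    print(f"n = {n:>5}: attention / feed-forward FLOPs = {ratio:.3f}")
```

With these assumptions the ratio is $n / (4d)$, so the two costs are equal at $n = 4d$ and attention takes over for longer sequences.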

Optimizations for Long Sequences

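Sparse Attention: Attend only to a subset of positions, such as local windows plus a few global tokens (Longformer, BigBird)
Linear Attention: Replace softmax with kernel feature maps so that cost grows linearly in $n$, at the price of approximating standard attention
Flash Attention: IO-aware exact attention with $O(n^2 d)$ compute but $O(n)$ memory, since the full attention matrix is never materialized
Grouped Query Attention: Share keys/values across groups of query heads, which shrinks the key/value cache during generation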
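As an illustration of the sparse-attention idea, the sketch below builds a sliding-window mask. The function name and the symmetric window definition are assumptions for this example. Longformer and BigBird add further patterns (global and random tokens) that are not shown.

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean (n, n) mask: position i may attend to j when |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(n=8, window=2)
allowed = int(mask.sum())   # grows like n * (2 * window + 1), not n * n
print(allowed, "of", mask.size, "query-key pairs are computed")
```

The dense boolean mask here is only for illustration and is itself $O(n^2)$. Practical implementations compute the banded scores directly without ever forming the full matrix.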

Follow-Up Questions

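Q: Why is $\sqrt{d_k}$ used in scaled attention? A: If the components of $q$ and $k$ are independent with zero mean and unit variance, $q^T k$ has variance $d_k$, so unscaled scores grow with $d_k$ and push softmax into regions with tiny gradients. Dividing by $\sqrt{d_k}$ keeps the variance at 1 regardless of $d_k$.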

Q: What is the difference between encoder-only, decoder-only, and encoder-decoder Transformers? A: Encoder-only (BERT): bidirectional, for understanding. Decoder-only (GPT): autoregressive, for generation. Encoder-decoder (T5): for seq2seq tasks.

Q: How does masked attention work in decoders? A: Causal masking prevents position $i$ from attending to positions $j > i$, ensuring autoregressive generation. It is implemented by setting the attention scores for future positions to $-\infty$ before softmax, so their weights become exactly zero.
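The masking step can be sketched in NumPy as follows. The function name is illustrative and the scores are random placeholders standing in for $\mathbf{Q}\mathbf{K}^T / \sqrt{d_k}$.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to (n, n) scores, then softmax over each row."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)     # True where j > i
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                                      # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(4, 4)))
print(np.round(weights, 3))
```

The resulting matrix is lower triangular: the first position can attend only to itself, and each later position attends to itself and everything before it.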