🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Transformers: Self-Attention, Multi-Head Attention, Positional Encoding — Asked at OpenAI & Google

Deep Learning Premium InterviewsTransformer Architecture⭐ Premium

Advertisement

OpenAI & Google

Transformers: Self-Attention, Multi-Head Attention & Positional Encoding

Premium Interview Preparation — Modern Architecture Mastery

🎯 The Interview Question

"Explain the Transformer architecture in detail, starting from the self-attention mechanism. What is multi-head attention and why is it necessary? How does positional encoding work, and why can't Transformers process sequences without it? Discuss the computational complexity and how it limits application to long sequences."

This is THE question for modern AI interviews. Transformers power GPT, BERT, PaLM, and virtually all state-of-the-art models.


📚 Detailed Answer

Self-Attention: The Core Innovation

Self-attention allows each position in a sequence to attend to all other positions, computing a weighted sum of values based on relevance.

Given input XRn×d\mathbf{X} \in \mathbb{R}^{n \times d} (sequence length nn, dimension dd), we compute:

Query, Key, Value projections:

Q=XWQ,K=XWK,V=XWV\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V

where WQ,WKRd×dk\mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d \times d_k} and WVRd×dv\mathbf{W}^V \in \mathbb{R}^{d \times d_v}.

Scaled Dot-Product Attention:

Attention(Q,K,V)=softmax(QKTdk)V\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}

The scaling factor dk\sqrt{d_k} prevents dot products from growing too large, which would push softmax into regions with extremely small gradients.

Why Q, K, V?

  • Query: What am I looking for?
  • Key: What do I contain?
  • Value: What do I provide?

The attention score qiTkjq_i^T k_j measures how well query ii matches key jj.

💡

Self-attention has O(n2d)O(n^2 d) complexity and O(n2)O(n^2) memory. This quadratic scaling is the main limitation for long sequences. Recent innovations like Flash Attention and linear attention variants aim to reduce this.

Multi-Head Attention

Instead of a single attention function, we use hh parallel attention heads:

MultiHead(Q,K,V)=Concat(head1,,headh)WO\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O

where each head attends with different learned projections:

headi=Attention(XWiQ,XWiK,XWiV)\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}_i^Q, \mathbf{X}\mathbf{W}_i^K, \mathbf{X}\mathbf{W}_i^V)

Typically dk=dv=d/hd_k = d_v = d/h, so total computation is similar to single-head attention.

Why Multi-Head?

  1. Different representation subspaces: Each head can learn different types of relationships (syntactic, semantic, positional)
  2. Parallel processing: Multiple attention patterns computed simultaneously
  3. Richer representations: Concatenation combines different perspectives

Positional Encoding

Self-attention is permutation-invariant — it doesn't know the order of tokens. Positional encoding injects position information:

Sinusoidal Encoding (Original Transformer)

PE(pos,2i)=sin(pos100002i/d)PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
PE(pos,2i+1)=cos(pos100002i/d)PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)

where pospos is position and ii is dimension.

Properties:

  • Each dimension represents a different frequency
  • PEpos+kPE_{pos+k} can be expressed as a linear function of PEposPE_{pos}
  • Generalizes to unseen sequence lengths

Learned Positional Encoding

hi=xi+pi\mathbf{h}_i = \mathbf{x}_i + \mathbf{p}_i

where piRd\mathbf{p}_i \in \mathbb{R}^d is a learned embedding for position ii.

Used in BERT and GPT; simpler but doesn't generalize beyond training length.

Rotary Position Embedding (RoPE)

Encodes relative positions directly into the attention computation:

RoPE(q,m)=(q0q1q2q3)(cos(mθ0)cos(mθ0)cos(mθ1)cos(mθ1))+(q1q0q3q2)(sin(mθ0)sin(mθ0)sin(mθ1)sin(mθ1))\text{RoPE}(q, m) = \begin{pmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \end{pmatrix} \otimes \begin{pmatrix} \cos(m\theta_0) \\ \cos(m\theta_0) \\ \cos(m\theta_1) \\ \cos(m\theta_1) \end{pmatrix} + \begin{pmatrix} -q_1 \\ q_0 \\ -q_3 \\ q_2 \end{pmatrix} \otimes \begin{pmatrix} \sin(m\theta_0) \\ \sin(m\theta_0) \\ \sin(m\theta_1) \\ \sin(m\theta_1) \end{pmatrix}

Used in LLaMA, PaLM; naturally handles relative positions.

Complete Transformer Architecture

Architecture Diagram
Input Embedding + Positional Encoding
        ↓
┌───────────────────────┐
│  Encoder Block × N    │
│  ┌─────────────────┐  │
│  │ Multi-Head Self │  │
│  │ Attention + Add │  │
│  │ & Layer Norm    │  │
│  └─────────────────┘  │
│  ┌─────────────────┐  │
│  │ Feed-Forward +  │  │
│  │ Add & Layer Norm│  │
│  └─────────────────┘  │
└───────────────────────┘
        ↓
┌───────────────────────┐
│  Decoder Block × N    │
│  ┌─────────────────┐  │
│  │ Masked Multi-   │  │
│  │ Head Self-Attn  │  │
│  └─────────────────┘  │
│  ┌─────────────────┐  │
│  │ Cross-Attention │  │
│  │ (Encoder-Decoder)│ │
│  └─────────────────┘  │
│  ┌─────────────────┐  │
│  │ Feed-Forward    │  │
│  └─────────────────┘  │
└───────────────────────┘
        ↓
Output Projection + Softmax

Computational Complexity Analysis

OperationTime ComplexityMemory
Self-AttentionO(n2d)O(n^2 d)O(n2)O(n^2)
Multi-Head AttentionO(n2d)O(n^2 d)O(n2)O(n^2)
Feed-ForwardO(nd2)O(n d^2)O(nd)O(nd)
Total per layerO(n2d+nd2)O(n^2 d + n d^2)O(n2+nd)O(n^2 + nd)

For ndn \ll d (typical): attention is dominant. For ndn \gg d: feed-forward is dominant.

Optimizations for Long Sequences

  1. Sparse Attention: Only attend to local windows (Longformer, BigBird)
  2. Linear Attention: Approximate softmax with kernel functions
  3. Flash Attention: IO-aware exact attention with O(n2)O(n^2) compute but O(n)O(n) memory
  4. Grouped Query Attention: Share keys/values across query heads

Follow-Up Questions

Q: Why is dk\sqrt{d_k} used in scaled attention? A: Without scaling, dot products grow with dkd_k, pushing softmax into regions with tiny gradients. The scaling keeps variance constant regardless of dkd_k.

Q: What is the difference between encoder-only, decoder-only, and encoder-decoder Transformers? A: Encoder-only (BERT): bidirectional, for understanding. Decoder-only (GPT): autoregressive, for generation. Encoder-decoder (T5): for seq2seq tasks.

Q: How does mask attention work in decoders? A: Causal masking prevents position ii from attending to positions j>ij > i, ensuring autoregressive generation. Implemented by setting future positions to -\infty before softmax.

Related Topics

Advertisement