Transformers — The Architecture That Changed Everything
Master the Transformer architecture that powers GPT, BERT, and all modern language models.
- Self-attention mechanism — process all tokens simultaneously
- Parallel processing — much faster than sequential RNNs
- Foundation of LLMs — powers ChatGPT, Claude, and more
Attention is all you need.
Transformers — Attention Is All You Need
Transformers (Vaswani et al., 2017) replaced RNNs as the dominant architecture for sequence processing. They need only O(1) sequential operations per layer (vs O(n) for RNNs, where n is the sequence length), enabling massive parallelism on GPUs.
Self-Attention
The core mechanism: each token attends to every other token to compute a weighted representation.
How self-attention works: This diagram shows the complete flow of scaled dot-product attention. Each input token ("The", "cat", "sat", "on") is projected into three vectors: Query (Q), "what am I looking for?"; Key (K), "what do I contain?"; and Value (V), "what information do I pass on?". The attention matrix on the right shows how much each token attends to every other token; darker cells mean higher attention weights. For example, "sat" attends strongly to "cat" (0.7) because they have a subject-verb relationship. The formula Attention(Q,K,V) = softmax(QK^T / √d_k) V computes this: the dot product of Q and K gives raw scores; dividing by √d_k keeps those scores from growing with the key dimension, which would otherwise saturate the softmax and leave very small gradients; softmax converts each row of scores into weights that sum to 1; and these weights multiply V to produce context-aware representations. The result: each token's output encodes information from all tokens in the sequence, weighted by relevance.
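The formula above can be traced by hand on a tiny example. The following is a minimal sketch in plain Python (standard library only), not an optimized implementation; the function names and the toy Q, K, V values are made up for illustration, and real models compute Q, K, and V with learned projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K: one d_k-dim vector per token; V: one d_v-dim vector per token."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Raw scores: dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Output: attention-weighted average of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy inputs (hypothetical values): 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all tokens, so every row sums to 1.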
Multi-Head Attention
Multiple attention heads capture different types of relationships simultaneously:
Why multiple heads matter: Instead of computing one attention function, the Transformer splits the input into h=8 (or 12, or 96) parallel "heads", each with its own Q/K/V projections. Each head learns to focus on different relationship types — Head 1 might capture syntax (subject-verb), Head 2 captures semantics (adjective-noun), Head 3 captures coreference (pronoun-antecedent), and Head 4 captures positional patterns (adjacent words). The Concat block merges all head outputs, and a final linear projection W_O combines them into the output. This is like having multiple "experts" simultaneously analyzing the same sentence from different perspectives. The formula MultiHead = Concat(head₁,...,head_h) · W_O shows the process: compute h separate attention outputs, concatenate them, and project back to d_model dimensions. Each head operates on d_k = d_model/h dimensions, keeping total computation constant.
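The relation d_k = d_model/h is easy to check with a little arithmetic. The sketch below uses hypothetical helper names (`multi_head_shapes`, `split_heads`, `concat_heads`) and ignores bias terms; it only counts projection weights and shows the split and Concat steps on a plain list.

```python
def multi_head_shapes(d_model, n_heads):
    """Return per-head width and projection parameter counts (biases ignored)."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_k = d_model // n_heads
    qkv_params = n_heads * 3 * d_model * d_k   # Q, K, V projections for every head
    w_o_params = (n_heads * d_k) * d_model     # output projection W_O after Concat
    return {"d_k": d_k, "qkv_params": qkv_params, "w_o_params": w_o_params,
            "total": qkv_params + w_o_params}

def split_heads(vector, n_heads):
    """Slice one d_model-wide vector into n_heads chunks of width d_k."""
    d_k = len(vector) // n_heads
    return [vector[h * d_k:(h + 1) * d_k] for h in range(n_heads)]

def concat_heads(heads):
    """Inverse of split_heads: join the per-head chunks back together."""
    return [value for head in heads for value in head]

for h in (1, 8, 16):
    print(h, multi_head_shapes(512, h))
```

Under these assumptions the total stays at 4 · d_model² weights for any valid h, which is the sense in which adding heads does not add projection cost.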
Positional Encoding
Since self-attention is permutation-equivariant (reordering the input tokens simply reorders the outputs), we must inject position information:
How positional encoding works: Since self-attention treats the input as a set (not a sequence), it has no notion of word order. Positional encoding injects position information by adding a unique vector to each token's embedding. The sinusoidal method (left) uses fixed sine/cosine functions at different frequencies — position 0 gets one pattern, position 1 gets a slightly shifted pattern, and so on. Each dimension oscillates at a different frequency, creating a unique "fingerprint" for each position. The key property: PE(pos+k) can be expressed as a linear transformation of PE(pos), allowing the model to learn relative positions. The learned method (right) simply trains an embedding table for positions — more flexible but can't extrapolate beyond the maximum training length. Modern models like LLaMA use RoPE (Rotary Position Embeddings) which encode relative position through rotation matrices, combining the benefits of both approaches.
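A short sketch of the sinusoidal method, assuming the formulation from Vaswani et al. (sine on even dimensions, cosine on odd dimensions, base 10000) and an even `d_model`. The helper `shift_encoding` is a hypothetical name; it illustrates the linear-transformation property by rotating each (sin, cos) pair, without recomputing from the position.

```python
import math

def sinusoidal_encoding(pos, d_model, base=10000.0):
    """PE(pos, 2i) = sin(pos / base^(2i/d_model)), PE(pos, 2i+1) = cos(same angle)."""
    pe = []
    for i in range(0, d_model, 2):          # i runs over the even dimension indices
        angle = pos / (base ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

def shift_encoding(pe, k, base=10000.0):
    """Map PE(pos) to PE(pos + k) using only k: a rotation of each (sin, cos) pair."""
    d_model = len(pe)
    shifted = []
    for i in range(0, d_model, 2):
        w = 1.0 / (base ** (i / d_model))   # angular frequency of this pair
        s, c = pe[i], pe[i + 1]
        shifted.append(s * math.cos(k * w) + c * math.sin(k * w))
        shifted.append(c * math.cos(k * w) - s * math.sin(k * w))
    return shifted

direct = sinusoidal_encoding(8, 8)
via_shift = shift_encoding(sinusoidal_encoding(5, 8), 3)
print(max(abs(a - b) for a, b in zip(direct, via_shift)))  # difference is at floating-point noise level
```

The rotation depends on the offset k but not on the starting position, which is why relative offsets are easy for the model to pick up.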
Transformer Block
How a Transformer block processes information: This is the fundamental building block of all modern LLMs. The input x first passes through Multi-Head Self-Attention (blue), where each token gathers information from all other tokens. The red dashed line on the left is a skip connection that adds the original input to the attention output; this prevents information loss and stabilizes training. Layer Norm (yellow) then normalizes the result. The same pattern repeats: the Feed-Forward Network (green) processes each position independently through two linear layers with GELU activation, followed by another skip connection and layer norm. The FFN acts as a "thinking" layer: after attention gathers context, the FFN makes decisions based on that context. This two-stage pattern (attend → think) is repeated N times (12 for BERT-base, 96 for GPT-3; the layer count of GPT-4 has not been published), with each layer building increasingly abstract representations.
Transformer Block Equations
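Encoder block (applied $N$ times; BERT-base uses $N = 12$):
$$x' = \mathrm{LayerNorm}\big(x + \mathrm{MultiHead}(x, x, x)\big)$$
$$y = \mathrm{LayerNorm}\big(x' + \mathrm{FFN}(x')\big)$$
FFN (position-wise, applied independently to each position):
$$\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1 x + b_1) + b_2$$
where $W_1 \in \mathbb{R}^{d_{ff} \times d_{model}}$, $W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$, typically $d_{ff} = 4\,d_{model}$.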
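To make the position-wise FFN and the residual-plus-LayerNorm step concrete, here is a minimal plain-Python sketch for a single position. It assumes the post-norm arrangement and GELU activation described in this section, omits LayerNorm's learnable scale and shift, and uses made-up toy weights with d_model = 4 and d_ff = 16; all function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (near) unit variance; no learned gain/bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def ffn(x, W1, b1, W2, b2):
    """W1 has shape (d_ff, d_model), W2 has shape (d_model, d_ff)."""
    hidden = [gelu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

# Toy, deterministic weights (illustrative only, not trained values).
d_model, d_ff = 4, 16
W1 = [[0.01 * (r - c) for c in range(d_model)] for r in range(d_ff)]
b1 = [0.0] * d_ff
W2 = [[0.01 * (r + c) for c in range(d_ff)] for r in range(d_model)]
b2 = [0.0] * d_model

x = [1.0, 2.0, 3.0, 4.0]
y = add_and_norm(x, ffn(x, W1, b1, W2, b2))
print([round(v, 4) for v in y])
```

Because the FFN is applied to each position separately, running this function once per token is equivalent to the batched matrix form.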
Decoder block adds: Masked multi-head attention (prevents attending to future tokens) and cross-attention (queries from decoder, keys/values from encoder).
Encoder vs Decoder
Encoder vs Decoder, the critical difference: The Encoder (BERT, left) uses bidirectional self-attention: every token can attend to every other token, both left and right. This makes it ideal for understanding tasks where you need the full context (classification, named entity recognition, question answering). The Decoder (GPT, right) uses masked self-attention: each token can only attend to previous tokens (and itself), preventing "cheating" by looking at future words. This autoregressive property makes it suitable for text generation, where you predict one token at a time. In the original encoder-decoder Transformer, the decoder also includes cross-attention layers (not shown) that let it attend to the encoder's output, enabling sequence-to-sequence tasks like translation; decoder-only models such as GPT omit them. The depth difference is striking: BERT-base uses 12 layers, while GPT-3 uses 96 layers, a gap that mostly reflects model scale rather than an inherent difference between understanding and generation.
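The masking rule can be sketched in a few lines of plain Python. The helper names are hypothetical; the mask marks blocked positions with `True`, which is also how a boolean `attn_mask` is interpreted by `torch.nn.MultiheadAttention`. Scores are set to all zeros so the effect of the mask alone is visible.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when token i must NOT attend to token j (j is in the future)."""
    return [[j > i for j in range(n)] for i in range(n)]

def masked_softmax(scores, blocked):
    """Softmax where blocked positions get -inf, so their weight is exactly 0."""
    masked = [-math.inf if b else s for s, b in zip(scores, blocked)]
    peak = max(masked)  # finite: the diagonal (self) is never blocked
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
mask = causal_mask(n)
uniform_scores = [[0.0] * n for _ in range(n)]
decoder_weights = [masked_softmax(row, m) for row, m in zip(uniform_scores, mask)]
for row in decoder_weights:
    print([round(w, 3) for w in row])
```

An encoder corresponds to an all-`False` mask: every row would then spread its weight over all n tokens.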
PyTorch Implementation
Example: Transformer Block
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Post-norm Transformer encoder block: self-attention, then a position-wise FFN."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # batch_first=True means inputs are shaped (batch, seq_len, d_model)
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection, then layer norm
        attn_out, _ = self.attention(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_out))
        # FFN with residual connection, then layer norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x
Key Takeaways
Summary: Transformers
- Self-attention enables parallel processing of sequences (O(1) sequential ops)
- Multi-head attention captures different relationship types simultaneously
- Positional encoding adds sequence order information (sinusoidal, learned, or RoPE)
- Encoder for understanding (BERT), Decoder for generation (GPT)
- Residual connections + Layer Norm stabilize deep transformers
- Complexity: O(n²·d) per layer — quadratic in sequence length
- Transformers power virtually all mainstream LLMs (GPT, Claude, Gemini, LLaMA)
- Scaling laws: Performance scales predictably with parameters, data, compute
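The quadratic term in the complexity bullet is worth seeing with numbers. This rough sketch counts only the two large matrix products in attention (QK^T and weights · V) and ignores projections, softmax, and the FFN; `attention_cost` is a hypothetical helper, and d = 512 is an illustrative width.

```python
def attention_cost(n, d):
    """Approximate multiply-adds for QK^T (n*n*d) plus weights @ V (n*n*d)."""
    return 2 * n * n * d

d = 512
for n in (1024, 2048, 4096):
    ratio = attention_cost(n, d) / attention_cost(1024, d)
    print(f"n={n:5d}  multiply-adds={attention_cost(n, d):,}  vs n=1024: {ratio:.0f}x")
```

Doubling the sequence length quadruples this cost, while doubling d only doubles it.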
What to Learn Next
-> BERT: Learn about bidirectional language understanding.
-> GPT Architecture: Explore how GPT generates text.
-> Attention Deep Dive: Master attention mechanisms in detail.
-> Vision Transformers: Apply Transformers to computer vision.
-> Transfer Learning: Leverage pre-trained Transformer models.
-> Training Deep Networks: Master optimizers and regularization for Transformers.