
Transformers — Attention Is All You Need Complete Guide


Transformers — The Architecture That Changed Everything

Master the Transformer architecture that powers GPT, BERT, and all modern language models.

  • Self-attention mechanism — process all tokens simultaneously
  • Parallel processing — much faster than sequential RNNs
  • Foundation of LLMs — powers ChatGPT, Claude, and more

Attention is all you need.

Transformers — Attention Is All You Need

Transformers (Vaswani et al., 2017) replaced RNNs as the dominant architecture for sequence processing. They achieve O(1) sequential operations per layer (vs O(n) for RNNs), enabling massive parallelism on GPUs.


Self-Attention

The core mechanism: each token attends to every other token to compute a weighted representation.

[Figure: Scaled Dot-Product Self-Attention. Input tokens ("The", "cat", "sat", "on") pass through linear projections Q = X·W_Q, K = X·W_K, V = X·W_V. Intuition: Q = "What am I looking for?", K = "What do I contain?", V = "What info do I pass on?". A heatmap over the same four tokens shows the attention weights softmax(QK^T / √d_k); darker = higher attention weight.]

Attention(Q, K, V) = softmax(QK^T / √d_k) V

  • Scaling by √d_k prevents dot products from growing too large (softmax saturation)
  • Complexity: O(n²·d), quadratic in sequence length
  • d_k = d_model / h (per head)

How self-attention works: This diagram shows the complete flow of scaled dot-product attention. Each input token ("The", "cat", "sat", "on") is projected into three vectors: Query (Q), "what am I looking for?"; Key (K), "what do I contain?"; and Value (V), "what information do I pass on?". The attention matrix shows how much each token attends to every other token; darker cells mean higher attention weights. For example, "sat" attends strongly to "cat" (weight 0.7 in the figure), which fits their subject-verb relationship. The formula Attention(Q,K,V) = softmax(QK^T / √d_k) V computes this in four steps: the dot product of Q and K gives raw scores; dividing by √d_k keeps those scores from growing with d_k and pushing softmax into saturation, where gradients become very small; softmax converts each row of scores into weights that sum to 1; and these weights multiply V to produce context-aware representations. The result: each token's output is a relevance-weighted mix of the value vectors of all tokens in the sequence, itself included.
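The four steps above can be traced in a few lines of plain Python. This is only a minimal sketch: it uses lists instead of tensors so it runs without any dependencies, the function names `softmax` and `attention` are illustrative, and the toy Q, K, V values are arbitrary rather than taken from the figure.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Q and K have shape (n, d_k); V has shape (n, d_v).
    Returns (output, weights) with shapes (n, d_v) and (n, n).
    """
    scale = math.sqrt(len(K[0]))
    # Raw scores: every query dotted with every key, divided by sqrt(d_k).
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in K] for q in Q]
    # Each row of weights is a probability distribution over the keys.
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value vectors.
    d_v = len(V[0])
    output = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
              for w_row in weights]
    return output, weights

# Toy example: 3 tokens, d_k = d_v = 2 (arbitrary values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Because every row of `weights` sums to 1, each output row stays inside the range spanned by the value vectors. The nested loops over queries and keys are also where the O(n²·d) cost comes from.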


Multi-Head Attention

Multiple attention heads capture different types of relationships simultaneously:

[Figure: Multi-Head Attention. The input X feeds four parallel heads, labeled Head 1: Syntax, Head 2: Semantics, Head 3: Coref, and Head 4: Position, each computing Attn(Q_i, K_i, V_i). The head outputs pass through Concat(heads) and a linear layer W_O to produce the output.]

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
where head_i = Attention(Q W_Qi, K W_Ki, V W_Vi)

Why Multiple Heads?

  • BERT-base: 12 heads, d_k = 64
  • GPT-3 (175B): 96 heads
  • Each head can specialize in different relationship types
  • Total params per layer: 4·d_model² (Q, K, V projections + output projection)

Why multiple heads matter: Instead of computing one attention function, the Transformer runs h parallel "heads" (h = 8 in the original paper, 12 in BERT-base, 96 in GPT-3), each with its own Q/K/V projections. Each head can learn to focus on a different relationship type. As an illustration, Head 1 might capture syntax (subject-verb), Head 2 semantics (adjective-noun), Head 3 coreference (pronoun-antecedent), and Head 4 positional patterns (adjacent words); these roles emerge during training rather than being assigned. The Concat block merges all head outputs, and a final linear projection W_O combines them into the output. This is like having multiple "experts" simultaneously analyzing the same sentence from different perspectives. The formula MultiHead = Concat(head₁,...,head_h) · W_O shows the process: compute h separate attention outputs, concatenate them, and project back to d_model dimensions. Each head operates on d_k = d_model/h dimensions, so the total computation stays roughly the same as for a single head of width d_model.
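The split, attend, concatenate, project sequence can be sketched in plain Python as follows. The sketch assumes a common implementation layout, in which each role (Q, K, V) has one d_model × d_model matrix and head i uses a d_k-wide column slice of the projected result; this is equivalent to giving each head its own d_model × d_k matrix. The helper names, random weights, and toy sizes are illustrative and do not come from the lesson.

```python
import math
import random

def matmul(A, B):
    """Multiply an (n, m) matrix by an (m, p) matrix, both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns only the output rows."""
    scale = math.sqrt(len(K[0]))
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in K] for q in Q]
    return matmul([softmax(r) for r in scores], V)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    d_model = len(X[0])
    assert d_model % n_heads == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // n_heads
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_k, (h + 1) * d_k)
        # Head h only sees its own d_k-wide slice of the projected Q, K, V.
        heads.append(attention([r[cols] for r in Q],
                               [r[cols] for r in K],
                               [r[cols] for r in V]))
    # Concatenate head outputs along the feature axis, then mix them with W_o.
    concat = [sum((head[i] for head in heads), []) for i in range(len(X))]
    return matmul(concat, W_o)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model, n_heads = 4, 8, 2
X = random_matrix(n_tokens, d_model, rng)
W_q, W_k, W_v, W_o = (random_matrix(d_model, d_model, rng) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
n_weights = 4 * d_model * d_model  # Q, K, V and output projections
```

The output has the same shape as the input (n_tokens × d_model), and the weight count of 4·d_model² does not depend on how many heads the width is divided into.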


Positional Encoding

Since self-attention on its own is permutation-equivariant (shuffling the input tokens just shuffles the outputs in the same way), we must inject position information:

[Figure: Positional Encoding, comparing three schemes.]

Sinusoidal (Original)

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

  • Each dimension has a different frequency
  • Allows model to learn relative positions
  • Extrapolates to unseen sequence lengths
  • PE(pos+k) is linear transform of PE(pos)

Learned (BERT, GPT)

E(pos) = Embedding(pos, d)

  • Each position gets a learnable vector
  • More flexible than sinusoidal
  • Cannot extrapolate beyond training length
  • Adds d_model × max_len parameters

RoPE (Rotary PE)

  • Used in LLaMA, GPT-NeoX
  • Encodes relative position via rotation matrices

How positional encoding works: Since self-attention treats the input as a set (not a sequence), it has no notion of word order. Positional encoding injects position information by adding a unique vector to each token's embedding. The sinusoidal method uses fixed sine/cosine functions at different frequencies: position 0 gets one pattern, position 1 gets a slightly shifted pattern, and so on. Each pair of dimensions oscillates at a different frequency, creating a unique "fingerprint" for each position. The key property: for any fixed offset k, PE(pos+k) can be expressed as a linear transformation of PE(pos) that does not depend on pos, which makes relative positions easy for the model to learn. The learned method simply trains an embedding table for positions; it is more flexible but can't extrapolate beyond the maximum training length. Modern models like LLaMA use RoPE (Rotary Position Embeddings), which rotates the query and key vectors by position-dependent angles so that attention scores depend on relative position, without adding a learned position table.
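The linear-transformation property can be checked numerically. The sketch below computes the sinusoidal encoding in plain Python and then reaches PE(pos+k) from PE(pos) using only the angle-addition identities, that is, a 2×2 rotation per sine/cosine pair that depends on k but not on pos. The function names `sinusoidal_pe` and `shift` are illustrative, and d_model is assumed to be even.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal encoding for one position: even dims use sin, odd dims use cos."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) without knowing pos."""
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))  # frequency of this sin/cos pair
        s, c = pe[2 * i], pe[2 * i + 1]
        # sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
        out.append(s * math.cos(w * k) + c * math.sin(w * k))
        # cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
        out.append(c * math.cos(w * k) - s * math.sin(w * k))
    return out

d_model = 8
pe_5 = sinusoidal_pe(5, d_model)
pe_8_direct = sinusoidal_pe(8, d_model)
pe_8_shifted = shift(pe_5, 3, d_model)
max_diff = max(abs(a - b) for a, b in zip(pe_8_direct, pe_8_shifted))
```

Here `max_diff` is at the level of floating-point rounding, which confirms that shifting by k = 3 is a fixed linear map on the encoding.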


Transformer Block

[Figure: Transformer Encoder Block. Input x ∈ ℝ^(n×d) → Multi-Head Attention → + (skip connection) → Layer Norm → Feed-Forward, FFN(x) = W₂·GELU(W₁·x + b₁) + b₂ → + (skip connection) → Output x' ∈ ℝ^(n×d).]

How a Transformer block processes information: This is the fundamental building block of all modern LLMs. The input x first passes through Multi-Head Self-Attention, where each token gathers information from all other tokens. A skip connection adds the original input to the attention output, which prevents information loss and stabilizes training. Layer Norm then normalizes the result. The same pattern repeats: the Feed-Forward Network processes each position independently through two linear layers with GELU activation, followed by another skip connection and layer norm. The FFN acts as a "thinking" layer: after attention gathers context, the FFN transforms each token's representation based on that context. This two-stage pattern (attend → think) is repeated N times (12 for BERT-base, 96 for GPT-3), with each layer building increasingly abstract representations.

Transformer Block Equations

Encoder block (applied N times; BERT-base uses N = 12):

$$\mathbf{z} = \text{LayerNorm}(\mathbf{x} + \text{MultiHead}(\mathbf{x}, \mathbf{x}, \mathbf{x}))$$
$$\mathbf{x}' = \text{LayerNorm}(\mathbf{z} + \text{FFN}(\mathbf{z}))$$
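Both equations share the pattern LayerNorm(input + sublayer(input)). A minimal sketch of that pattern for a single token vector is given below, in plain Python. It leaves out LayerNorm's learnable scale and shift, and `double` is a hypothetical stand-in for attention or the FFN, so it illustrates the data flow rather than a complete implementation.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and (almost exactly) unit variance.

    The learnable scale and shift are omitted (equivalent to scale=1, shift=0).
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_sublayer(x, sublayer):
    """Compute LayerNorm(x + sublayer(x)) for a single token vector."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def double(v):
    """Hypothetical sublayer: any function that preserves the vector width works here."""
    return [2.0 * t for t in v]

z = post_ln_sublayer([1.0, 2.0, 3.0, 4.0], double)
z_mean = sum(z) / len(z)
z_var = sum((t - z_mean) ** 2 for t in z) / len(z)
```

Whatever the sublayer does to the scale of its input, the normalized result has mean close to 0 and variance close to 1, which is one reason this arrangement helps keep deep stacks trainable.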

FFN (position-wise, applied independently to each position):

$$\text{FFN}(\mathbf{x}) = W_2 \cdot \text{GELU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

where $W_1 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$ (shapes chosen so that $W_1 \mathbf{x}$ is defined for a column vector $\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$), typically $d_{ff} = 4 \cdot d_{\text{model}}$.
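As a rough worked example of what these shapes imply, the sketch below counts the weight-matrix entries in one encoder block. It counts weights only (biases and LayerNorm parameters are ignored), uses the 4·d_model² attention figure quoted earlier, and takes d_model = 768, the width implied by BERT-base's 12 heads of d_k = 64. The function name `block_weight_count` is illustrative.

```python
def block_weight_count(d_model, d_ff=None):
    """Count weight-matrix entries in one encoder block (biases and LayerNorm ignored)."""
    if d_ff is None:
        d_ff = 4 * d_model  # the typical ratio
    attention = 4 * d_model * d_model  # W_Q, W_K, W_V and W_O
    ffn = 2 * d_model * d_ff           # W_1 and W_2
    return attention, ffn

# d_model = 768 corresponds to 12 heads of width d_k = 64.
attention, ffn = block_weight_count(768)
total_per_block = attention + ffn
```

With d_ff = 4·d_model the FFN holds twice as many weights as the attention sub-layer, so about two thirds of a block's weights sit in the FFN.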

Decoder block adds: Masked multi-head attention (prevents attending to future tokens) and cross-attention (queries from decoder, keys/values from encoder).


Encoder vs Decoder

[Figure: Encoder vs Decoder Architectures.]

Encoder (BERT)

  • Self-Attention (bidirectional), then FFN
  • × N layers (12 for BERT-base)
  • Every token attends to ALL tokens
  • Good for: classification, NER, QA

Decoder (GPT)

  • Masked Self-Attention (causal), then FFN
  • × N layers (96 for GPT-3 175B)
  • Token attends only to PREVIOUS tokens (and itself)
  • Good for: text generation, completion

Encoder vs Decoder, the critical difference: The Encoder (BERT) uses bidirectional self-attention: every token can attend to every other token, both left and right. This makes it well suited to understanding tasks where you need the full context (classification, named entity recognition, question answering). The Decoder (GPT) uses masked self-attention: each token can only attend to previous tokens (and itself), preventing "cheating" by looking at future words. This autoregressive property makes it suitable for text generation, where you predict one token at a time. In the original encoder-decoder Transformer, the decoder also includes cross-attention layers (not shown) that let it attend to the encoder's output, enabling sequence-to-sequence tasks like translation; decoder-only models such as GPT have no encoder and omit them. The depth difference is striking: BERT-base uses 12 layers, while GPT-3 (175B) uses 96. That gap reflects model scale rather than any inherent difference in difficulty between generation and understanding.
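The effect of the causal mask is easiest to see on the attention weights themselves. In this plain-Python sketch all raw scores are set to zero, so any difference between rows comes from the mask alone. The names `causal_mask` and `masked_softmax` are illustrative, and the mask convention used here (True = allowed) is a choice made for readability.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to disallowed positions."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        # Disallowed positions get a score of -inf, so exp() maps them to 0.
        masked = [s if ok else -math.inf for s, ok in zip(score_row, mask_row)]
        m = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform raw scores for 4 tokens.
n = 4
scores = [[0.0] * n for _ in range(n)]
weights = masked_softmax(scores, causal_mask(n))
```

The first token can only attend to itself, the second splits its weight over two positions, and so on; an encoder would use no mask and give every row four non-zero weights. Note that the boolean `attn_mask` of PyTorch's `nn.MultiheadAttention` uses the opposite convention: True marks positions that may not be attended to.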


PyTorch Implementation

Example: Transformer Block

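import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-LN encoder block: z = LayerNorm(x + Attn(x)), out = LayerNorm(z + FFN(z))."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # batch_first=True: inputs and outputs have shape (batch, seq_len, d_model)
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Position-wise FFN: expand to d_ff, apply GELU, project back to d_model
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # mask: optional attn_mask of shape (seq_len, seq_len); in a boolean mask,
        # True marks positions that must NOT be attended to
        # Self-attention with residual (the returned attention weights are discarded)
        attn_out, _ = self.attention(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_out))
        # FFN with residual (dropout is already applied inside self.ffn)
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x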

Key Takeaways

Summary: Transformers

  • Self-attention enables parallel processing of sequences (O(1) sequential ops)
  • Multi-head attention captures different relationship types simultaneously
  • Positional encoding adds sequence order information (sinusoidal, learned, or RoPE)
  • Encoder for understanding (BERT), Decoder for generation (GPT)
  • Residual connections + Layer Norm stabilize deep transformers
  • Complexity: O(n²·d) per layer — quadratic in sequence length
  • Transformers power nearly all modern LLMs (GPT, Claude, Gemini, LLaMA)
  • Scaling laws: Performance scales predictably with parameters, data, compute

What to Learn Next

-> BERT Learn about bidirectional language understanding.

-> GPT Architecture Explore how GPT generates text.

-> Attention Deep Dive Master attention mechanisms in detail.

-> Vision Transformers Apply Transformers to computer vision.

-> Transfer Learning Leverage pre-trained Transformer models.

-> Training Deep Networks Master optimizers and regularization for Transformers.
