GPT Architecture - How Large Language Models Generate Text
Explore the GPT architecture and understand how large language models generate coherent text.
- Autoregressive generation - predict next token sequentially
- Decoder-only architecture - simple but powerful design
- Scaling laws - performance improves with more parameters
Language is the house of being.
GPT Architecture — Complete Guide
GPT (Generative Pre-trained Transformer) is a decoder-only transformer trained to predict the next token. The GPT family powers ChatGPT, and the same decoder-only design underlies most modern LLMs.
Decoder-Only Architecture
GPT Architecture
Each Decoder Block:
- Masked Self-Attention (each token attends only to itself and earlier tokens) + Add and Norm
- Feed-Forward Network + Add and Norm
The mask prevents looking ahead: Token 4 can attend to tokens 1, 2, 3, 4 (NOT 5, 6, 7).
GPT Decoder Block Architecture
How the GPT decoder block works: This is the fundamental unit repeated N times in every GPT model. Input embeddings (token + positional) enter from the top and flow through two sub-layers. First, Masked Multi-Head Self-Attention (blue) computes context-aware representations. "Masked" means each token can attend only to itself and to earlier tokens, which prevents information leaking in from future tokens. The golden dashed line on the left is a residual connection that adds the original input to the attention output; Layer Normalization then stabilizes training. Second, the Feed-Forward Network (purple) processes each position independently through two linear layers with a GELU activation between them, expanding to 4× d_model dimensions and then projecting back to d_model. A common way to read the division of labor is that attention moves information between positions while the FFN transforms each position's representation, which is why the FFN is often described as a "knowledge store". The pattern (Attention → Add+Norm → FFN → Add+Norm) repeats for every block, and deeper blocks tend to capture more abstract language patterns. This post-norm ordering matches the original GPT; GPT-2 and later models apply Layer Normalization before each sub-layer instead.
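The block described above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions, not a production implementation: it uses a single attention head, omits biases, the attention output projection, and the learned LayerNorm scale and shift, and all names (`decoder_block`, `params`, `random_matrix`) and sizes are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def gelu(v):
    """Tanh approximation of the GELU activation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def causal_self_attention(x, wq, wk, wv):
    """Single-head attention in which position i sees only positions 0..i."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Score only positions 0..i; skipping later positions is equivalent
        # to adding -inf to them before the softmax.
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k) for j in range(i + 1)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))])
    return out

def decoder_block(x, p):
    """Attention -> Add+Norm -> FFN -> Add+Norm (post-norm ordering)."""
    attn = causal_self_attention(x, p["wq"], p["wk"], p["wv"])
    x = layer_norm(add(x, attn))                                      # residual + norm
    hidden = [[gelu(v) for v in row] for row in matmul(x, p["w1"])]   # expand to 4 * d_model
    ffn = matmul(hidden, p["w2"])                                     # project back to d_model
    return layer_norm(add(x, ffn))                                    # residual + norm

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.2) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, seq_len = 8, 4   # toy sizes chosen only for illustration
params = {
    "wq": random_matrix(d_model, d_model, rng),
    "wk": random_matrix(d_model, d_model, rng),
    "wv": random_matrix(d_model, d_model, rng),
    "w1": random_matrix(d_model, 4 * d_model, rng),
    "w2": random_matrix(4 * d_model, d_model, rng),
}
x = random_matrix(seq_len, d_model, rng)
y = decoder_block(x, params)
print(len(y), len(y[0]))
```

The output has the same shape as the input (one d_model-sized vector per position), which is what lets the block be stacked N times. Because the attention loop never reads positions after `i`, changing the last input row leaves the outputs for earlier positions unchanged.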
Autoregressive Generation
Autoregressive Generation
Generation process — predict one token at a time:
"The" -> predict next -> "cat"
"The cat" -> predict next -> "sat"
"The cat sat" -> predict next -> "on"
"The cat sat on" -> predict next -> "the"
Each prediction uses ALL previous tokens as context.
Autoregressive Text Generation Process
How autoregressive generation works: This diagram shows the step-by-step process of text generation. At each step, the model receives ALL previously generated tokens as input and predicts the next token. Step 1: given "The", the model predicts "cat" with P=0.82. Step 2: given "The, cat", it predicts "sat" with P=0.91. Step 3: given "The, cat, sat", it predicts "on" with P=0.76. The probability rises and falls from step to step because it reflects how predictable the next token is in that context, not simply how much context is available. The key insight: each prediction conditions on the ENTIRE history (up to the model's context window), not just the previous token. The model builds up the sentence one token at a time, with each new token becoming part of the input for the next prediction. This sequential process continues until an end-of-sequence token is generated or a maximum length is reached. A higher probability means the model is more confident in that token choice.
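The loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_MODEL` is a lookup table keyed by the full prefix, the top probabilities (0.82, 0.91, 0.76) are taken from the walkthrough, and the alternative tokens and their probabilities are invented for illustration. The sketch uses greedy decoding (always pick the most likely token), which is only one of several decoding strategies.

```python
# Hypothetical next-token table keyed by the full prefix. The top
# probabilities come from the walkthrough; the alternatives are made up.
TOY_MODEL = {
    ("The",): {"cat": 0.82, "dog": 0.18},
    ("The", "cat"): {"sat": 0.91, "ran": 0.09},
    ("The", "cat", "sat"): {"on": 0.76, "down": 0.24},
}
EOS = "<eos>"

def next_token_distribution(tokens):
    """Return P(next token | all previous tokens); unknown prefixes end the sequence."""
    return TOY_MODEL.get(tuple(tokens), {EOS: 1.0})

def generate_greedy(prompt, max_new_tokens=10):
    """Append the most likely next token until EOS or the length limit."""
    tokens = list(prompt)
    step_probs = []
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)   # conditions on the entire history
        token, prob = max(dist.items(), key=lambda kv: kv[1])
        if token == EOS:
            break
        tokens.append(token)                     # the new token becomes part of the input
        step_probs.append(prob)
    return tokens, step_probs

tokens, step_probs = generate_greedy(["The"])

# Chain rule: the probability of the whole continuation is the product
# of the per-step conditional probabilities.
sequence_prob = 1.0
for p in step_probs:
    sequence_prob *= p

print(tokens)
print(round(sequence_prob, 4))
```

With this table the loop produces `['The', 'cat', 'sat', 'on']` and a continuation probability of about 0.5671 (0.82 × 0.91 × 0.76), then stops because the toy model returns the end-of-sequence token for the unseen prefix.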
Causal Attention Mask
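The causal mask ensures autoregressive behavior by preventing tokens from attending to future positions:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$
Where the mask matrix is:
$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$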
Causal Attention Mask Visualization
How the causal mask enforces autoregression: This lower-triangular matrix visualizes which tokens can attend to which other tokens. Green cells (✓) mean the token CAN attend to that position; red cells (✗) mean attention is BLOCKED. Row 1 (x₁) can only attend to itself because it has no previous context. Row 2 (x₂) can attend to x₁ and itself, but not x₃ or x₄. Row 3 attends to x₁, x₂, x₃. Row 4 attends to all four. This creates the autoregressive property: the representation at position 4 is built only from tokens 1 through 4, never from token 5 or later. The mask is implemented by adding -∞ to blocked positions before softmax, which makes their attention weights zero. Combined with the absence of an encoder, this is what GPT's "decoder-only" design amounts to: every position looks left, never right.
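The masking step can be checked numerically with a small sketch. The function names are hypothetical, and the raw attention scores are all set to zero (an assumption made purely so that the effect of the mask is easy to read off).

```python
import math

def causal_mask(n):
    """M[i][j] = 0 where j <= i (allowed), -inf where j > i (blocked)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    """Add the causal mask to raw scores, then apply softmax row by row."""
    mask = causal_mask(len(scores))
    return [
        softmax([s + m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]

# Equal raw scores for a 4-token sequence, so only the mask shapes the result.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_attention_weights(scores)

for row in weights:
    print([round(w, 2) for w in row])
```

Each row spreads its attention evenly over the positions it is allowed to see: row 1 puts weight 1.0 on itself, row 2 puts 0.5 on each of the first two positions, and every entry above the diagonal is exactly 0.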
Scaling Laws
Scaling Laws
$$\text{Loss} \propto N^{-\alpha} + D^{-\beta} + C^{-\gamma}$$
Here,
- $N$ = Model size (parameters): GPT-1: 117M -> GPT-3: 175B -> GPT-4: ~1.8T (unofficial estimate)
- $D$ = Dataset size (tokens): GPT-3: 300B tokens
- $C$ = Compute (FLOPs): GPT-3: ~3.14 × 10²³ FLOPs
GPT Model Evolution and Scaling
How GPT models scale with resources: The timeline shows the explosive growth of GPT models. GPT-1 (2018) had 117M parameters trained on about 5GB of book text using 8 GPUs for roughly a month. GPT-2 (2019) jumped to 1.5B parameters on 40GB of web text. GPT-3 (2020) reached 175B parameters trained on 300B tokens, requiring ~3.14×10²³ FLOPs (about 3,640 petaflop/s-days). GPT-4 (2023) is unofficially estimated at ~1.8T parameters using a Mixture of Experts architecture; OpenAI has not published these details. The scaling curve at the bottom shows the key insight: loss decreases as a power law with compute (Loss ∝ C^(-γ)). This means performance improves predictably with more parameters, data, and compute, so you can forecast how many resources a target performance level requires. The formula at the top (Loss ∝ N^(-α) + D^(-β) + C^(-γ)) shows that scaling any one factor alone has diminishing returns: you need to scale all three together.
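The GPT-3 compute figure can be sanity-checked with a short calculation. The sketch below relies on a rule of thumb that is not part of the text above: training a dense transformer costs roughly 6 FLOPs per parameter per training token, so C ≈ 6·N·D. The exponent `GAMMA` is a hypothetical value chosen only to show how a power law behaves, not a measured constant.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

def loss_ratio(compute_multiplier, gamma):
    """If Loss ∝ C^(-gamma), multiplying compute by k scales the loss by k^(-gamma)."""
    return compute_multiplier ** (-gamma)

# GPT-3 figures quoted above: 175B parameters, 300B training tokens.
gpt3_flops = training_flops(175e9, 300e9)
print(f"{gpt3_flops:.2e}")

GAMMA = 0.05  # hypothetical exponent, for illustration only
print(round(loss_ratio(10, GAMMA), 3))
```

The estimate comes out at about 3.15 × 10²³ FLOPs, in line with the ~3.14 × 10²³ figure quoted for GPT-3. With the assumed exponent, a 10× increase in compute lowers the loss to about 0.891 of its previous value, which illustrates why each further improvement requires a multiplicative rather than additive increase in resources.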
Training Pipeline
GPT Training Pipeline
Pre-training:
- Predict next token on massive text corpus
- Objective: minimize cross-entropy loss
- 13+ trillion tokens for the largest models (unofficial estimate; GPT-3 used 300B)
- Weeks/months on thousands of GPUs
Fine-tuning:
- Supervised fine-tuning (SFT)
- RLHF (Reinforcement Learning from Human Feedback)
- Constitutional AI (Anthropic's variant, which replaces much of the human feedback with AI feedback guided by written principles)
GPT Training Pipeline
How GPT is trained end-to-end: This four-stage pipeline shows how a raw language model becomes ChatGPT. Stage 1 (Data Collection): Gather massive text corpora (Common Crawl web pages, Wikipedia, books, and code from GitHub), totaling 13+ trillion tokens by unofficial estimates for the largest models. Stage 2 (Pre-training): Train the model to predict the next token using cross-entropy loss with the AdamW optimizer. This takes weeks to months on thousands of GPUs and teaches the model language structure, facts, and some reasoning ability. Stage 3 (Supervised Fine-tuning): Train on human-written instruction-response pairs, on the order of 10K to 100K examples (InstructGPT used roughly 13K prompts), to teach the model to follow instructions. Stage 4 (RLHF): Use human preference comparisons to train a reward model, then optimize the policy using PPO (Proximal Policy Optimization) with a KL-divergence penalty that keeps it close to the supervised fine-tuned model. The formulas at the bottom show the mathematical objectives: pre-training minimizes negative log-likelihood, while RLHF maximizes reward minus a penalty for deviating too far from that reference model.
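The two objectives named above can be written out as a small sketch. All function names and numbers are hypothetical: the target probabilities reuse the toy values from the generation walkthrough, and the reward, log-probabilities, and `beta` are made up. The RLHF term is shown in its common per-sample form, reward − β·(log π(y|x) − log π_ref(y|x)), which is an assumption about how the penalty is computed.

```python
import math

def next_token_pairs(tokens):
    """Turn one sequence into (context, target) pre-training examples."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def negative_log_likelihood(target_probs):
    """Mean of -log P(target | context): the cross-entropy pre-training loss."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def kl_penalized_reward(reward, logp_policy, logp_reference, beta):
    """Reward minus beta times a per-sample estimate of the KL to the reference model."""
    return reward - beta * (logp_policy - logp_reference)

# Pre-training: every position in the text supplies one training example.
pairs = next_token_pairs(["The", "cat", "sat", "on"])
for context, target in pairs:
    print(context, "->", target)

# Probabilities the model assigned to the three correct targets (toy values).
loss = negative_log_likelihood([0.82, 0.91, 0.76])
print(round(loss, 4))

# RLHF: a response scored 1.0 by the reward model, slightly more likely
# under the policy than under the reference model (all values made up).
shaped = kl_penalized_reward(reward=1.0, logp_policy=-2.0, logp_reference=-2.5, beta=0.1)
print(round(shaped, 4))
```

The loss falls toward 0 as the probabilities of the correct targets approach 1. In the RLHF term, the penalty grows as the policy assigns the response a higher log-probability than the reference model does, which discourages the policy from drifting far from its starting point in pursuit of reward.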
Key Takeaways
Summary: GPT Architecture
- GPT is a decoder-only transformer — predicts next token
- Masked attention prevents looking ahead
- Autoregressive generation produces text one token at a time
- Scaling laws predict performance from size, data, compute
- Pre-training + fine-tuning is the two-stage approach
- RLHF aligns models with human preferences
- GPT-4 is widely reported, though not officially confirmed, to use an MoE (Mixture of Experts) architecture
- Context window limits how much text the model can process at once
What to Learn Next
-> BERT: Compare with bidirectional models.
-> Transformers: Master the underlying architecture.
-> What are LLMs: Learn the basics of large language models.
-> LLM Architecture Deep Dive: Explore LLM internals in detail.
-> Pre-training Language Models: Understand how models learn from text.
-> RLHF and Alignment: Learn how to align models with human values.