
GPT Architecture - How Large Language Models Generate Text

Explore the GPT architecture and understand how large language models generate coherent text.

Autoregressive generation - predict next token sequentially
Decoder-only architecture - simple but powerful design
Scaling laws - performance improves with more parameters

Language is the house of being.

GPT Architecture — Complete Guide

GPT (Generative Pre-trained Transformer) is a decoder-only transformer trained to predict the next token. This design underlies ChatGPT, GPT-4, and most other modern LLMs.

Decoder-Only Architecture

Definition: GPT Architecture

\text{Input tokens} \rightarrow \text{Embedding + Position} \rightarrow N \text{ Decoder Blocks} \rightarrow \text{Output logits}

Each Decoder Block:

Masked Self-Attention (can only attend to past) + Add and Norm
Feed-Forward Network + Add and Norm

The mask prevents looking ahead: Token 4 can attend to tokens 1, 2, 3, 4 (NOT 5, 6, 7).
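The first and last steps of the pipeline above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not any particular GPT implementation: the tiny vocabulary, the sizes, and the names `embed` and `output_logits` are hypothetical. It assumes learned position embeddings and an output projection that reuses the token embedding table (weight tying), and it skips the decoder blocks entirely.

```python
import random

rng = random.Random(0)

# Hypothetical toy vocabulary and sizes, chosen only for illustration.
vocab = {"The": 0, "cat": 1, "sat": 2}
d_model, max_len = 4, 8

# Lookup tables; in a real model these are learned during training.
token_emb = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in vocab]
pos_emb = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(max_len)]


def embed(tokens):
    """Return one d_model vector per token: token embedding + position embedding."""
    ids = [vocab[t] for t in tokens]
    return [
        [t + p for t, p in zip(token_emb[token_id], pos_emb[pos])]
        for pos, token_id in enumerate(ids)
    ]


def output_logits(hidden):
    """Score every vocabulary item at every position.

    Assumption: the output projection reuses the token embedding table.
    """
    return [
        [sum(h * e for h, e in zip(row, emb)) for emb in token_emb]
        for row in hidden
    ]


hidden = embed(["The", "cat", "cat"])  # the decoder blocks would transform this
logits = output_logits(hidden)
print(len(logits), "positions x", len(logits[0]), "vocabulary scores")
```

The same token ("cat") receives a different input vector at each position, which is how the model can tell word order apart.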

GPT Decoder Block Architecture

How the GPT decoder block works: This is the fundamental unit repeated N times in every GPT model. Input embeddings (token + positional) enter from the top and flow through two sub-layers.

First, Masked Multi-Head Self-Attention (blue) computes context-aware representations. "Masked" means each token can attend only to itself and earlier tokens, which prevents information from leaking in from future tokens. The golden dashed line on the left is a residual connection that adds the original input to the attention output; Layer Normalization then stabilizes training.

Second, the Feed-Forward Network (purple) processes each position independently through two linear layers with a GELU activation between them, expanding to 4× d_model dimensions and then projecting back. A common interpretation is that the FFN acts as a "knowledge store": the attention layer gathers context, and the FFN transforms each position based on that context.

The pattern (Attention → Add+Norm → FFN → Add+Norm) repeats for every block, with deeper layers tending to capture increasingly abstract language patterns. This post-norm ordering follows the original GPT; GPT-2 and later models instead apply Layer Normalization at the input of each sub-layer.
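The block described above can be sketched in plain Python. This is a simplified illustration under several assumptions, not a faithful GPT implementation: it uses a single attention head, no biases, no attention output projection, and layer normalization without learned scale and shift. All function names, the random weights, and the sizes are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    """Normalize one position to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def gelu(v):
    """Tanh approximation of GELU."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, per position."""
    return [layer_norm([a + b for a, b in zip(xr, sr)]) for xr, sr in zip(x, sublayer_out)]


def masked_self_attention(x, wq, wk, wv):
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Causal mask: position i only scores positions 0..i.
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k) for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))])
    return out


def feed_forward(x, w1, w2):
    """Expand to 4 * d_model, apply GELU, project back to d_model."""
    hidden = [[gelu(h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def decoder_block(x, params):
    x = add_and_norm(x, masked_self_attention(x, params["wq"], params["wk"], params["wv"]))
    x = add_and_norm(x, feed_forward(x, params["w1"], params["w2"]))
    return x


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, seq_len = 8, 4
params = {
    "wq": random_matrix(d_model, d_model, rng),
    "wk": random_matrix(d_model, d_model, rng),
    "wv": random_matrix(d_model, d_model, rng),
    "w1": random_matrix(d_model, 4 * d_model, rng),
    "w2": random_matrix(4 * d_model, d_model, rng),
}
x = random_matrix(seq_len, d_model, rng)
y = decoder_block(x, params)
print(len(y), "positions x", len(y[0]), "dimensions")
```

The output has the same shape as the input, which is what allows N blocks to be stacked. Because of the mask, changing the last input position leaves the outputs at earlier positions unchanged.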

Autoregressive Generation

Definition: Autoregressive Generation

Generation process — predict one token at a time:

"The" -> predict next -> "cat"
"The cat" -> predict next -> "sat"
"The cat sat" -> predict next -> "on"
"The cat sat on" -> predict next -> "the"

Each prediction uses ALL previous tokens as context.

Autoregressive Text Generation Process

How autoregressive generation works: This diagram shows the step-by-step process of text generation. At each step, the model receives ALL previously generated tokens as input and predicts the next token. Step 1: given "The", the model predicts "cat" with P=0.82. Step 2: given "The, cat", it predicts "sat" with P=0.91. The probability is higher here because the extra context narrows the plausible continuations, although more context does not always raise confidence. Step 3: given "The, cat, sat", it predicts "on" with a lower P=0.76, since several words could plausibly follow.

The key insight: each prediction conditions on the ENTIRE history, not just the previous token. The model builds up the sentence one token at a time, with each new token becoming part of the input for the next prediction. This sequential process continues until an end-of-sequence token is generated or a maximum length is reached. The probabilities reflect the model's confidence: a higher probability means the model is more certain about that token choice.
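The loop described above can be sketched as follows. The "model" here is a hypothetical lookup table (`CONTINUATIONS`) standing in for a trained network, and the decoding rule is greedy (always take the most probable token); real systems usually sample from the distribution instead. The point is only the control flow: feed the full history in, append one token, repeat until `<eos>` or a length limit.

```python
import math

VOCAB = ["The", "cat", "sat", "on", "the", "mat", "<eos>"]

# Hypothetical stand-in for a trained model: maps the FULL history to the
# token it favors. A real GPT would run the history through its decoder blocks.
CONTINUATIONS = {
    ("The",): "cat",
    ("The", "cat"): "sat",
    ("The", "cat", "sat"): "on",
    ("The", "cat", "sat", "on"): "the",
    ("The", "cat", "sat", "on", "the"): "mat",
}


def toy_logits(tokens):
    """Return one score per vocabulary item, conditioned on the whole history."""
    favored = CONTINUATIONS.get(tuple(tokens), "<eos>")
    return [3.0 if tok == favored else 0.0 for tok in VOCAB]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))
        next_id = max(range(len(VOCAB)), key=probs.__getitem__)
        next_token = VOCAB[next_id]
        if next_token == "<eos>":  # stop condition 1: end-of-sequence
            break
        tokens.append(next_token)  # the new token becomes part of the input
    return tokens  # stop condition 2: max_new_tokens reached


print(generate(["The"]))  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```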

Causal Attention Mask

The causal mask ensures autoregressive behavior by preventing tokens from attending to future positions:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V

Where the mask matrix $M$ is:

M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}

Causal Attention Mask Visualization

How the causal mask enforces autoregression: This lower-triangular matrix visualizes which tokens can attend to which other tokens. Green cells (✓) mean the token CAN attend to that position; red cells (✗) mean attention is BLOCKED. Row 1 (x₁) can only attend to itself, since it has no previous context. Row 2 (x₂) can attend to x₁ and itself, but not x₃ or x₄. Row 3 attends to x₁, x₂, x₃. Row 4 attends to all four. This creates the autoregressive property: when predicting token 4, the model can only use information from tokens 1, 2, 3, never from later tokens. The mask is implemented by adding -∞ to blocked positions before softmax, which makes their attention weights exactly zero. The causal mask is what makes GPT autoregressive: each position can only look left, never right. ("Decoder-only" itself refers to the absence of a separate encoder stack.)
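A small sketch makes the effect of the mask concrete. It builds the matrix M from the definition above, adds it to a hypothetical 4×4 score matrix (all zeros, standing in for QKᵀ/√d_k), and applies a row-wise softmax. The function names are illustrative, not taken from any library.

```python
import math


def causal_mask(n):
    """M[i][j] = 0 if j <= i, else -inf."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]


def softmax(row):
    m = max(row)  # the diagonal is never masked, so m is always finite
    exps = [math.exp(v - m) for v in row]  # math.exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores):
    """Row-wise softmax(scores + M)."""
    mask = causal_mask(len(scores))
    return [
        softmax([s + m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]


# Hypothetical raw scores: all equal, so each row spreads attention
# uniformly over the positions it is allowed to see.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row 1 puts all its weight on itself, row 2 splits 0.5/0.5, and row 4 gives 0.25 to each position; every entry above the diagonal is zero.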

Scaling Laws

\text{Loss} \propto N^{-\alpha} + D^{-\beta} + C^{-\gamma}

Here,

$N$ = model size (parameters), e.g. GPT-1: 117M -> GPT-3: 175B -> GPT-4: ~1.8T (unofficial estimate)
$D$ = dataset size (tokens), e.g. GPT-3: 300B tokens
$C$ = compute (FLOPs), e.g. GPT-3: ~3.14 × 10²³ FLOPs

GPT Model Evolution and Scaling

How GPT models scale with resources: The timeline shows the rapid growth of GPT models. GPT-1 (2018) had 117M parameters and was trained on about 5GB of book text using 8 GPUs for roughly a month. GPT-2 (2019) jumped to 1.5B parameters on 40GB of web text. GPT-3 (2020) reached 175B parameters trained on 300B tokens, requiring ~3.14×10²³ FLOPs (about 3,640 petaflop/s-days). GPT-4 (2023) is unofficially estimated at ~1.8T parameters using a Mixture of Experts architecture; OpenAI has not published these details. The scaling curve at the bottom shows the key insight: loss decreases as a power law with compute (Loss ∝ C^(-γ)). Performance therefore improves predictably with more parameters, data, and compute, so you can forecast roughly how many resources a target loss will require. The schematic formula at the top (Loss ∝ N^(-α) + D^(-β) + C^(-γ)) shows that scaling any one factor alone has diminishing returns, because the other terms start to dominate; all three need to scale together.
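The numbers above can be checked with a little arithmetic. The sketch below relies on two things that are not stated in this guide: the widely used rule of thumb that training compute is roughly C ≈ 6·N·D FLOPs, and an exponent γ = 0.05 chosen purely for illustration. The function names are hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate (assumption): about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def loss_ratio(scale_factor, exponent):
    """Factor by which a pure power-law loss changes when one resource is
    multiplied by scale_factor: (k * C)^(-gamma) / C^(-gamma) = k^(-gamma)."""
    return scale_factor ** (-exponent)


# GPT-3 figures quoted in the text: 175B parameters, 300B tokens.
gpt3_flops = training_flops(175e9, 300e9)
print(f"GPT-3 estimate: {gpt3_flops:.2e} FLOPs")  # 3.15e+23, close to the quoted 3.14e23

# Convert to petaflop/s-days: 1e15 FLOP/s sustained for 86,400 seconds.
print(f"About {gpt3_flops / (1e15 * 86400):,.0f} petaflop/s-days")

# Illustrative exponent only; real values are fitted empirically.
gamma = 0.05
print(f"10x more compute multiplies loss by {loss_ratio(10, gamma):.3f}")
```

With a small exponent like this, a tenfold increase in compute reduces the loss by only about 11%, which is one way to see why each further improvement gets more expensive.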

Training Pipeline

Definition: GPT Training Pipeline

Pre-training:

Predict next token on massive text corpus
Objective: minimize cross-entropy loss
13+ trillion tokens for the largest models (unofficial estimate; GPT-3 used about 300B)
Weeks/months on thousands of GPUs
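As a small worked example of the pre-training objective, the sketch below computes the cross-entropy (mean negative log-likelihood) over the three probabilities from the generation example earlier (0.82, 0.91, 0.76), treating them as the probabilities the model assigned to the correct next tokens. The function name is illustrative.

```python
import math


def cross_entropy(correct_token_probs):
    """Mean negative log-likelihood of the correct next tokens."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


# Probabilities the model assigned to "cat", "sat", "on" in the example.
probs = [0.82, 0.91, 0.76]
loss = cross_entropy(probs)
perplexity = math.exp(loss)

print(f"loss = {loss:.3f}, perplexity = {perplexity:.3f}")
```

The loss here is about 0.19. It would be 0 if the model assigned probability 1 to every correct token, and it grows without bound as any of those probabilities approaches 0, so training pushes probability mass toward the tokens that actually follow.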

Fine-tuning:

Supervised fine-tuning (SFT)
RLHF (Reinforcement Learning from Human Feedback)
Constitutional AI (Anthropic)

GPT Training Pipeline

How GPT is trained end-to-end: This four-stage pipeline shows how a raw language model becomes ChatGPT.

Stage 1 (Data Collection): Gather massive text corpora (Common Crawl web pages, Wikipedia, books, and code from GitHub), up to a reported 13+ trillion tokens for the largest models.

Stage 2 (Pre-training): Train the model to predict the next token using cross-entropy loss with the AdamW optimizer. This typically takes weeks to months on thousands of GPUs and teaches the model language structure, facts, and some reasoning ability.

Stage 3 (Supervised Fine-tuning): Train on human-written instruction-response pairs (on the order of tens of thousands to ~100K) to teach the model to follow instructions.

Stage 4 (RLHF): Use human preference comparisons to train a reward model, then optimize the policy using PPO (Proximal Policy Optimization) with a KL-divergence penalty that keeps it close to a reference model, typically the supervised fine-tuned model from Stage 3.

The formulas at the bottom show the mathematical objectives: pre-training minimizes negative log-likelihood, while RLHF maximizes reward minus a penalty for deviating too far from the reference model.
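The RLHF objective (reward minus a KL penalty) can be illustrated for a single next-token distribution. This is a sketch of the objective only, not of PPO; the function names, the penalty weight `beta = 0.1`, and the example distributions are all hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Reward-model score minus beta times the divergence from the reference model."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)


reference = [0.5, 0.3, 0.2]    # reference model's next-token distribution
unchanged = [0.5, 0.3, 0.2]    # policy identical to the reference: no penalty
drifted = [0.9, 0.05, 0.05]    # policy that has moved far from the reference

print(penalized_reward(1.0, unchanged, reference))
print(penalized_reward(1.0, drifted, reference))
```

With the same raw reward, the drifted policy scores lower than the unchanged one. The penalty therefore discourages the policy from exploiting the reward model by moving far from the behavior learned in the earlier stages.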

Key Takeaways

Summary: GPT Architecture

GPT is a decoder-only transformer — predicts next token
Masked attention prevents looking ahead
Autoregressive generation produces text one token at a time
Scaling laws predict performance from size, data, compute
Pre-training + fine-tuning is the two-stage approach
RLHF aligns models with human preferences
GPT-4 is reported (not officially confirmed) to use an MoE (Mixture of Experts) architecture
Context window limits how much text the model can process

What to Learn Next

-> BERT: Compare with bidirectional models.

-> Transformers: Master the underlying architecture.

-> What are LLMs: Learn the basics of large language models.

-> LLM Architecture Deep Dive: Explore LLM internals in detail.

-> Pre-training Language Models: Understand how models learn from text.

-> RLHF and Alignment: Learn how to align models with human values.
