
GPT Architecture — Decoder-Only Transformers Complete Guide


GPT Architecture - How Large Language Models Generate Text

Explore the GPT architecture and understand how large language models generate coherent text.

  • Autoregressive generation - predict next token sequentially
  • Decoder-only architecture - simple but powerful design
  • Scaling laws - performance improves with more parameters

Language is the house of being.

GPT Architecture — Complete Guide

GPT (Generative Pre-trained Transformer) is a decoder-only transformer trained to predict the next token. This architecture underlies ChatGPT, GPT-4, and most modern LLMs.


Decoder-Only Architecture

GPT Architecture

$$\text{Input tokens} \rightarrow \text{Embedding + Position} \rightarrow N \text{ Decoder Blocks} \rightarrow \text{Output logits}$$

Each Decoder Block:

  1. Masked Self-Attention (can only attend to past) + Add and Norm
  2. Feed-Forward Network + Add and Norm

The mask prevents looking ahead: Token 4 can attend to tokens 1, 2, 3, 4 (NOT 5, 6, 7).

GPT Decoder Block Architecture

[Figure: GPT decoder block, repeated N times. Input Embeddings + Position → Masked Multi-Head Self-Attention (Q, K, V projections, causal mask so each token attends only to previous tokens) → residual Add and LayerNorm → Feed-Forward Network (Linear(d_model → 4×d_model) → GELU → Linear(4×d_model → d_model), applied independently to each position) → Add and LayerNorm → next block or output layer.]

How the GPT decoder block works: This is the fundamental unit, repeated N times in every GPT model. Input embeddings (token + positional) flow through two sub-layers. First, Masked Multi-Head Self-Attention computes context-aware representations; "masked" means each token can attend only to itself and earlier tokens, so no information leaks from future tokens. A residual connection adds the sub-layer's input to the attention output, and Layer Normalization then stabilizes training. Second, the Feed-Forward Network processes each position independently through two linear layers with a GELU activation, expanding to 4× d_model dimensions and projecting back to d_model. The FFN is often described as a "knowledge store": attention gathers context from other positions, and the FFN transforms each position's representation using that context. The pattern (Attention → Add+Norm → FFN → Add+Norm) repeats in every block, with deeper blocks tending to capture more abstract language patterns. This post-norm ordering matches the original GPT; GPT-2 and later models instead apply LayerNorm at the input of each sub-layer (pre-norm).
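The feed-forward half of the block can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code of any GPT release: the function names, the tiny sizes (d_model = 2, hidden size 8), and the hand-picked weights are all hypothetical, and the LayerNorm here omits the learned scale and shift that real models include.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position's vector to zero mean and unit variance.
    Simplification: no learned scale/shift parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    """GELU activation (tanh approximation)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def linear(x, weight, bias):
    """Compute W x + b for a single vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def ffn_sublayer(x, w1, b1, w2, b2):
    """Position-wise FFN, then residual Add and LayerNorm (post-norm order)."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]        # d_model -> 4 * d_model
    out = linear(hidden, w2, b2)                          # 4 * d_model -> d_model
    return layer_norm([a + b for a, b in zip(x, out)])    # Add and Norm

# Hypothetical toy weights, chosen by hand only so the example is deterministic.
d_model, d_hidden = 2, 8
w1 = [[0.1 * (i + 1), -0.05 * (i + 1)] for i in range(d_hidden)]
b1 = [0.0] * d_hidden
w2 = [[0.02 * (j + 1) for j in range(d_hidden)],
      [-0.01 * (j + 1) for j in range(d_hidden)]]
b2 = [0.0] * d_model

# "Position-wise" means the same weights are applied to each position separately.
sequence = [[1.0, -1.0], [0.5, 2.0]]
outputs = [ffn_sublayer(x, w1, b1, w2, b2) for x in sequence]
print(outputs)
```

Because each position goes through `ffn_sublayer` on its own, the only place where positions exchange information is the attention sub-layer.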


Autoregressive Generation

Autoregressive generation predicts one token at a time:

  1. "The" -> predict next -> "cat"
  2. "The cat" -> predict next -> "sat"
  3. "The cat sat" -> predict next -> "on"
  4. "The cat sat on" -> predict next -> "the"

Each prediction uses ALL previous tokens as context.

Autoregressive Text Generation Process

[Figure: autoregressive generation, P(x_t | x_1, ..., x_{t-1}). Step 1: input [The], output "cat" (P = 0.82). Step 2: input [The, cat], output "sat" (P = 0.91). Step 3: input [The, cat, sat], output "on" (P = 0.76). Final: "The cat sat on the mat". Each step conditions on all previous tokens.]

How autoregressive generation works: At each step, the model receives ALL previously generated tokens as input and predicts the next token. Step 1: given "The", the model predicts "cat" with P=0.82. Step 2: given "The, cat", it predicts "sat" with P=0.91. Step 3: given "The, cat, sat", it predicts "on" with P=0.76. These values are illustrative: more context often narrows the distribution, but the probability of the chosen token does not rise at every step. The key insight is that each prediction conditions on the ENTIRE history, not just the previous token. The model builds up the sentence one token at a time, and each new token becomes part of the input for the next prediction. This sequential process continues until an end-of-sequence token is generated or a maximum length is reached. The probability assigned to a token reflects how strongly the model favors it over the alternatives in that context.
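The loop described above can be sketched with a toy stand-in for the model. Everything here is an assumption made for illustration: `TOY_MODEL` is a hypothetical lookup table rather than a neural network, the first three probabilities echo the example, the remaining ones are made up, and decoding is greedy (always pick the most probable token), which is only one of several decoding strategies.

```python
# Hypothetical toy "model": maps the full context so far to next-token probabilities.
TOY_MODEL = {
    ("The",): {"cat": 0.82, "dog": 0.18},
    ("The", "cat"): {"sat": 0.91, "ran": 0.09},
    ("The", "cat", "sat"): {"on": 0.76, "down": 0.24},
    ("The", "cat", "sat", "on"): {"the": 0.88, "a": 0.12},          # made-up values
    ("The", "cat", "sat", "on", "the"): {"mat": 0.70, "rug": 0.30},  # made-up values
}

def next_token_distribution(context):
    """Return P(next token | entire context); unknown contexts end the sequence."""
    return TOY_MODEL.get(tuple(context), {"<eos>": 1.0})

def generate(prompt, max_new_tokens=10, eos="<eos>"):
    """Greedy autoregressive decoding: append one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)  # conditions on ALL previous tokens
        token = max(probs, key=probs.get)        # greedy choice
        if token == eos:                         # stop on end-of-sequence
            break
        tokens.append(token)                     # new token becomes part of the input
    return tokens

print(" ".join(generate(["The"])))  # prints: The cat sat on the mat
```

The two stopping conditions in the text map directly onto the `eos` check and the `max_new_tokens` bound.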


Causal Attention Mask

The causal mask ensures autoregressive behavior by preventing tokens from attending to future positions:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

where the mask matrix $M$ is:

$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

Causal Attention Mask Visualization

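[Figure: causal attention mask for 4 tokens (x₁ to x₄). Rows are queries, columns are keys (attended positions). Cells on and below the diagonal are marked ✓ (can attend, mask = 0); cells above the diagonal are blocked (mask = -∞). The lower-triangular mask ensures the causal/autoregressive property.]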

How the causal mask enforces autoregression: This lower-triangular matrix shows which tokens can attend to which other tokens. Cells marked ✓ mean the token CAN attend to that position; cells marked ✗ mean attention is BLOCKED. Row 1 (x₁) can only attend to itself because it has no previous context. Row 2 (x₂) can attend to x₁ and itself, but not x₃ or x₄. Row 3 attends to x₁, x₂, x₃. Row 4 attends to all four. This creates the autoregressive property: the representation at position 4, which is used to predict token 5, can draw only on tokens 1 to 4 and never on token 5 or later. The mask is implemented by adding -∞ to blocked positions before softmax, which makes their attention weights exactly zero. The mask is what makes GPT causal: it can only look left, never right. "Decoder-only" refers to something else, namely that the model has no separate encoder and no cross-attention.
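As a small sketch of the masking step, the following plain-Python code builds the mask matrix M and applies it inside a softmax. The function names are illustrative assumptions, and the queries and keys are deliberately identical so that the effect of the mask is easy to read off: every allowed position gets an equal share of the attention.

```python
import math

def causal_mask(n):
    """M[i][j] = 0 if j <= i, otherwise -inf (future positions are blocked)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)  # finite, because the diagonal entry is never masked
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(q, k):
    """softmax(Q K^T / sqrt(d_k) + M), computed row by row."""
    d_k = len(q[0])
    mask = causal_mask(len(q))
    weights = []
    for i, q_i in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) + mask[i][j]
            for j, k_j in enumerate(k)
        ]
        weights.append(softmax(scores))
    return weights

# Four identical toy vectors: all unmasked scores in a row are equal.
q = k = [[1.0, 0.0]] * 4
for row in causal_attention_weights(q, k):
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.333, 0.333, 0.333, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

The printed rows reproduce the lower-triangular pattern: row i spreads its weight over positions 1 to i and assigns zero to everything after.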


Scaling Laws

$$\text{Loss} \propto N^{-\alpha} + D^{-\beta} + C^{-\gamma}$$

Here,

  • $N$ = model size (parameters). GPT-1: 117M -> GPT-3: 175B -> GPT-4: ~1.8T (unconfirmed estimate)
  • $D$ = dataset size (tokens). GPT-3: 300B tokens
  • $C$ = compute (FLOPs). GPT-3: ~3.14 × 10²³ FLOPs

GPT Model Evolution and Scaling

[Figure: GPT evolution timeline (GPT-1 2018, GPT-2 2019, GPT-3 2020, GPT-4 2023) showing parameters, training data, and compute, above a power-law scaling plot of Log(Loss) against Log(Compute / FLOPs).]

How GPT models scale with resources: The timeline shows the rapid growth of GPT models. GPT-1 (2018) had 117M parameters and was trained on about 5GB of book text (BookCorpus) on 8 GPUs for about a month. GPT-2 (2019) jumped to 1.5B parameters on 40GB of web text (WebText). GPT-3 (2020) reached 175B parameters trained on 300B tokens, which required ~3.14×10²³ FLOPs. GPT-4 (2023) is estimated at ~1.8T parameters with a Mixture of Experts architecture; these figures come from unconfirmed reports, since OpenAI has not published them. The scaling curve shows the key insight: loss decreases as a power law with compute (Loss ∝ C^(-γ)), a straight line on log-log axes. Performance therefore improves predictably with more parameters, data, and compute, so you can forecast the resources needed for a target loss. The schematic formula (Loss ∝ N^(-α) + D^(-β) + C^(-γ)) captures that scaling any one factor alone has diminishing returns: the other two become the bottleneck, so all three need to grow together.
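The GPT-3 compute figure can be sanity-checked with a common rule of thumb that is not stated in this lesson and is introduced here as an assumption: training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens. The sketch below also shows what a power law implies for a doubling of compute, using a purely hypothetical exponent rather than a fitted one.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate (assumption): C is about 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens

def loss_ratio(scale_factor, exponent):
    """If Loss is proportional to C^(-exponent), multiplying C by
    scale_factor multiplies the loss by scale_factor^(-exponent)."""
    return scale_factor ** (-exponent)

gpt3_flops = training_flops(175e9, 300e9)  # N = 175B parameters, D = 300B tokens
print(f"{gpt3_flops:.2e}")                 # prints: 3.15e+23

# Hypothetical exponent chosen only for illustration, not a measured value.
gamma = 0.05
print(loss_ratio(2.0, gamma))              # loss multiplier after doubling compute
```

The estimate lands within a fraction of a percent of the ~3.14×10²³ FLOPs quoted for GPT-3. The second call makes the diminishing returns concrete: with a small exponent, doubling compute lowers the loss by only a few percent.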


Training Pipeline

GPT Training Pipeline

Pre-training:

  • Predict next token on massive text corpus
  • Objective: minimize cross-entropy loss
  • Trillions of tokens (13+ trillion in reported estimates for the largest models)
  • Weeks to months on thousands of GPUs

Fine-tuning:

  • Supervised fine-tuning (SFT)
  • RLHF (Reinforcement Learning from Human Feedback)
  • Constitutional AI (Anthropic)

GPT Training Pipeline

[Figure: GPT training pipeline. 1. Data Collection: web crawl (Common Crawl), books, Wikipedia, code (GitHub), 13T+ tokens. 2. Pre-training: next-token prediction, cross-entropy loss, AdamW optimizer, ~3-6 months on GPUs. 3. SFT: human-written instruction-response pairs, ~100K examples. 4. RLHF: reward model, PPO optimization, human feedback, alignment.]

Training objectives:

$$\text{Pre-training: } L(\theta) = -\sum_t \log P(x_t \mid x_{<t}; \theta)$$

$$\text{RLHF: } \max_\theta \; \mathbb{E}\left[r_\phi(y \mid x)\right] - \beta \, \mathrm{KL}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$

How GPT is trained end-to-end: This four-stage pipeline shows how a raw language model becomes a ChatGPT-style assistant. Stage 1 (Data Collection): Gather massive text corpora, including Common Crawl (web pages), Wikipedia, books, and code from GitHub, totaling 13+ trillion tokens in reported estimates. Stage 2 (Pre-training): Train the model to predict the next token using cross-entropy loss with the AdamW optimizer. This takes roughly 3 to 6 months on thousands of GPUs and teaches the model language structure, facts, and reasoning patterns. Stage 3 (Supervised Fine-tuning): Train on ~100K human-written instruction-response pairs to teach the model to follow instructions. Stage 4 (RLHF): Use human feedback to train a reward model, then optimize the policy with PPO (Proximal Policy Optimization) under a KL-divergence penalty that keeps it close to a reference model, typically the fine-tuned model from Stage 3. The two training objectives state this mathematically: pre-training minimizes negative log-likelihood, while RLHF maximizes reward minus a penalty for deviating too far from the reference model.
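Both objectives reduce to simple arithmetic on log-probabilities, which the following sketch makes concrete. The function names and all input numbers are assumptions for illustration: the three probabilities reuse the earlier "The cat sat on" example, and the RLHF function uses a single-sample estimate of the KL term (log π_θ minus log π_ref for one sampled response), which is one common simplification rather than a full PPO implementation.

```python
import math

def pretraining_loss(true_token_probs):
    """Negative log-likelihood: -sum_t log P(x_t | x_<t).
    Each entry is the probability the model gave to the token that actually came next."""
    return -sum(math.log(p) for p in true_token_probs)

def rlhf_objective(reward, logp_policy, logp_ref, beta):
    """Per-sample RLHF objective: reward minus beta times a one-sample KL estimate."""
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate

# Probabilities assigned to "cat", "sat", "on" in the earlier example.
probs = [0.82, 0.91, 0.76]
nll = pretraining_loss(probs)
print(nll)                          # total loss over the three predictions
print(math.exp(nll / len(probs)))   # perplexity = exp(mean per-token loss)

# Hypothetical numbers: the policy likes this response more than the reference does,
# so part of the reward is traded away by the KL penalty.
print(rlhf_objective(reward=1.0, logp_policy=-2.0, logp_ref=-3.0, beta=0.1))  # prints: 0.9
```

A model that assigned probability 1 to every correct token would have a pre-training loss of exactly 0, and a policy identical to the reference would pay no KL penalty.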


Key Takeaways

Summary: GPT Architecture

  • GPT is a decoder-only transformer — predicts next token
  • Masked attention prevents looking ahead
  • Autoregressive generation produces text one token at a time
  • Scaling laws predict performance from size, data, compute
  • Pre-training + fine-tuning is the two-stage approach
  • RLHF aligns models with human preferences
  • GPT-4 is reported, but not confirmed, to use a MoE (Mixture of Experts) architecture
  • Context window limits how much text the model can process

What to Learn Next

-> BERT Compare with bidirectional models.

-> Transformers Master the underlying architecture.

-> What are LLMs Learn the basics of large language models.

-> LLM Architecture Deep Dive Explore LLM internals in detail.

-> Pre-training Language Models Understand how models learn from text.

-> RLHF and Alignment Learn how to align models with human values.
