GPT Models
The Generative Pre-trained Transformer (GPT) family demonstrates the power of scaling autoregressive language models. GPT models use only the decoder portion of the transformer with causal masking.
Autoregressive Language Modeling
GPT models predict the next token given all previous tokens.
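Definition (Autoregressive Language Modeling): the joint probability of a token sequence factorizes by the chain rule as $P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$.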
The model maximizes the log-likelihood of the training data:
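Definition (Training Objective): $\max_{\theta} \sum_{t=1}^{T} \log P(x_t \mid x_1, \dots, x_{t-1}; \theta)$, where $\theta$ denotes the model parameters.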
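As a minimal sketch of the log-likelihood objective, the snippet below scores a short sequence by summing per-token log-probabilities. The function `toy_next_token_probs` and its three-token vocabulary are hypothetical stand-ins for a real network; only the summation over conditional log-probabilities reflects the objective described here.

```python
import math

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a language model: returns P(next | prefix)."""
    if not prefix:
        return {"a": 0.6, "b": 0.3, "<eos>": 0.1}
    if prefix[-1] == "a":
        return {"a": 0.1, "b": 0.7, "<eos>": 0.2}
    return {"a": 0.2, "b": 0.2, "<eos>": 0.6}

def sequence_log_likelihood(tokens):
    """Sum of log P(x_t | x_<t) over the sequence (chain rule)."""
    total = 0.0
    for t, token in enumerate(tokens):
        probs = toy_next_token_probs(tokens[:t])  # condition on the prefix only
        total += math.log(probs[token])
    return total

tokens = ["a", "b", "<eos>"]
log_lik = sequence_log_likelihood(tokens)
print(f"log-likelihood: {log_lik:.4f}")
print(f"probability:    {math.exp(log_lik):.4f}")  # 0.6 * 0.7 * 0.6 = 0.252
```

Training maximizes this quantity over the whole corpus, which is equivalent to minimizing the average next-token cross-entropy.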
Causal Masking
Causal masking ensures each position can only attend to previous positions (and itself), preventing information leakage from future tokens.
import torch
import torch.nn as nn
import math

def create_causal_attention_mask(seq_len, device='cpu'):
    # Ones strictly above the diagonal mark the future positions
    mask = torch.triu(torch.ones(seq_len, seq_len, device=device), diagonal=1)
    # Future positions become -inf (zero weight after softmax); the rest stay 0
    mask = mask.masked_fill(mask == 1, float('-inf'))
    # Add batch and head dimensions: (1, 1, seq_len, seq_len)
    return mask.unsqueeze(0).unsqueeze(0)

# Example for sequence length 5
mask = create_causal_attention_mask(5)
print(mask)
# tensor([[[[0., -inf, -inf, -inf, -inf],
#           [0.,   0., -inf, -inf, -inf],
#           [0.,   0.,   0., -inf, -inf],
#           [0.,   0.,   0.,   0., -inf],
#           [0.,   0.,   0.,   0.,   0.]]]])
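To illustrate why an additive `-inf` mask blocks future tokens, here is a small pure-Python sketch that adds the mask to raw attention scores and applies a row-wise softmax. The score values are made up for illustration, and the helper names `causal_mask` and `softmax` are not part of the code above.

```python
import math

def causal_mask(seq_len):
    """0.0 where attention is allowed (j <= i), -inf where it is blocked (j > i)."""
    return [[0.0 if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw attention scores for a 3-token sequence
scores = [[0.5, 2.0, 1.0],
          [1.0, 1.0, 3.0],
          [0.2, 0.4, 0.6]]

mask = causal_mask(3)
weights = [softmax([s + m for s, m in zip(score_row, mask_row)])
           for score_row, mask_row in zip(scores, mask)]

for row in weights:
    print([round(w, 3) for w in row])
```

The first row prints `[1.0, 0.0, 0.0]` and the second `[0.5, 0.5, 0.0]`: large scores at future positions (2.0 and 3.0 here) receive exactly zero weight, while each row still sums to 1.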
GPT-2 Architecture
| Parameter | Value |
|---|---|
| Vocabulary size | 50,257 (BPE) |
| Context window | 1024 tokens |
| Model dimension | 1600 |
| Attention heads | 25 |
| Transformer layers | 48 |
| Total parameters | 1.5 billion |
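
The total in the table can be approximated from the other rows. The sketch below is a rough estimate under stated assumptions: a feed-forward width of 4 × d_model (not listed in the table), output weights tied to the token embedding, and biases and LayerNorm parameters ignored. The function name `estimate_gpt_params` is hypothetical.

```python
def estimate_gpt_params(vocab_size, context_len, d_model, num_layers):
    """Rough parameter count, ignoring biases and LayerNorm weights."""
    token_emb = vocab_size * d_model
    pos_emb = context_len * d_model
    attn_per_layer = 4 * d_model * d_model       # Q, K, V and output projections
    ffn_per_layer = 2 * d_model * (4 * d_model)  # two linear layers, 4x expansion (assumed)
    blocks = num_layers * (attn_per_layer + ffn_per_layer)
    return token_emb + pos_emb + blocks

total = estimate_gpt_params(vocab_size=50257, context_len=1024,
                            d_model=1600, num_layers=48)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

The estimate comes to 1,556,609,600, in line with the 1.5 billion figure. Almost all of it sits in the transformer blocks, which grow as 12 × layers × d_model².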
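from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Note: 'gpt2' is the smallest checkpoint (about 124M parameters);
# the 1.5B configuration in the table above corresponds to 'gpt2-xl'.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()  # disable dropout for generation

def generate_text(prompt, max_length=50, temperature=1.0, top_k=50):
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    output = model.generate(
        input_ids,
        max_length=max_length,        # total length, prompt tokens included
        temperature=temperature,
        top_k=top_k,
        do_sample=True,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no dedicated pad token
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_text("The future of AI is"))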
GPT-3 Scaling
GPT-3 demonstrated that scaling up language models dramatically improves few-shot and zero-shot performance.
| Model | Parameters | Layers | d_model | Heads | Training Tokens |
|---|---|---|---|---|---|
| GPT-3 Small | 125M | 12 | 768 | 12 | 300B |
| GPT-3 Medium | 350M | 24 | 1024 | 16 | 300B |
| GPT-3 Large | 760M | 24 | 1536 | 16 | 300B |
| GPT-3 XL | 1.3B | 24 | 2048 | 24 | 300B |
| GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 300B |
| GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 300B |
| GPT-3 13B | 13B | 40 | 5140 | 40 | 300B |
| GPT-3 175B | 175B | 96 | 12288 | 96 | 300B |
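Definition (Scaling Law, Kaplan et al.): $L(N) = \left( N_c / N \right)^{\alpha_N}$
Here $N$ is the number of parameters, $L$ is the expected loss, and $N_c$ and $\alpha_N$ are constants fitted empirically to training runs.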
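A short numerical sketch can make the power law concrete. It assumes the form L(N) = (N_c / N)^α_N and uses α_N ≈ 0.076 and N_c ≈ 8.8 × 10^13, the values reported by Kaplan et al. for non-embedding parameters; treat both constants as assumptions here. Applying them to the GPT-3 sizes is illustrative only, not a claim about those models' measured losses.

```python
# Constants as reported by Kaplan et al. for non-embedding parameters;
# they are assumptions in this sketch, not values taken from this text.
ALPHA_N = 0.076
N_C = 8.8e13

def predicted_loss(num_params):
    """Power-law loss estimate L(N) = (N_c / N) ** alpha_N."""
    return (N_C / num_params) ** ALPHA_N

for name, n in [("125M", 125e6), ("1.3B", 1.3e9), ("13B", 13e9), ("175B", 175e9)]:
    print(f"{name:>5}: L(N) = {predicted_loss(n):.3f}")

# Each 10x increase in parameters multiplies the loss by the same factor
print(f"loss ratio per 10x parameters: {10 ** -ALPHA_N:.3f}")
```

Because the exponent is small, the loss falls slowly but steadily: every tenfold increase in parameters cuts the predicted loss by the same constant factor.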
Complete GPT Implementation
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads, max_len, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        # Causal mask: lower-triangular ones, stored as a non-trainable buffer
        self.register_buffer("mask", torch.tril(
            torch.ones(max_len, max_len)
        ).view(1, 1, max_len, max_len))

    def forward(self, x):
        B, T, C = x.size()
        # (B, T, 3*C) -> (3, B, num_heads, T, head_dim)
        qkv = self.qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Scaled dot-product attention scores: (B, num_heads, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / (k.size(-1) ** 0.5))
        # Block attention to future positions
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = torch.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        # Merge heads back: (B, num_heads, T, head_dim) -> (B, T, C)
        y = (att @ v).transpose(1, 2).reshape(B, T, C)
        y = self.resid_dropout(self.proj(y))
        return y

class GPTBlock(nn.Module):
    def __init__(self, d_model, num_heads, max_len, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, num_heads, max_len, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        # Pre-LayerNorm residual connections
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_len, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positional embeddings
        self.dropout = nn.Dropout(dropout)
        self.blocks = nn.Sequential(*[
            GPTBlock(d_model, num_heads, max_len, dropout)
            for _ in range(num_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx, targets=None):
        # idx: (B, T) token ids, with T <= max_len
        B, T = idx.size()
        pos = torch.arange(0, T, device=idx.device).unsqueeze(0)
        x = self.dropout(self.token_emb(idx) + self.pos_emb(pos))
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)
        if targets is None:
            return logits
        # targets: (B, T) next-token ids, i.e. idx shifted left by one position
        loss = nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            targets.view(-1)
        )
        return logits, loss
Zero-Shot vs Few-Shot vs Fine-Tuning
| Approach | Examples | Performance | Cost |
|---|---|---|---|
| Zero-shot | 0 | Moderate | Low |
| One-shot | 1 | Good | Low |
| Few-shot | 5-100 | Very Good | Low |
| Fine-tuning | 1000+ | Best | High |
GPT-3's in-context learning capability emerges from scale: the model learns to "learn" from examples provided in the prompt without updating its weights.
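In practice, the difference between zero-shot, one-shot, and few-shot prompting is only how many demonstrations are placed in the prompt. The sketch below shows one possible way to assemble such a prompt; the `Input:`/`Output:` layout and the helper name `build_few_shot_prompt` are illustrative assumptions, not a format prescribed by GPT-3.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Format demonstrations and a new query into a single prompt string."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final slot is left open for the model to complete
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

examples = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt("Translate English to French.", examples, "cat")
print(prompt)
```

Passing an empty `examples` list yields a zero-shot prompt and a single pair yields a one-shot prompt. In every case the model's weights stay fixed and only the prompt text changes.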
Text Generation Strategies
| Strategy | Description | Control |
|---|---|---|
| Greedy | Always pick highest probability | None |
| Beam search | Keep top-k sequences | num_beams |
| Top-k | Sample from top-k tokens | top_k |
| Top-p (nucleus) | Sample from smallest set with cumulative prob ≥ p | top_p |
| Temperature | Scale logits before softmax | temperature |
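
The sampling strategies in the table can be sketched in a few lines of pure Python. This is a simplified illustration over a made-up four-token logit vector, not the implementation used by any particular library. The helper names are hypothetical and beam search is omitted.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature scaling: divide logits by T before the softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up next-token logits
rng = random.Random(0)
probs = softmax(logits, temperature=0.7)

greedy = max(range(len(probs)), key=lambda i: probs[i])
print("greedy:", greedy)
print("top-k :", sample(top_k_filter(probs, k=2), rng))
print("top-p :", sample(top_p_filter(probs, p=0.9), rng))
```

Temperatures below 1 sharpen the distribution toward the greedy choice, and temperatures above 1 flatten it. Top-k and top-p then limit which tokens remain eligible before sampling.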