GPT Models

The Generative Pre-trained Transformer (GPT) family demonstrates the power of scaling autoregressive language models. GPT models use only the decoder portion of the transformer with causal masking.

Autoregressive Language Modeling

GPT models predict the next token given all previous tokens.

Definition: Autoregressive Language Modeling

$$P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$$
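As a minimal sketch of this factorization, the snippet below scores a short sequence by summing per-token log-probabilities. The toy model `next_token_probs` and the helper `sequence_log_prob` are hypothetical names introduced for illustration; the toy model only looks at the last token of the prefix, whereas a real GPT conditions on the whole prefix.

```python
import math

def next_token_probs(prefix):
    """Hypothetical toy model: P(next token | prefix) over a 3-token vocabulary."""
    if not prefix:
        return [1 / 3, 1 / 3, 1 / 3]       # uniform when there is no context
    probs = [0.25, 0.25, 0.25]
    probs[prefix[-1]] = 0.5                # favor repeating the last token
    return probs

def sequence_log_prob(tokens):
    """Chain rule: log P(x) = sum over t of log P(x_t | x_1..x_{t-1})."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(next_token_probs(tokens[:t])[token])
    return total

log_p = sequence_log_prob([0, 0, 1])
print(log_p, math.exp(log_p))              # probability is 1/3 * 0.5 * 0.25
```

Working in log space turns the product of conditionals into a sum, which avoids numerical underflow on long sequences.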

The model maximizes the log-likelihood of the training data:

Definition: Training Objective

$$\max_{\theta} \sum_{t=1}^{T} \log P_{\theta}(x_t \mid x_1, \dots, x_{t-1})$$
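In practice, maximizing the log-likelihood is usually implemented as minimizing next-token cross-entropy, with the targets being the input sequence shifted by one position. The sketch below assumes random logits as a stand-in for a model's output; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 10
tokens = torch.randint(0, vocab_size, (2, 6))    # (batch, seq_len)

# Inputs are tokens 0..T-2; targets are the same sequence shifted left by one.
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# Stand-in for model(inputs): random logits of shape (batch, T-1, vocab).
logits = torch.randn(inputs.size(0), inputs.size(1), vocab_size)

# cross_entropy averages -log P(x_t | x_<t) over every position in the batch.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

# The same quantity computed by hand from the log-softmax.
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()
print(loss.item(), manual.item())
```

The two printed values should agree, which shows that the cross-entropy loss is the negative mean log-likelihood of the observed next tokens.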

Causal Masking

Causal masking ensures each position can only attend to previous positions (and itself), preventing information leakage from future tokens.

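import torch

def create_causal_attention_mask(seq_len, device='cpu'):
    """Return an additive mask of shape (1, 1, seq_len, seq_len): 0 where attention is allowed, -inf for future positions."""
    # Ones strictly above the diagonal mark future positions.
    mask = torch.triu(torch.ones(seq_len, seq_len, device=device), diagonal=1)
    # Future positions become -inf so that softmax assigns them zero weight.
    mask = mask.masked_fill(mask == 1, float('-inf'))
    # Add batch and head dimensions for broadcasting.
    return mask.unsqueeze(0).unsqueeze(0)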

# Example for sequence length 5
mask = create_causal_attention_mask(5)
print(mask)
# tensor([[[[ 0., -inf, -inf, -inf, -inf],
#           [ 0.,   0., -inf, -inf, -inf],
#           [ 0.,   0.,   0., -inf, -inf],
#           [ 0.,   0.,   0.,   0., -inf],
#           [ 0.,   0.,   0.,   0.,   0.]]]])
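To see the effect of the mask, it can be added to raw attention scores before the softmax. This is a minimal sketch with random scores standing in for the query-key products; every weight above the diagonal should come out as exactly zero.

```python
import torch

torch.manual_seed(0)
seq_len = 5

# Additive mask: 0 on and below the diagonal, -inf above it.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.randn(seq_len, seq_len)           # stand-in for q @ k^T / sqrt(d)
weights = torch.softmax(scores + mask, dim=-1)   # masked positions get weight 0

print(weights)
print(weights.sum(dim=-1))                       # each row still sums to 1
```

Because exp(-inf) is 0, the softmax renormalizes over the allowed positions only, so the first token attends solely to itself.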

GPT-2 Architecture

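Configuration of the largest GPT-2 model (GPT-2 XL):

| Parameter | Value |
|---|---|
| Vocabulary size | 50,257 (BPE) |
| Context window | 1024 tokens |
| Model dimension | 1600 |
| Attention heads | 25 |
| Transformer layers | 48 |
| Total parameters | 1.5 billion |
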
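from transformers import GPT2LMHeadModel, GPT2Tokenizer

# 'gpt2' loads the smallest checkpoint (about 124M parameters);
# the 1.5B configuration in the table corresponds to 'gpt2-xl'.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')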

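def generate_text(prompt, max_length=50, temperature=1.0, top_k=50):
    """Sample a continuation of `prompt`; max_length counts the prompt tokens too."""
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    output = model.generate(
        input_ids,
        max_length=max_length,
        temperature=temperature,
        top_k=top_k,
        do_sample=True,                       # sample instead of greedy decoding
        no_repeat_ngram_size=2,               # never repeat the same bigram
        pad_token_id=tokenizer.eos_token_id   # GPT-2 has no pad token; reuse EOS
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)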

print(generate_text("The future of AI is"))
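The parameter total in the GPT-2 table can be checked roughly from the other rows. The sketch below is an approximation under stated assumptions: a 4x-wide feed-forward layer, an output head that shares the token embedding, and biases and LayerNorm parameters ignored. The helper name `approx_gpt_params` is hypothetical.

```python
def approx_gpt_params(vocab_size, context, d_model, num_layers):
    """Rough count: embeddings plus about 12 * d_model^2 weights per block."""
    embeddings = vocab_size * d_model + context * d_model   # token + position tables
    attention = 4 * d_model**2                               # Q, K, V and output projections
    feed_forward = 8 * d_model**2                            # d -> 4d -> d
    return embeddings + num_layers * (attention + feed_forward)

total = approx_gpt_params(vocab_size=50257, context=1024, d_model=1600, num_layers=48)
print(f"{total / 1e9:.2f}B")   # prints 1.56B
```

The estimate lands close to the 1.5 billion quoted in the table, with almost all of the weights sitting in the transformer blocks rather than the embeddings.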

GPT-3 Scaling

GPT-3 demonstrated that scaling up language models dramatically improves few-shot and zero-shot performance.

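| Model | Parameters | Layers | d_model | Heads | Training Tokens |
|---|---|---|---|---|---|
| GPT-3 Small | 125M | 12 | 768 | 12 | 300B |
| GPT-3 Medium | 350M | 24 | 1024 | 16 | 300B |
| GPT-3 Large | 760M | 24 | 1536 | 16 | 300B |
| GPT-3 XL | 1.3B | 24 | 2048 | 24 | 300B |
| GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 300B |
| GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 300B |
| GPT-3 13B | 13B | 40 | 5140 | 40 | 300B |
| GPT-3 175B | 175B | 96 | 12288 | 96 | 300B |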

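Definition: Scaling Law (Kaplan et al.)

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Here N is the number of parameters, L is the expected loss, and N_c and α_N are constants fitted to experimental results.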
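A power law of the form L(N) = (N_c / N)^α_N can be evaluated directly to get a feel for how slowly the loss falls with model size. In the sketch below, the default constants (N_c = 8.8e13, α_N = 0.076) are assumptions based on the approximate values reported by Kaplan et al. and are not given in this lesson; the function name `kaplan_loss` is hypothetical.

```python
def kaplan_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power law L(N) = (N_c / N) ** alpha_N; the default constants are assumed values."""
    return (n_c / n_params) ** alpha_n

# Parameter counts taken from the GPT-3 table above.
for n in [125e6, 1.3e9, 13e9, 175e9]:
    print(f"N = {n:.3g}: predicted loss ~ {kaplan_loss(n):.2f}")
```

With an exponent this small, each tenfold increase in parameters lowers the predicted loss by only a modest, roughly constant factor, which is why large gains required large jumps in scale.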

Complete GPT Implementation

import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads, max_len, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)

        # Causal mask
        self.register_buffer("mask", torch.tril(
            torch.ones(max_len, max_len)
        ).view(1, 1, max_len, max_len))

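    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, d_model
        # Project to Q, K, V and split heads: each is (B, num_heads, T, head_dim).
        qkv = self.qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        # Scaled dot-product scores, shape (B, num_heads, T, T).
        att = (q @ k.transpose(-2, -1)) * (1.0 / (k.size(-1) ** 0.5))
        # Block attention to future positions before the softmax.
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = torch.softmax(att, dim=-1)
        att = self.attn_dropout(att)

        # Merge heads back to (B, T, C) and apply the output projection.
        y = (att @ v).transpose(1, 2).reshape(B, T, C)
        y = self.resid_dropout(self.proj(y))
        return y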

class GPTBlock(nn.Module):
    def __init__(self, d_model, num_heads, max_len, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, num_heads, max_len, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_len, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.dropout = nn.Dropout(dropout)

        self.blocks = nn.Sequential(*[
            GPTBlock(d_model, num_heads, max_len, dropout)
            for _ in range(num_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx, targets=None):
        B, T = idx.size()
        assert T <= self.pos_emb.num_embeddings, "sequence length exceeds max_len"
        pos = torch.arange(0, T, device=idx.device).unsqueeze(0)

        x = self.dropout(self.token_emb(idx) + self.pos_emb(pos))
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        if targets is None:
            return logits

        loss = nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            targets.view(-1)
        )
        return logits, loss
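A model like the `GPT` class above produces text by repeatedly sampling the next token and appending it to the context. The following is a minimal sketch of such a decoding loop; `TinyLM` is a hypothetical stand-in so the example runs on its own, and `generate` is an illustrative helper that should work with any module mapping token ids of shape (B, T) to logits of shape (B, T, vocab).

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical stand-in for a language model: (B, T) ids -> (B, T, vocab) logits."""
    def __init__(self, vocab_size=16, d_model=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        return self.head(self.emb(idx))

@torch.no_grad()
def generate(model, idx, max_new_tokens, max_len, temperature=1.0):
    """Sample tokens one at a time, feeding each one back in as context."""
    model.eval()
    for _ in range(max_new_tokens):
        context = idx[:, -max_len:]                  # crop to the positional limit
        logits = model(context)[:, -1, :]            # logits for the next token only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

torch.manual_seed(0)
out = generate(TinyLM(), torch.zeros(1, 1, dtype=torch.long), max_new_tokens=5, max_len=4)
print(out.shape)  # torch.Size([1, 6])
```

Cropping the context to `max_len` matters because the learned positional embedding table has no entries beyond that length.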

Zero-Shot vs Few-Shot vs Fine-Tuning

| Approach | Examples | Performance | Cost |
|---|---|---|---|
| Zero-shot | 0 | Moderate | Low |
| One-shot | 1 | Good | Low |
| Few-shot | 5-100 | Very Good | Low |
| Fine-tuning | 1000+ | Best | High |

GPT-3's in-context learning capability emerges from scale: the model learns to "learn" from examples provided in the prompt without updating weights.
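In-context learning is driven entirely by how the prompt is assembled. The sketch below builds a few-shot prompt from demonstration pairs; the helper `build_few_shot_prompt` and the "Input:/Output:" layout are illustrative assumptions, not a format prescribed by GPT-3.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) demonstrations followed by the unanswered query."""
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}\nOutput: {target}")
    lines.append(f"Input: {query}\nOutput:")         # the model completes this line
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(
    examples=[("cheese", "fromage"), ("house", "maison")],
    query="cat",
    instruction="Translate English to French.",
)
print(prompt)
```

Passing an empty `examples` list yields a zero-shot prompt, and a single pair yields a one-shot prompt, so the same helper covers the first three rows of the table.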

Text Generation Strategies

| Strategy | Description | Control |
|---|---|---|
| Greedy | Always pick highest probability | None |
| Beam search | Keep top-k sequences | num_beams |
| Top-k | Sample from top-k tokens | top_k |
| Top-p (nucleus) | Sample from smallest set with cumulative prob ≥ p | top_p |
| Temperature | Scale logits before softmax | temperature |
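Several of these strategies reduce to small transformations of the logits before sampling. As a rough sketch, the snippet below contrasts greedy decoding with temperature-scaled top-k sampling; `sample_top_k` is a hypothetical helper written for this example, not a library function.

```python
import torch

def sample_top_k(logits, top_k=50, temperature=1.0):
    """Sample one token id per row after temperature scaling and top-k filtering."""
    logits = logits / temperature                    # below 1 sharpens, above 1 flattens
    k = min(top_k, logits.size(-1))
    kth_value = torch.topk(logits, k, dim=-1).values[..., -1, None]
    logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])
greedy = logits.argmax(dim=-1)                       # greedy: always the arg max
sampled = sample_top_k(logits, top_k=2)              # only ids 0 and 1 can be drawn
print(greedy.item(), sampled.item())
```

Greedy decoding is deterministic, while top-k keeps some randomness but rules out the low-probability tail entirely.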

Top-p (Nucleus) Sampling
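Nucleus sampling keeps the smallest set of tokens whose cumulative probability reaches p and samples only from that set. The following is a minimal sketch of one common way to implement the filter in PyTorch; `top_p_filter` is a hypothetical helper name, and the example probabilities are made up for illustration.

```python
import torch

def top_p_filter(logits, top_p=0.9):
    """Mask logits outside the smallest set whose cumulative probability reaches top_p."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = sorted_probs.cumsum(dim=-1)
    # Drop a token if the mass before it already reaches top_p (the top token always stays).
    remove_sorted = (cum_probs - sorted_probs) >= top_p
    # Map the removal flags from sorted order back to vocabulary order.
    remove = torch.zeros_like(remove_sorted).scatter(-1, sorted_idx, remove_sorted)
    return logits.masked_fill(remove, float("-inf"))

torch.manual_seed(0)
probs = torch.tensor([[0.05, 0.5, 0.15, 0.3]])       # illustrative distribution
filtered = top_p_filter(probs.log(), top_p=0.7)
kept = torch.isfinite(filtered)
print(kept)                                          # ids 1 and 3 cover 0.8 >= 0.7

next_id = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
print(next_id.item())
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: a confident model keeps few candidates and an uncertain one keeps many.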

⭐

Premium Content

GPT Models

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert NLP Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement