
Pre-training Language Models


Pretraining Language Models: Learning Language from the Internet

Pre-training is the foundational stage of LLM development, where models learn general language representations from massive text corpora. This guide covers training objectives, loss functions, and the scaling laws that govern this process.

  • CLM vs MLM: Causal Language Modeling dominates modern LLM training
  • Scaling Laws: Model size and data should scale equally with compute
  • Data Quality: Deduplication and curation matter as much as quantity

Scale is not just a feature; it is the strategy.

Definition: Pre-training

Pre-training is the process of training a language model on a large unlabeled text corpus using self-supervised learning objectives. The model learns to predict tokens from context, acquiring general knowledge of language structure and semantics along with facts about the world.

Language Modeling Objectives

Causal Language Modeling (CLM)

CLM is the objective used by GPT, LLaMA, and most modern LLMs. The model predicts the next token given all previous tokens.

Causal Language Modeling

$$\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$

Here,

  • $x_t$ = Token at position $t$
  • $T$ = Sequence length
  • $\theta$ = Model parameters
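To make the objective concrete, the following minimal sketch builds the (context, target) pairs for a short sequence and sums the negative log-probabilities. The probabilities are made-up values standing in for a model's output, the helper names are hypothetical, and the sketch assumes no beginning-of-sequence token, so the first token serves only as context.

```python
import math

def next_token_pairs(tokens):
    """Pair each prefix x_1..x_{t-1} with the token x_t it should predict."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

def clm_loss(target_probs):
    """Sum of negative log-probabilities assigned to the observed tokens."""
    return -sum(math.log(p) for p in target_probs)

tokens = ["the", "cat", "sat", "down"]
pairs = next_token_pairs(tokens)

# Hypothetical probabilities a model assigns to "cat", "sat", "down".
target_probs = [0.5, 0.25, 0.125]
loss = clm_loss(target_probs)

for context, target in pairs:
    print(context, "->", target)
print(f"CLM loss: {loss:.4f} nats")
```

Every position after the first yields one training example, which is why a single sequence of length T provides roughly T prediction targets.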

Masked Language Modeling (MLM)

MLM is the objective used by BERT. The model predicts randomly masked tokens given the bidirectional context.

Masked Language Modeling

$$\mathcal{L}_{\text{MLM}} = -\sum_{t \in \mathcal{M}} \log P(x_t \mid x_{\setminus \mathcal{M}}; \theta)$$

Here,

  • $\mathcal{M}$ = Set of masked positions
  • $x_{\setminus \mathcal{M}}$ = Unmasked tokens (bidirectional context)
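A simplified sketch of how the masked set can be chosen is shown below. It assumes a 15% masking rate (the rate used by BERT) and always substitutes a `[MASK]` symbol; BERT's full recipe also leaves some selected tokens unchanged or replaces them with random tokens, which is omitted here. The function name and the fallback to a single masked position are assumptions of this example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked sequence, {position: original token}) for MLM training."""
    rng = random.Random(seed)
    positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    if not positions:
        # Short sequences may select nothing; force one target so the loss is defined.
        positions = [rng.randrange(len(tokens))]
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    targets = {i: tokens[i] for i in positions}
    return masked, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

Only the positions in `targets` contribute to the loss, so a sequence provides far fewer prediction targets under MLM than under CLM.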

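CLM is preferred for LLMs because (1) it enables autoregressive generation directly, (2) it scales more naturally to large models, in part because every position contributes a training signal rather than only the masked subset, and (3) in-context learning has been observed to emerge in large models trained with the CLM objective.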

Cross-Entropy Loss

The training objective for language models is typically the cross-entropy loss between the model's predicted distribution and the true token distribution.

Cross-Entropy Loss
$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}) = -\frac{1}{T} \sum_{t=1}^{T} \log \frac{\exp(z_{x_t})}{\sum_{v=1}^{V} \exp(z_v)}$$

Here,

  • $T$ = Sequence length
  • $P_\theta$ = Model's predicted probability
  • $z_{x_t}$ = Logit for the correct token $x_t$
  • $V$ = Vocabulary size
  • $z_v$ = Logit for vocabulary token $v$
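The right-hand side of the formula can be evaluated directly from logits. The sketch below uses plain Python with a tiny hypothetical vocabulary of three tokens; it subtracts the maximum logit before exponentiating, a standard trick that avoids overflow without changing the result. In practice a framework routine would be used instead of these hand-written helpers.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy(logits_per_step, targets):
    """Mean negative log-probability of the target token at each position."""
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)

# Two positions, vocabulary of size 3 (made-up logits).
logits_per_step = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
loss = cross_entropy(logits_per_step, targets)
print(f"loss = {loss:.4f} nats")
```

The second position has uniform logits, so its contribution is exactly log V, the loss of a model that knows nothing about the next token.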

Perplexity

Perplexity is the standard evaluation metric for language models, measuring how well the model predicts the test data.

Perplexity

$$\text{PPL} = \exp(\mathcal{L}) = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right)$$

Here,

  • $\text{PPL}$ = Perplexity
  • $\mathcal{L}$ = Cross-entropy loss
  • $T$ = Test sequence length

Intuitively, perplexity is the average branching factor: the number of equally likely next tokens the model is effectively choosing among at each position.

Perplexity as Branching Factor

$$\text{PPL} = 2^{H} = 2^{-\frac{1}{T} \sum_t \log_2 P(x_t \mid x_{<t})}$$

Here,

  • $H$ = Cross-entropy of the model on the test data, in bits per token

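A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 candidates. Perplexity depends on the tokenizer and the evaluation corpus, so values are comparable only under the same setup. As a reference point, GPT-3 reports a zero-shot perplexity of about 20.5 on Penn Treebank (Brown et al., 2020); comparable published figures are not available for GPT-4.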
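The two perplexity formulas agree once the loss is expressed in matching units, which a short sketch can confirm. The helper names are hypothetical, and the loss value is chosen so that the perplexity comes out to 10.

```python
import math

def perplexity_from_nats(loss_nats):
    """PPL = exp(L) when the loss uses natural logarithms."""
    return math.exp(loss_nats)

def perplexity_from_bits(entropy_bits):
    """PPL = 2^H when the loss is measured in bits per token."""
    return 2.0 ** entropy_bits

loss_nats = math.log(10.0)               # about 2.303 nats per token
entropy_bits = loss_nats / math.log(2.0)  # the same loss, about 3.322 bits

print(perplexity_from_nats(loss_nats))
print(perplexity_from_bits(entropy_bits))
```

Converting between the two units only requires dividing or multiplying by ln 2; the perplexity itself is unit-free.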

Training Data

The quality and diversity of training data are critical for LLM performance.

Data Sources

  • Web crawls: Common Crawl, C4, RefinedWeb
  • Books: Books3, Gutenberg
  • Code: GitHub, StackOverflow
  • Academic: arXiv, S2ORC
  • Wikipedia: Multiple languages

Data Quality Pipeline

  1. Deduplication: Remove duplicate documents (exact and fuzzy)
  2. Filtering: Remove low-quality, toxic, or PII content
  3. Re-weighting: Adjust domain proportions based on quality
  4. Tokenization: Convert to token sequences
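The first two steps of the pipeline can be sketched in a few lines. This is a minimal illustration rather than a production pipeline: the function names and the word-count threshold are assumptions made for this example, and large-scale systems typically approximate fuzzy matching with MinHash and locality-sensitive hashing instead of comparing documents pairwise.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def exact_dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, n=3):
    """Set of overlapping word n-grams used for fuzzy comparison."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b, n=3):
    """Jaccard similarity of two documents' shingle sets (1.0 = identical)."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)

def length_filter(docs, min_words=2):
    """Crude quality filter: drop documents with too few words."""
    return [d for d in docs if len(d.split()) >= min_words]

docs = [
    "Hello   world",
    "hello world",
    "A much longer document about language models.",
    "hi",
]
cleaned = length_filter(exact_dedup(docs))
similarity = jaccard("the cat sat on the mat", "the cat sat on the mat today")
print(cleaned)
print(similarity)
```

A fuzzy deduplication pass would drop one document from any pair whose similarity exceeds a chosen threshold; the threshold itself is a tuning decision.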

Data Mixing Proportions

$$P(\text{domain}_i) = \frac{w_i \cdot |D_i|}{\sum_j w_j \cdot |D_j|}$$

Here,

  • $w_i$ = Weight for domain $i$
  • $|D_i|$ = Size of domain $i$
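As a worked instance of the mixing formula, the sketch below computes sampling probabilities for three domains. The domain sizes and weights are made-up numbers chosen for easy arithmetic, not values from any published model.

```python
def mixing_proportions(sizes, weights):
    """Sampling probability per domain: w_i * |D_i| normalized over all domains."""
    scores = {domain: weights[domain] * sizes[domain] for domain in sizes}
    total = sum(scores.values())
    return {domain: score / total for domain, score in scores.items()}

# Hypothetical corpus sizes (billions of tokens) and quality weights.
sizes = {"web": 800, "code": 100, "books": 100}
weights = {"web": 0.5, "code": 2.0, "books": 4.0}

proportions = mixing_proportions(sizes, weights)
print(proportions)
```

Here the web domain holds 80% of the raw tokens but is sampled only 40% of the time, while books rise from 10% to 40%: up-weighting a small domain means its documents are repeated more often during training.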

Chinchilla Scaling Laws

The Chinchilla paper (Hoffmann et al., 2022) established optimal scaling relationships between model size and data size.

Chinchilla Optimal Scaling
$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$

Here,

  • $N_{\text{opt}}$ = Optimal model size for compute budget $C$
  • $D_{\text{opt}}$ = Optimal data size for compute budget $C$
  • $C$ = Total compute budget (FLOPs)

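This implies that model size and data size should grow in equal proportion as compute increases: with both exponents at 0.5, quadrupling the compute budget roughly doubles both the optimal parameter count and the optimal token count. This challenged the prior practice of training very large models on comparatively little data.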
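A rough way to turn a compute budget into a model size and a data size is sketched below. It assumes the common approximation C ≈ 6ND for training FLOPs and a fixed ratio of about 20 tokens per parameter, which matches Chinchilla's own 70B-parameter, 1.4T-token configuration. Both are simplifying assumptions of this sketch rather than exact results from the paper, and `chinchilla_optimal` is a hypothetical helper name.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Estimate (parameters, tokens) for a FLOP budget.

    Assumes C ~= 6 * N * D and D = tokens_per_param * N,
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# 6 * 70e9 * 1.4e12 = 5.88e23 FLOPs under the same approximation.
n_params, n_tokens = chinchilla_optimal(5.88e23)
print(f"parameters: {n_params:.3g}, tokens: {n_tokens:.3g}")

# Quadrupling compute doubles both quantities, as the C^0.5 exponents imply.
n4, d4 = chinchilla_optimal(4 * 5.88e23)
print(n4 / n_params, d4 / n_tokens)
```

Because both quantities follow the square root of compute, the tokens-per-parameter ratio stays constant across budgets in this simplified model.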

Chinchilla vs GPT-3

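| Model | Parameters | Training tokens | Tokens per parameter | Training FLOPs (approx.) |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | 3.1e23 |
| Chinchilla | 70B | 1.4T | 20 | 5.8e23 |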

Chinchilla achieves better performance with fewer parameters but more data, demonstrating the importance of data scaling.

For a detailed treatment of scaling laws, see our module on Scaling Laws and Chinchilla.

Curriculum Learning

Curriculum learning involves presenting training data in a structured order, from easy to hard examples.

Definition: Curriculum Learning

Curriculum learning is a training strategy where data is presented in order of increasing difficulty. For LLMs, this can mean: (1) starting with shorter sequences, (2) gradually increasing data complexity, or (3) focusing on higher-quality data later in training.
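Two of these strategies can be sketched with small helpers: a sequence-length schedule that grows linearly over training, and an ordering of examples by a difficulty score. The function names, the default lengths, and the use of word count as a difficulty proxy are all assumptions of this example; real curricula vary widely.

```python
def sequence_length_schedule(step, total_steps, start_len=512, end_len=2048, multiple=64):
    """Linearly grow the training sequence length, rounded down to a multiple."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    length = start_len + frac * (end_len - start_len)
    return int(length // multiple * multiple)

def order_by_difficulty(examples, difficulty=len):
    """Sort examples from easiest to hardest under the given difficulty score."""
    return sorted(examples, key=difficulty)

for step in (0, 500, 1000):
    print(step, sequence_length_schedule(step, total_steps=1000))

examples = ["a b c", "a", "a b"]
print(order_by_difficulty(examples, difficulty=lambda s: len(s.split())))
```

Rounding the length to a multiple of 64 is a hardware-friendly convention, not a requirement of curriculum learning itself.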

Practical Example: Pre-training with HuggingFace

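```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Llama 2's tokenizer defines no pad token; the collator needs one to pad batches.
tokenizer.pad_token = tokenizer.eos_token

# from_pretrained loads the released weights, so this continues pre-training.
# For random initialization, build the model from its config instead:
#   AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_name))
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
# WikiText contains many blank lines; drop them before tokenizing.
dataset = dataset.filter(lambda example: example["text"].strip() != "")

def tokenize(examples):
    # Truncate each example to at most 2048 tokens.
    return tokenizer(examples["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir="./llama2-pretrained",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # 4 x 8 = 32 sequences per device per update
    learning_rate=2e-4,
    warmup_steps=1000,
    max_steps=100000,
    fp16=True,
    logging_steps=100,
    save_steps=10000,
    optim="adamw_torch",
    weight_decay=0.1,
    lr_scheduler_type="cosine",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # mlm=False selects the causal (next-token) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
```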

Pre-training from scratch requires massive compute resources (typically thousands of GPUs for months). For most practitioners, fine-tuning a pre-trained model is more practical. See our modules on Fine-tuning and LoRA.

Practice Exercises

  1. Mathematical: Calculate the perplexity of a model with cross-entropy loss of 3.2 nats. How does this compare to a model with loss of 2.8 nats?

  2. Analysis: Given a compute budget of 1e24 FLOPs, what is the Chinchilla-optimal model size and training data size? How does this compare to GPT-3's configuration?

  3. Implementation: Implement a simple CLM training loop from scratch using PyTorch. Train a small model on a text file and track perplexity over training.

  4. Research: Compare the data mixing proportions used in LLaMA 2, Mistral, and Qwen. How do their domain weights differ?

Key Takeaways:

  • Causal Language Modeling (CLM) is the dominant pre-training objective for LLMs
  • Cross-entropy loss measures the difference between predicted and true token distributions
  • Perplexity (exp of loss) measures model uncertainty as the effective number of equally likely next tokens
  • Chinchilla scaling laws show model size and data should scale equally with compute
  • Data quality and deduplication are as important as data quantity
  • Curriculum learning can improve training efficiency

What to Learn Next

-> Fine-Tuning LLMs: Customizing language models for your specific tasks and domains.

-> LoRA and PEFT: Efficient fine-tuning without full retraining using low-rank adaptation.

-> QLoRA and Quantization: Running LLMs on consumer hardware with INT4 quantization.

-> RLHF and Alignment: Making LLMs safe and helpful through reinforcement learning from human feedback.

-> Constitutional AI: Reducing dependence on human annotation through AI self-alignment.

-> Scaling Laws and Chinchilla: Understanding the mathematical relationships governing model performance.
