Pre-training Language Models: Learning Language from the Internet

Pre-training is the foundational stage of LLM development, where models learn general language representations from massive text corpora. This guide covers training objectives, loss functions, and the scaling laws that govern this process.

CLM vs MLM — Causal Language Modeling dominates modern LLM training
Scaling Laws — Model size and data should scale equally with compute
Data Quality — Deduplication and curation matter as much as quantity

Scale is not just a feature—it is the strategy.

Definition: Pre-training

Pre-training is the process of training a language model on a large unlabeled text corpus using self-supervised learning objectives. The model learns to predict tokens from context, acquiring general knowledge about language structure, semantics, and world knowledge.

Language Modeling Objectives

Causal Language Modeling (CLM)

CLM is the objective used by GPT, LLaMA, and most modern LLMs. The model predicts the next token given all previous tokens.

Causal Language Modeling

\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}; \theta)

Here,

$x_t$ = Token at position $t$
$T$ = Sequence length
$\theta$ = Model parameters
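
As a minimal sketch of the formula above, the snippet below sums the negative log-probabilities assigned to each observed token. The per-position probabilities are made-up numbers for illustration, not the output of any real model, and `clm_loss` is a hypothetical helper name.

```python
import math

def clm_loss(token_probs):
    """Sum of negative log-probabilities of the observed tokens.

    token_probs[t] is assumed to be P(x_t | x_1, ..., x_{t-1}), i.e. the
    probability the model gave to the token that actually appeared.
    """
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities for a 4-token sequence.
probs = [0.5, 0.25, 0.8, 0.1]
loss = clm_loss(probs)
print(f"CLM loss (nats): {loss:.4f}")
```

A model that assigned probability 1 to every observed token would have a loss of 0; lower probabilities on the true tokens push the loss up.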

Masked Language Modeling (MLM)

MLM is the objective used by BERT. The model predicts randomly masked tokens given the bidirectional context.

Masked Language Modeling

\mathcal{L}_{\text{MLM}} = -\sum_{t \in \mathcal{M}} \log P(x_t | x_{\setminus \mathcal{M}}; \theta)

Here,

$\mathcal{M}$ = Set of masked positions
$x_{\setminus \mathcal{M}}$ = Unmasked tokens (bidirectional context)
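
To make the set of masked positions concrete, here is a small sketch of the masking step. It is a simplification: each position is masked independently with a fixed probability, whereas BERT also sometimes keeps or randomizes a selected token instead of always writing `[MASK]`. The function name and defaults are illustrative assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, masked_positions).

    Each position is selected independently with probability mask_prob.
    The loss is later computed only at the returned positions.
    """
    rng = random.Random(seed)
    masked, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions

tokens = "the cat sat on the mat because it was tired".split()
masked, positions = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print("loss is computed only at positions:", positions)
```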

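CLM is preferred for generative LLMs for three reasons: (1) it matches autoregressive generation, so the model is trained on exactly the task it performs at inference time; (2) every position in a sequence contributes to the loss, whereas MLM trains only on the masked subset (15% of tokens in BERT), so each batch yields more training signal; and (3) in-context learning has been observed to emerge in large models trained with the CLM objective.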

Cross-Entropy Loss

The training objective for language models is typically the cross-entropy loss between the model's predicted distribution and the true token distribution.

Cross-Entropy Loss

\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t | x_{<t}) = -\frac{1}{T} \sum_{t=1}^{T} \log \frac{\exp(z_{x_t})}{\sum_{v=1}^{V} \exp(z_v)}

Here,

$T$ = Sequence length
$P_\theta$ = Model's predicted probability
$z_{x_t}$ = Logit for the correct token $x_t$
$V$ = Vocabulary size
$z_v$ = Logit for vocabulary token $v$
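
The softmax form of the loss can be sketched in plain Python as follows. The logits and target ids are toy values chosen for illustration; a real training loop would use a framework's built-in cross-entropy on tensors instead.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy(logits_per_step, targets):
    """Mean negative log-likelihood (in nats) of the target token ids."""
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)

# Toy example: vocabulary of V = 3 tokens, sequence of T = 2 positions.
logits_per_step = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(f"loss = {cross_entropy(logits_per_step, targets):.4f} nats")
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow when logits are large.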

Perplexity

Perplexity is the standard evaluation metric for language models, measuring how well the model predicts the test data.

Perplexity

\text{PPL} = \exp(\mathcal{L}) = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t | x_{<t})\right)

Here,

$\text{PPL}$ = Perplexity
$\mathcal{L}$ = Cross-entropy loss
$T$ = Test sequence length

Intuitively, perplexity is the effective branching factor: the number of equally likely next tokens the model is choosing among at each position.

Perplexity as Branching Factor

\text{PPL} = 2^{H} = 2^{-\frac{1}{T} \sum_t \log_2 P(x_t | x_{<t})}

Here,

$H$ = Cross-entropy of the model on the data, in bits per token (the loss in nats divided by $\ln 2$)

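A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 candidates. Reported values depend on the dataset and the tokenizer, so they are only comparable under the same setup; for example, the GPT-3 paper reports a zero-shot perplexity of about 20.5 on Penn Treebank.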
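
The two perplexity formulas agree once the loss is expressed in matching units. The sketch below checks this for a hypothetical model that is uniform over 10 tokens; the function names are illustrative.

```python
import math

def perplexity_from_nats(loss_nats):
    """PPL = exp(L) when the loss uses natural logarithms."""
    return math.exp(loss_nats)

def perplexity_from_bits(loss_bits):
    """PPL = 2**H when the loss is measured in bits per token."""
    return 2.0 ** loss_bits

loss_nats = math.log(10)             # loss of a model uniform over 10 tokens
loss_bits = loss_nats / math.log(2)  # the same loss converted to bits per token

print(perplexity_from_nats(loss_nats))  # approximately 10
print(perplexity_from_bits(loss_bits))  # approximately 10
```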

Training Data

The quality and diversity of training data are critical for LLM performance.

Data Sources

Web crawls: Common Crawl, C4, RefinedWeb
Books: Books3, Gutenberg
Code: GitHub, StackOverflow
Academic: arXiv, S2ORC
Wikipedia: Multiple languages

Data Quality Pipeline

Deduplication: Remove duplicate documents (exact and fuzzy)
Filtering: Remove low-quality, toxic, or PII content
Re-weighting: Adjust domain proportions based on quality
Tokenization: Convert to token sequences
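
A toy version of parts of this pipeline might look like the sketch below. It covers only exact deduplication, a length-based filter, and a placeholder tokenizer; fuzzy deduplication, toxicity and PII filtering, and re-weighting are omitted. The helper names and the `min_words` threshold are hypothetical choices, not a recommended configuration.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def quality_filter(docs, min_words=5):
    """Hypothetical heuristic: drop very short documents."""
    return [d for d in docs if len(d.split()) >= min_words]

def tokenize(docs):
    """Placeholder whitespace tokenizer; real pipelines use a subword tokenizer."""
    return [d.split() for d in docs]

raw = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "click here",                                      # too short
    "Language models learn statistics of text from large corpora.",
]
clean = quality_filter(exact_dedup(raw))
print(len(raw), "->", len(clean), "documents")
print(tokenize(clean)[0][:4])
```

Hashing a normalized form only catches near-identical copies; production pipelines typically add fuzzy methods on top to catch paraphrased or templated duplicates.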

Data Mixing Proportions

P(\text{domain}_i) = \frac{w_i \cdot |D_i|}{\sum_j w_j \cdot |D_j|}

Here,

$w_i$ = Weight for domain $i$
$|D_i|$ = Size of domain $i$
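
The mixing formula translates directly into a few lines of code. The domain sizes and weights below are invented for illustration and do not describe any published training mix.

```python
def mixing_proportions(sizes, weights):
    """Return P(domain_i) = w_i * |D_i| / sum_j w_j * |D_j|."""
    scores = {name: weights[name] * sizes[name] for name in sizes}
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

# Hypothetical domain sizes (in tokens) and weights.
sizes = {"web": 1000e9, "code": 200e9, "wikipedia": 20e9}
weights = {"web": 1.0, "code": 2.0, "wikipedia": 5.0}

props = mixing_proportions(sizes, weights)
for name, p in props.items():
    print(f"{name:10s} {p:.3f}")
```

A weight above 1 upsamples a domain relative to its raw size, which is how small but high-quality sources can receive a larger share of the training stream.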

Chinchilla Scaling Laws

The Chinchilla paper (Hoffmann et al., 2022) established optimal scaling relationships between model size and data size.

Chinchilla Optimal Scaling

N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}

Here,

$N_{\text{opt}}$ = Optimal model size for compute budget $C$
$D_{\text{opt}}$ = Optimal data size for compute budget $C$
$C$ = Total compute budget (FLOPs)

This implies that for optimal performance, model size and data size should scale equally with compute. This challenged the prior paradigm of training very large models on insufficient data.
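
One way to read the square-root relation: multiplying compute by a factor k multiplies both the optimal parameter count and the optimal token count by the square root of k. The sketch below applies this to a reference point, taking Chinchilla's own configuration (70B parameters, 1.4T tokens) as compute-optimal. That choice, and the helper name `scale_allocation`, are assumptions made for illustration.

```python
def scale_allocation(n_ref, d_ref, compute_ratio, exponent=0.5):
    """Scale a reference (params, tokens) pair to a new compute budget.

    Assumes N_opt and D_opt both grow as C**exponent, with the exponent
    0.5 taken from the relation above. compute_ratio is C_new / C_ref.
    """
    factor = compute_ratio ** exponent
    return n_ref * factor, d_ref * factor

# Reference point: 70B parameters trained on 1.4T tokens.
n_ref, d_ref = 70e9, 1.4e12

# With 4x the compute, both parameters and tokens double.
n_new, d_new = scale_allocation(n_ref, d_ref, compute_ratio=4.0)
print(f"params: {n_new:.3g}, tokens: {d_new:.3g}")
```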

Chinchilla vs GPT-3

Model	Parameters	Training tokens	Tokens/param	Training FLOPs (approx.)
GPT-3	175B	300B	1.7	3.1e23
Chinchilla	70B	1.4T	20	5.9e23

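Chinchilla outperforms GPT-3 on a wide range of benchmarks with fewer than half the parameters but more than four times the training tokens, which demonstrates the importance of scaling data alongside model size. In the original paper, the most direct comparison is with Gopher (280B parameters), which Chinchilla beats at the same compute budget.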
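
The tokens-per-parameter and FLOPs columns can be approximated from the parameter and token counts alone. The sketch below uses the common rule of thumb that training compute is roughly 6 x N x D; that approximation is an assumption introduced here, so the numbers it produces need not match figures reported elsewhere exactly.

```python
def training_flops(params, tokens):
    """Approximate training compute with the rule of thumb C ~ 6 * N * D."""
    return 6.0 * params * tokens

# (parameters, training tokens) for the two models in the comparison.
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n, d) in models.items():
    print(f"{name:10s} tokens/param = {d / n:5.1f}   FLOPs ~ {training_flops(n, d):.2e}")
```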

For a detailed treatment of scaling laws, see our module on Scaling Laws and Chinchilla.

Curriculum Learning

Curriculum learning involves presenting training data in a structured order, from easy to hard examples.

Definition: Curriculum Learning

Curriculum learning is a training strategy where data is presented in order of increasing difficulty. For LLMs, this can mean: (1) starting with shorter sequences, (2) gradually increasing data complexity, or (3) focusing on higher-quality data later in training.
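
As an illustration of option (1), a sequence-length curriculum can be expressed as a schedule that maps the training step to a maximum sequence length. Everything in the sketch below (the function name, the linear ramp, and the default values) is a hypothetical choice rather than a recipe taken from a specific model.

```python
def max_seq_len(step, total_steps, start_len=512, end_len=2048, warmup_frac=0.3):
    """Linearly grow the maximum sequence length early in training.

    All parameter values are hypothetical defaults chosen for illustration.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step >= warmup_steps:
        return end_len
    frac = step / warmup_steps
    length = start_len + frac * (end_len - start_len)
    # Round down to a multiple of 64 to keep tensor shapes regular.
    return int(length) // 64 * 64

for step in (0, 10_000, 20_000, 30_000, 100_000):
    print(step, max_seq_len(step, total_steps=100_000))
```

The data loader would then truncate or pack sequences to the length returned for the current step.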

Practical Example: Pre-training with HuggingFace

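```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Gated checkpoint: requires accepting the license on the Hugging Face Hub.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Llama tokenizers ship without a pad token; the collator needs one to batch.
tokenizer.pad_token = tokenizer.eos_token

# from_pretrained loads the released weights, so this run continues
# pre-training. To start from random weights, build the model with
# AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_name)).
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
# WikiText contains many empty lines; drop them before tokenizing.
dataset = dataset.filter(lambda example: example["text"].strip() != "")

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir="./llama2-pretrained",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch of 32 sequences per device
    learning_rate=2e-4,
    warmup_steps=1000,
    max_steps=100000,
    fp16=True,
    logging_steps=100,
    save_steps=10000,
    optim="adamw_torch",
    weight_decay=0.1,
    lr_scheduler_type="cosine",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # mlm=False selects the causal (next-token) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
```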

Pre-training from scratch requires massive compute resources (typically thousands of GPUs for months). For most practitioners, fine-tuning a pre-trained model is more practical. See our modules on Fine-tuning and LoRA.

Practice Exercises

Mathematical: Calculate the perplexity of a model with cross-entropy loss of 3.2 nats. How does this compare to a model with loss of 2.8 nats?
Analysis: Given a compute budget of 1e24 FLOPs, what is the Chinchilla-optimal model size and training data size? How does this compare to GPT-3's configuration?
Implementation: Implement a simple CLM training loop from scratch using PyTorch. Train a small model on a text file and track perplexity over training.
Research: Compare the data mixing proportions used in LLaMA 2, Mistral, and Qwen. How do their domain weights differ?

Key Takeaways:

Causal Language Modeling (CLM) is the dominant pre-training objective for LLMs
Cross-entropy loss measures the difference between predicted and true token distributions
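Perplexity (the exponential of the loss) expresses model uncertainty as an effective number of equally likely next tokens; its base-2 logarithm is the loss in bits per token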
Chinchilla scaling laws show model size and data should scale equally with compute
Data quality and deduplication are as important as data quantity
Curriculum learning can improve training efficiency

What to Learn Next

-> Fine-Tuning LLMs: Customizing language models for your specific tasks and domains.

-> LoRA and PEFT: Efficient fine-tuning without full retraining using low-rank adaptation.

-> QLoRA and Quantization: Running LLMs on consumer hardware with INT4 quantization.

-> RLHF and Alignment: Making LLMs safe and helpful through reinforcement learning from human feedback.

-> Constitutional AI: Reducing dependence on human annotation through AI self-alignment.

-> Scaling Laws and Chinchilla: Understanding the mathematical relationships governing model performance.
