RLHF and Alignment — Making LLMs Safe and Helpful

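Alignment aims to make LLMs behave in accordance with human values and intentions. This guide covers the reinforcement learning from human feedback (RLHF) pipeline, reward modeling, proximal policy optimization (PPO), direct preference optimization (DPO), and the theoretical foundations for building safe and helpful AI systems.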

RLHF Pipeline — Supervised fine-tuning followed by reward modeling and PPO
DPO — Direct preference optimization bypasses reward modeling for simpler alignment
Constitutional AI — Reduces dependence on human annotation through principles

Alignment is not a feature—it is a responsibility.


Definition: Alignment

Alignment is the process of ensuring that an AI system's behavior matches human values, intentions, and preferences. For LLMs, alignment means producing helpful, harmless, and honest outputs that satisfy user intent.

The Alignment Pipeline

The standard alignment pipeline consists of three stages:

1. Pre-training: Learn general language representations from large corpora
2. Supervised Fine-Tuning (SFT): Fine-tune on high-quality instruction-response pairs
3. RLHF/DPO: Align with human preferences using reward modeling or direct optimization

For a detailed treatment of reinforcement learning fundamentals, see our module on Reinforcement Learning.

Reward Modeling

What is a Reward Model?

A reward model is a neural network trained to predict human preferences. It scores each response to a prompt, so that when two responses are compared, the higher-scoring one is the response a human is predicted to prefer.

Definition: Reward Model

A reward model R(x, y) assigns a scalar score to a (prompt, response) pair, representing how well the response satisfies human preferences. It is trained on pairwise comparison data from human annotators.

Reward Model Training

Reward Model Loss

\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma(R(x, y_w) - R(x, y_l))\right]

Here,

$x$ = Prompt/input
$y_w$ = Preferred (winning) response
$y_l$ = Dispreferred (losing) response
$\sigma$ = Sigmoid function
$R$ = Reward model

This is the Bradley-Terry model for pairwise comparisons:

Bradley-Terry Preference Model

P(y_w \succ y_l | x) = \sigma(R(x, y_w) - R(x, y_l))

Here,

$P(y_w \succ y_l | x)$ = Probability that $y_w$ is preferred over $y_l$
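The reward model loss and the Bradley-Terry probability can be checked numerically. The following is a minimal sketch in plain Python, assuming the reward scores have already been computed; the function names and the example scores are illustrative and not part of any library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_probability(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the response scored r_w beats the one scored r_l."""
    return math.exp(log_sigmoid(r_w - r_l))


def reward_model_loss(pairs: list[tuple[float, float]]) -> float:
    """Mean of -log sigma(R(x, y_w) - R(x, y_l)) over (r_w, r_l) score pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


def pairwise_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of pairs in which the preferred response gets the higher score."""
    return sum(r_w > r_l for r_w, r_l in pairs) / len(pairs)


# Hypothetical scores (R(x, y_w), R(x, y_l)) for three annotated pairs.
scores = [(2.0, 0.5), (0.3, 0.1), (-1.0, 0.4)]
print("P(y_w beats y_l) for the first pair:", preference_probability(*scores[0]))
print("Reward model loss:", reward_model_loss(scores))
print("Pairwise accuracy:", pairwise_accuracy(scores))
```

Only the difference between the two scores enters the loss, so adding a constant to every reward leaves it unchanged. The third pair, where the dispreferred response scores higher, contributes the largest loss term.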

PPO: Proximal Policy Optimization

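PPO is the RL algorithm most commonly used in RLHF to optimize the policy against the reward model. Its clipped surrogate objective, which is maximized, limits how far a single update can move the policy away from the one that generated the samples.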

PPO Objective

\mathcal{L}_{\text{PPO}} = \mathbb{E}_{t} \left[\min\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \hat{A}_t, \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]

Here,

$\pi_\theta$ = Current policy
$\pi_{\theta_{\text{old}}}$ = Policy before the update (the one that generated the samples)
$\hat{A}_t$ = Estimated advantage at time $t$
$\epsilon$ = Clip parameter (typically 0.2)
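To see what the clipping does, the per-sample term inside the expectation can be evaluated by hand. This is a minimal sketch, assuming log-probabilities of the sampled action under the current and old policies are given; the function name is illustrative.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      epsilon: float = 0.2) -> float:
    """Per-sample PPO term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_theta(a|s) / pi_theta_old(a|s)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + epsilon) * A.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
# Ratio 1.5 with a negative advantage: the unclipped (more pessimistic) term is kept.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0))
```

Because the minimum is taken, clipping removes the incentive to push the ratio beyond the interval in the direction that would increase the objective, while moves that make the objective worse are still penalized in full.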

RLHF Objective with KL Penalty

RLHF Objective

\mathcal{L}_{\text{RLHF}} = \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y|x)} \left[R(x, y) - \beta \cdot D_{\text{KL}}(\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x))\right]

Here,

$R(x, y)$ = Reward for response $y$ to prompt $x$
$\pi_\theta$ = Current policy (being optimized)
$\pi_{\text{ref}}$ = Reference policy (SFT model, frozen)
$\beta$ = KL penalty coefficient
$D_{\text{KL}}$ = KL divergence

This objective is maximized. The KL penalty keeps the policy close to the reference model, which limits reward hacking and mode collapse; a larger $\beta$ enforces a tighter constraint.
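In practice the KL term is usually estimated from the sampled response itself. The sketch below assumes per-token log-probabilities of one sampled response under the policy and the reference model are available; the function name and the numbers are illustrative.

```python
def kl_penalized_reward(reward: float, logprobs_policy: list[float],
                        logprobs_ref: list[float], beta: float = 0.1) -> float:
    """Reward minus beta times a single-sample estimate of KL(pi_theta || pi_ref).

    For a response y sampled from pi_theta, the estimate is
    log pi_theta(y|x) - log pi_ref(y|x), i.e. the sum of per-token differences.
    """
    if len(logprobs_policy) != len(logprobs_ref):
        raise ValueError("Both models must score the same tokens.")
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return reward - beta * kl_estimate


# Hypothetical two-token response that the policy finds more likely than the reference does.
print(kl_penalized_reward(1.0, [-1.0, -0.5], [-1.5, -1.0], beta=0.1))
```

A single-sample estimate can be negative for an individual response even though the true KL divergence is non-negative; it is only correct in expectation over samples from the policy.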

Reward Hacking

Theorem: Reward Hacking

Reward hacking occurs when the policy learns to exploit weaknesses in the reward model to obtain high rewards without actually satisfying human preferences. Formally, the policy finds $y^* = \arg\max_y R(x, y)$ such that $R(x, y^*) \gg R(x, y_{\text{human}})$, where $y_{\text{human}}$ is a response humans would actually prefer, even though humans do not prefer $y^*$.

To mitigate reward hacking: (1) use a larger reward model, (2) train on diverse preference data, (3) apply KL constraints, (4) use an ensemble of reward models, (5) include constitutional AI principles.
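As an illustration of mitigation (4), one possible way to combine an ensemble is to penalize disagreement between its members. This is a sketch of one conservative aggregation rule, not a method prescribed by this guide; the function name, the `penalty` weight, and the scores are assumptions.

```python
import statistics


def conservative_reward(scores: list[float], penalty: float = 1.0) -> float:
    """Mean ensemble score minus `penalty` times the population standard deviation.

    Responses on which the reward models disagree receive a lower reward,
    which makes it harder to exploit the blind spot of any single model.
    """
    return statistics.fmean(scores) - penalty * statistics.pstdev(scores)


# Hypothetical scores from three reward models for two responses with the same mean.
print(conservative_reward([3.0, 3.1, 2.9]))  # models agree
print(conservative_reward([6.0, 0.5, 2.5]))  # models disagree
```

Taking the minimum over the ensemble is a stricter alternative; which rule works better depends on how correlated the reward models' errors are.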

DPO: Direct Preference Optimization

DPO (Rafailov et al., 2023) skips the explicit reward model and the PPO stage, and instead optimizes the policy directly on preference data with a classification-style loss.

DPO Loss

\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]

Here,

$y_w$ = Preferred response
$y_l$ = Dispreferred response
$\pi_\theta$ = Policy being optimized
$\pi_{\text{ref}}$ = Reference policy (SFT model)
$\beta$ = Temperature parameter controlling deviation from reference
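The DPO loss for a single preference pair needs only four sequence log-probabilities. The following is a minimal sketch in plain Python, assuming those log-probabilities (summed over response tokens) are already computed; the function names and values are illustrative.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (x, y_w, y_l) triple.

    logp_*     : log pi_theta(y|x) for the preferred (w) and dispreferred (l) response
    ref_logp_* : the same quantities under the frozen reference policy
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


# At initialization the policy equals the reference, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Raising the preferred response's likelihood relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

The loss depends only on how much the policy has shifted probability toward $y_w$ and away from $y_l$ relative to the reference, not on the absolute log-probabilities.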

DPO vs RLHF Comparison

Aspect	RLHF (PPO)	DPO
Reward model	Required	Not needed
Training stability	Sensitive to hyperparameters (RL)	More stable (supervised-style loss)
Compute cost	High (4 models)	Low (2 models)
Sample efficiency	Low	High
Performance	Strong	Competitive
Implementation	Complex	Simple

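DPO's key insight: the optimal policy under the KL-regularized RLHF objective can be expressed in closed form as a function of the reward. Inverting that relationship expresses the reward in terms of the policy, so the preference loss can be written directly over the policy and no separate reward model has to be trained.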
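For reference, the closed form in question can be stated as follows, in the notation of the RLHF objective. The normalizer $Z(x)$ is a symbol introduced here and does not appear elsewhere in this guide; the result is stated without the derivation.

```latex
\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} R(x, y)\right),
\qquad
Z(x) = \sum_{y} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} R(x, y)\right)

R(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)
```

When the second expression is substituted into the Bradley-Terry model, the $\beta \log Z(x)$ term cancels in the difference $R(x, y_w) - R(x, y_l)$, which leaves exactly the argument of $\sigma$ in the DPO loss.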

Other Alignment Methods

Constitutional AI (CAI)

Definition: Constitutional AI

Constitutional AI uses a set of principles (constitution) to guide the model's self-improvement. The model critiques and revises its own outputs based on the principles, then trains on the improved data.
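The critique-and-revise loop can be sketched as below. This is an illustrative outline only: `generate` stands for any text-in, text-out model call, and the prompt wording, function names, and stub model are assumptions rather than the procedure of any specific system.

```python
from typing import Callable


def critique_and_revise(prompt: str, principles: list[str],
                        generate: Callable[[str], str]) -> dict:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle: {principle}\n\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = generate(
            "Rewrite the response so that it addresses the critique.\n\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
    # The (prompt, revised response) pair becomes a supervised training example.
    return {"prompt": prompt, "response": response}


def stub_generate(text: str) -> str:
    """Placeholder for a real model call; returns a canned string."""
    return f"[model output for a {len(text)}-character input]"


example = critique_and_revise("How do I pick a strong password?",
                              ["Be helpful and harmless."], stub_generate)
print(example["response"])
```

Each principle costs two model calls (one critique, one revision), so the number of calls grows linearly with the size of the constitution.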

RLAIF: AI Feedback

Definition: RLAIF

RLAIF replaces human annotators with an AI system for generating preference data. A larger, more capable model provides feedback, reducing the cost and scaling limitations of human annotation.
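A sketch of how AI feedback can be turned into preference data follows. The `judge` callable stands in for the AI labeler, and the record keys (`prompt`, `chosen`, `rejected`) follow a common preference-dataset layout; both are assumptions, and the toy judge is a placeholder for a real model call.

```python
from typing import Callable


def build_preference_record(prompt: str, response_a: str, response_b: str,
                            judge: Callable[[str, str, str], bool]) -> dict:
    """Ask the judge which response is better and store the pair as chosen/rejected."""
    a_preferred = judge(prompt, response_a, response_b)
    chosen, rejected = (response_a, response_b) if a_preferred else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_judge(prompt: str, response_a: str, response_b: str) -> bool:
    """Stand-in for an AI labeler: returns True if response_a is at least as long."""
    return len(response_a) >= len(response_b)


record = build_preference_record(
    "Explain what a KL penalty does.",
    "It keeps the policy close to the reference model.",
    "No idea.",
    toy_judge,
)
print(record["chosen"])
```

Records in this shape can be used in place of human-labeled pairs for reward model training or DPO; in a real setup the judge's position bias is often reduced by querying it with both response orders.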

Practical Implementation

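```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# NOTE: this follows the legacy TRL PPO API; argument names differ in newer
# releases, so check the documentation for your installed version.

# Configure PPO
config = PPOConfig(
    learning_rate=1.41e-5,
    batch_size=64,
    mini_batch_size=16,
    ppo_epochs=4,
    kl_penalty="kl",
    init_kl_coef=0.2,
    target_kl=6.0,
)

# Load models: the policy (with a value head) and a frozen reference copy
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# Create trainer
ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop
# `dataloader` (batches with a "query" list of prompt strings) and `reward_model`
# (a callable returning a scalar score) are placeholders you must supply.
for batch in dataloader:
    # Tokenize each prompt into a 1-D tensor of token ids
    query_tensors = [tokenizer.encode(q, return_tensors="pt")[0] for q in batch["query"]]

    # Sample responses from the current policy (continuation only, without the prompt)
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=256)
    responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

    # Score each (prompt, response) pair; one scalar tensor per sample
    rewards = [torch.tensor(float(reward_model(q, r))) for q, r in zip(batch["query"], responses)]

    # Run one PPO optimization step on the batch
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```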

DPO Training

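```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# DPO trains a plain causal LM (no value head); the reference is a frozen copy of the SFT model
model = AutoModelForCausalLM.from_pretrained("sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

dpo_config = DPOConfig(
    output_dir="dpo-model",  # assumed checkpoint directory
    beta=0.1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=1024,
    max_prompt_length=512,
)

# `preference_dataset` is a placeholder for a dataset of preference pairs
# (typically with "prompt", "chosen", and "rejected" fields).
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,  # newer TRL releases name this argument `processing_class`
    train_dataset=preference_dataset,
    args=dpo_config,
)

trainer.train()
```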

Practice Exercises

Mathematical: Derive the DPO loss from the RLHF objective. Show that the optimal policy can be expressed in closed form.
Implementation: Train a reward model on the Anthropic HH dataset. Evaluate its accuracy on held-out preference pairs.
Comparison: Compare PPO and DPO on the same task. Which is more stable? Which achieves better final performance?
Research: Investigate reward hacking in RLHF. Design a simple experiment that demonstrates the phenomenon.

Key Takeaways:

Alignment aims to make LLMs behave in accordance with human values
RLHF uses reward modeling + PPO to optimize against human preferences
DPO directly optimizes the policy on preference data, bypassing explicit reward modeling
KL penalties limit reward hacking and mode collapse by keeping the policy close to the reference model
Reward hacking is a fundamental challenge in RLHF
Constitutional AI and RLAIF reduce dependence on human annotation
DPO is typically simpler to implement and more stable to train than PPO

What to Learn Next

-> Constitutional AI: Reducing dependence on human annotation through AI self-alignment.

-> LLM Safety and Red Teaming: Testing and hardening LLMs against adversarial attacks.

-> Fine-Tuning LLMs: Customizing language models for your specific tasks and domains.

-> Instruction Tuning: Teaching models to follow complex multi-step instructions reliably.

-> Pretraining Language Models: Learning language from the internet with CLM, scaling laws, and data curation.

-> Building Production LLM Apps: From prototype to production, deploying LLMs at scale.
