RLHF and Alignment

RLHF and Alignment: Making LLMs Safe and Helpful

Alignment ensures that LLMs behave in accordance with human values and intentions. This guide covers the RLHF pipeline, reward modeling, PPO, DPO, and theoretical foundations for building safe and helpful AI systems.

  • RLHF Pipeline: Supervised fine-tuning followed by reward modeling and PPO
  • DPO: Direct preference optimization bypasses reward modeling for simpler alignment
  • Constitutional AI: Reduces dependence on human annotation through principles

Alignment is not a feature; it is a responsibility.

Alignment ensures that LLMs behave in accordance with human values and intentions. This tutorial covers the reinforcement learning from human feedback (RLHF) pipeline, reward modeling, PPO, DPO, and theoretical foundations.

Definition: Alignment

Alignment is the process of ensuring that an AI system's behavior matches human values, intentions, and preferences. For LLMs, alignment means producing helpful, harmless, and honest outputs that satisfy user intent.

The Alignment Pipeline

The standard alignment pipeline consists of three stages:

  1. Pre-training: Learn general language representations from large corpora
  2. Supervised Fine-Tuning (SFT): Fine-tune on high-quality instruction-response pairs
  3. RLHF/DPO: Align with human preferences using reward modeling or direct optimization

For a detailed treatment of reinforcement learning fundamentals, see our module on Reinforcement Learning.

Reward Modeling

What is a Reward Model?

A reward model is a neural network trained to predict human preferences. It scores each response to a prompt, and comparing the scores of two responses predicts which one a human would prefer.

Definition: Reward Model

A reward model R(x, y) assigns a scalar score to a (prompt, response) pair, representing how well the response satisfies human preferences. It is trained on pairwise comparison data from human annotators.

Reward Model Training

Reward Model Loss

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(R(x, y_w) - R(x, y_l)\right)\right]$$

Here,

  • $x$ = Prompt/input
  • $y_w$ = Preferred (winning) response
  • $y_l$ = Dispreferred (losing) response
  • $\sigma$ = Sigmoid function
  • $R$ = Reward model

This loss is the negative log-likelihood of the Bradley-Terry model for pairwise comparisons:

Bradley-Terry Preference Model

$$P(y_w \succ y_l \mid x) = \sigma\left(R(x, y_w) - R(x, y_l)\right)$$

Here,

  • $P(y_w \succ y_l \mid x)$ = Probability that $y_w$ is preferred over $y_l$
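As a rough illustration of the two formulas above, the sketch below computes the Bradley-Terry probability and the reward model loss from plain floating-point scores. The function names and the example scores are made up for this sketch; a real reward model would produce the scores from a neural network and train with an autodiff framework.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the response scored r_w is preferred."""
    return sigmoid(r_w - r_l)


def reward_model_loss(pairs: list[tuple[float, float]]) -> float:
    """Mean of -log sigma(R(x, y_w) - R(x, y_l)) over (r_w, r_l) score pairs."""
    total = sum(-math.log(preference_probability(r_w, r_l)) for r_w, r_l in pairs)
    return total / len(pairs)


# Hypothetical reward scores for three annotated comparisons
pairs = [(2.0, 0.5), (1.0, 1.0), (0.2, 1.4)]
loss = reward_model_loss(pairs)
```

The loss shrinks as the model scores preferred responses higher than dispreferred ones, and equal scores give a probability of exactly 0.5.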

PPO: Proximal Policy Optimization

PPO is the standard RL algorithm used in RLHF to optimize the policy against the reward model.

PPO Objective

$$\mathcal{L}_{\text{PPO}} = \mathbb{E}_{t} \left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t,\ \text{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]$$

Here,

  • $\pi_\theta$ = Current policy
  • $\pi_{\theta_{\text{old}}}$ = Previous policy
  • $\hat{A}_t$ = Estimated advantage at time $t$
  • $\epsilon$ = Clip parameter (typically 0.2)
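To make the clipping behavior concrete, here is a minimal sketch of the per-sample clipped term and its batch average. It assumes log-probabilities and advantages are already available as plain floats; the function names are illustrative and not taken from any library.

```python
import math


def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(logp_new, logp_old, advantages, epsilon: float = 0.2) -> float:
    """Average clipped surrogate over a batch of sampled actions."""
    terms = [
        clipped_surrogate(math.exp(new - old), adv, epsilon)
        for new, old, adv in zip(logp_new, logp_old, advantages)
    ]
    return sum(terms) / len(terms)


# With a positive advantage, gains from pushing the ratio above 1 + eps are capped
capped = clipped_surrogate(ratio=1.5, advantage=1.0)
# With a negative advantage, the more pessimistic (unclipped) term is kept
pessimistic = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

Taking the minimum of the clipped and unclipped terms means the objective never rewards moving the ratio far from 1, which is what keeps each policy update small.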

RLHF Objective with KL Penalty

RLHF Objective
$$\mathcal{L}_{\text{RLHF}} = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)} \left[R(x, y) - \beta \cdot D_{\text{KL}}\left(\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\right)\right]$$

Here,

  • $R(x, y)$ = Reward for response $y$ to prompt $x$
  • $\pi_\theta$ = Current policy (being optimized)
  • $\pi_{\text{ref}}$ = Reference policy (SFT model, frozen)
  • $\beta$ = KL penalty coefficient
  • $D_{\text{KL}}$ = KL divergence

The policy is trained to maximize this objective. The KL penalty keeps the policy from drifting too far from the reference model, which limits reward hacking and mode collapse.
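One common way to apply the penalty in practice is per sampled response: for a response drawn from the current policy, the log-probability ratio between policy and reference is a single-sample estimate of the KL term. The sketch below assumes per-token log-probabilities are already available as lists of floats; the function name and the example numbers are hypothetical.

```python
def kl_penalized_reward(
    reward: float,
    policy_logprobs: list[float],
    ref_logprobs: list[float],
    beta: float = 0.1,
) -> float:
    """R(x, y) - beta * [log pi_theta(y|x) - log pi_ref(y|x)] for one sampled response."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference must score the same tokens")
    # Sequence log-ratio is the sum of per-token log-ratios
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * log_ratio


# Hypothetical two-token response that the policy finds more likely than the reference does
shaped = kl_penalized_reward(1.0, [-0.1, -0.2], [-0.5, -0.6], beta=0.1)
```

A larger `beta` pulls the policy harder toward the reference model, trading reward for stability.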

Reward Hacking

Theorem: Reward Hacking

Reward hacking occurs when the policy learns to exploit weaknesses in the reward model to obtain high rewards without actually satisfying human preferences. Formally, the policy finds $y^* = \arg\max_y R(x, y)$ such that $R(x, y^*) \gg R(x, y_{\text{human}})$, where $y_{\text{human}}$ denotes a response that humans genuinely prefer, even though humans do not actually prefer $y^*$.

To mitigate reward hacking: (1) use a larger reward model, (2) train on diverse preference data, (3) apply KL constraints, (4) use an ensemble of reward models, and (5) include constitutional AI principles.
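As one possible reading of the ensemble idea in item (4), the sketch below combines scores from several reward models and subtracts a penalty when they disagree. The mean-minus-standard-deviation rule and the `penalty` parameter are assumptions chosen for illustration, not a method prescribed by this lesson.

```python
import statistics


def ensemble_reward(scores: list[float], penalty: float = 1.0) -> float:
    """Mean ensemble score minus a penalty proportional to disagreement."""
    # Population standard deviation measures how much the models disagree
    return statistics.fmean(scores) - penalty * statistics.pstdev(scores)


# Hypothetical scores from three reward models for the same response
agreed = ensemble_reward([1.0, 1.0, 1.0])    # all models agree
disputed = ensemble_reward([3.0, 0.0, 0.0])  # one model is an outlier
```

Both examples have the same mean score, but the disputed response ends up with a lower reward, so a policy gains less from exploiting a quirk of a single reward model.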

DPO: Direct Preference Optimization

DPO (Rafailov et al., 2023) bypasses explicit reward modeling and PPO, optimizing the policy directly on preference data.

DPO Loss
$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Here,

  • $y_w$ = Preferred response
  • $y_l$ = Dispreferred response
  • $\pi_\theta$ = Policy being optimized
  • $\pi_{\text{ref}}$ = Reference policy (SFT model)
  • $\beta$ = Temperature parameter controlling deviation from reference
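The DPO loss for a single preference pair can be sketched with nothing more than four sequence log-probabilities. The function and argument names below are illustrative; in practice the log-probabilities come from forward passes of the policy and the frozen reference model.

```python
import math


def dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """-log sigma(beta * [(log-ratio of y_w) - (log-ratio of y_l)]) for one pair."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigma(m) equals log(1 + exp(-m)); branch to avoid overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, both log-ratios are zero
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the preferred response's log-probability lowers the loss
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

At initialization the policy is a copy of the reference, so the loss starts at log 2 and decreases as the policy shifts probability toward preferred responses.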

DPO vs RLHF Comparison

| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required | Not needed |
| Training stability | Unstable (RL) | Stable (supervised) |
| Compute cost | High (4 models) | Low (2 models) |
| Sample efficiency | Low | High |
| Performance | Strong | Competitive |
| Implementation | Complex | Simple |

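DPO's key insight: the optimal policy under the KL-regularized RLHF objective has a closed form in terms of the reward. Inverting that relationship expresses the reward through the policy itself, so the Bradley-Terry preference loss can be optimized directly on the policy without training an explicit reward model.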
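A brief sketch of that argument, using the symbols from the RLHF objective and the Bradley-Terry model in this lesson. The partition function $Z(x)$ is notation introduced here for the sketch. The policy that maximizes the KL-regularized objective, and the reward recovered by solving for $R$, are:

$$
\begin{aligned}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} R(x, y)\right), \qquad Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} R(x, y)\right) \\
R(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \\
R(x, y_w) - R(x, y_l) &= \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\end{aligned}
$$

The $\beta \log Z(x)$ term depends only on the prompt, so it cancels in the difference. Substituting the last line into the Bradley-Terry probability and replacing $\pi^*$ with the trainable $\pi_\theta$ yields the DPO loss.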

Other Alignment Methods

Constitutional AI (CAI)

Definition: Constitutional AI

Constitutional AI uses a set of principles (constitution) to guide the model's self-improvement. The model critiques and revises its own outputs based on the principles, then trains on the improved data.
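The critique-and-revise loop described above can be sketched as a plain function over three callables. Everything named here (`constitutional_revision`, the toy stand-in functions, the example principles) is hypothetical; in a real system each callable would be a prompt sent to the language model.

```python
from typing import Callable


def constitutional_revision(
    prompt: str,
    principles: list[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str, str], str],
    revise: Callable[[str, str, str], str],
) -> tuple[str, str]:
    """Critique and revise a draft once per principle; return (prompt, final response)."""
    response = generate(prompt)
    for principle in principles:
        feedback = critique(prompt, response, principle)
        response = revise(prompt, response, feedback)
    return prompt, response


# Toy stand-ins so the sketch runs without a model
def toy_generate(prompt: str) -> str:
    return f"draft answer to: {prompt}"


def toy_critique(prompt: str, response: str, principle: str) -> str:
    return f"check against '{principle}'"


def toy_revise(prompt: str, response: str, feedback: str) -> str:
    return f"{response} [revised: {feedback}]"


example = constitutional_revision(
    "How do I reset a password?",
    ["be helpful", "be harmless"],
    toy_generate,
    toy_critique,
    toy_revise,
)
```

The returned (prompt, revised response) pairs are the improved data that the model is then fine-tuned on.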

RLAIF: AI Feedback

Definition: RLAIF

RLAIF replaces human annotators with an AI system for generating preference data. A larger, more capable model provides feedback, reducing the cost and scaling limitations of human annotation.
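A minimal sketch of how AI feedback might be turned into preference data: a judge callable picks the better of two responses, and the result is stored as a chosen/rejected record. The `judge` interface, the record keys, and the length-based toy judge are assumptions for illustration only; a real judge would be a stronger language model prompted to compare the two responses.

```python
from typing import Callable


def label_preference(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str, str, str], str],
) -> dict[str, str]:
    """Ask an AI judge which response is better and return a preference record."""
    choice = judge(prompt, response_a, response_b)
    if choice not in ("a", "b"):
        raise ValueError(f"judge must return 'a' or 'b', got {choice!r}")
    if choice == "a":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Toy judge that simply prefers the longer response (a stand-in for a model call)
def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    return "a" if len(response_a) >= len(response_b) else "b"


record = label_preference(
    "Explain KL divergence.",
    "It measures things.",
    "It measures how one probability distribution differs from another.",
    toy_judge,
)
```

Records in this shape can feed either reward model training or DPO, since both consume (prompt, preferred, dispreferred) triples.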

Practical Implementation

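```python
import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

# Note: this uses the legacy TRL PPOTrainer interface (generate/step);
# newer TRL releases changed it, so check the version you have installed.

# Configure PPO
config = PPOConfig(
    learning_rate=1.41e-5,
    batch_size=64,
    mini_batch_size=16,
    ppo_epochs=4,
    kl_penalty="kl",
    init_kl_coef=0.2,
    target_kl=6.0,
)

# Load models: the policy being trained and a reference copy of the SFT model
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# Create trainer
ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop
# `dataloader` (batches with a "query" list of strings) and `reward_model`
# (a callable returning a numeric score) are assumed to be defined elsewhere.
for batch in dataloader:
    query_tensors = [tokenizer.encode(q, return_tensors="pt")[0] for q in batch["query"]]

    # Generate responses only (no prompt tokens), then decode them for scoring
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=256)
    responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]

    # Score each (query, response) pair; step() expects a list of scalar tensors
    rewards = [torch.tensor(float(reward_model(q, r))) for q, r in zip(batch["query"], responses)]

    # Run one PPO update on this batch
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```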

DPO Training

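```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

# DPO uses plain causal LMs (no value head): the policy and a frozen reference
model = AutoModelForCausalLM.from_pretrained("sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

dpo_config = DPOConfig(
    output_dir="dpo-model",  # assumed path; older transformers versions require one
    beta=0.1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=1024,
    max_prompt_length=512,
)

# `preference_dataset` is assumed to be defined elsewhere, with
# "prompt", "chosen", and "rejected" columns.
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,  # some newer TRL releases rename this argument
    train_dataset=preference_dataset,
    args=dpo_config,
)

trainer.train()
```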

Practice Exercises

  1. Mathematical: Derive the DPO loss from the RLHF objective. Show that the optimal policy can be expressed in closed form.
  2. Implementation: Train a reward model on the Anthropic HH dataset. Evaluate its accuracy on held-out preference pairs.
  3. Comparison: Compare PPO and DPO on the same task. Which is more stable? Which achieves better final performance?
  4. Research: Investigate reward hacking in RLHF. Design a simple experiment that demonstrates the phenomenon.

Key Takeaways:

  • Alignment ensures LLMs behave in accordance with human values
  • RLHF uses reward modeling + PPO to optimize against human preferences
  • DPO directly optimizes the policy on preference data, bypassing reward modeling
  • KL penalties limit reward hacking and mode collapse by keeping the policy close to the reference model
  • Reward hacking is a fundamental challenge in RLHF
  • Constitutional AI and RLAIF reduce dependence on human annotation
  • DPO is simpler to implement and typically more stable to train than PPO
