RLHF and Alignment
Reinforcement Learning from Human Feedback (RLHF) trains language models to produce outputs that humans prefer, going beyond what supervised fine-tuning alone can achieve.
The Alignment Problem
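Pre-trained models optimize for next-token prediction, not for being helpful, harmless, or honest. RLHF narrows this gap in stages: a supervised fine-tuned model is further optimized against a reward signal learned from human preference comparisons.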
| Stage | Objective | Data Required |
|---|---|---|
| Pre-training | Next-token prediction | Trillions of tokens |
| SFT | Task completion | Thousands of demonstrations |
| Reward Modeling | Predict human preferences | Paired preference comparisons |
| RL Optimization | Maximize reward with constraints | Prompts |
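To make the "Data Required" column concrete, here is a minimal sketch of what one record might look like for the SFT and reward-modeling stages. The names `Demonstration`, `PreferencePair`, and `collate_preferences` are hypothetical and chosen only for illustration; the batch keys (`prompts`, `chosen`, `rejected`) mirror the ones the reward model's `train_step` reads later in this section.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Demonstration:
    """One SFT example: a prompt and a human-written target response."""
    prompt: str
    response: str


@dataclass
class PreferencePair:
    """One reward-modeling example: two responses ranked by an annotator."""
    prompt: str
    chosen: str    # the response the annotator preferred
    rejected: str  # the response the annotator ranked lower


def collate_preferences(pairs: List[PreferencePair]) -> Dict[str, List[str]]:
    """Group preference pairs into parallel lists, one list per field."""
    return {
        "prompts": [p.prompt for p in pairs],
        "chosen": [p.chosen for p in pairs],
        "rejected": [p.rejected for p in pairs],
    }


# Made-up example data
pairs = [
    PreferencePair("What is 2 + 2?", "2 + 2 equals 4.", "5"),
    PreferencePair("Name a prime number.", "7 is prime.", "9 is prime."),
]
batch = collate_preferences(pairs)
print(batch["prompts"])
```

The RL optimization stage needs only the prompts, because responses are sampled from the policy during training.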
RLHF Pipeline
Reward Modeling
The reward model learns to predict human preferences from pairwise comparisons.
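Definition: Bradley-Terry Reward Model
Given pairs of responses $(y_w, y_l)$ to a prompt $x$, where $y_w$ is preferred over $y_l$, the Bradley-Terry model defines:
$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$
where $r_\phi$ is the reward model and $\sigma$ is the logistic sigmoid.
The reward model is trained with:
$$\mathcal{L}_{\mathrm{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$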
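As a small worked example of the Bradley-Terry formulation, the sketch below assumes the standard form: the probability that the chosen response wins is the sigmoid of the reward difference, and the training loss is the negative log of that probability. The reward values are made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return sigmoid(r_chosen - r_rejected)


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -math.log(preference_probability(r_chosen, r_rejected))


# Hypothetical reward scores for two responses to the same prompt
print(f"P(chosen preferred)  = {preference_probability(2.0, 0.5):.3f}")  # 0.818
print(f"loss (correct order) = {pairwise_loss(2.0, 0.5):.3f}")           # 0.201
print(f"loss (wrong order)   = {pairwise_loss(0.5, 2.0):.3f}")           # 1.701
```

Only the difference between the two rewards matters, so adding a constant to every reward leaves both the probability and the loss unchanged.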
```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class RewardModelTrainer:
    def __init__(self, model_name="meta-llama/Llama-2-7b-hf", max_length=512):
        self.max_length = max_length
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # num_labels=1 adds a single scalar output that serves as the reward
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=1
        )
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        # The model needs the pad id to score padded batches (batch size > 1)
        self.model.config.pad_token_id = self.tokenizer.pad_token_id

    def compute_reward(self, prompt, response):
        """Score a single prompt/response pair (inference only)."""
        text = f"[INST] {prompt} [/INST] {response}"
        inputs = self.tokenizer(
            text, return_tensors="pt", truncation=True, max_length=self.max_length
        )
        with torch.no_grad():
            reward = self.model(**inputs).logits[0, 0]
        return reward.item()

    def _encode(self, prompts, responses):
        texts = [f"[INST] {p} [/INST] {r}" for p, r in zip(prompts, responses)]
        return self.tokenizer(
            texts, padding=True, truncation=True,
            max_length=self.max_length, return_tensors="pt",
        )

    def train_step(self, batch):
        """Return the pairwise loss and the mean chosen/rejected rewards."""
        chosen_inputs = self._encode(batch["prompts"], batch["chosen"])
        rejected_inputs = self._encode(batch["prompts"], batch["rejected"])
        chosen_rewards = self.model(**chosen_inputs).logits[:, 0]
        rejected_rewards = self.model(**rejected_inputs).logits[:, 0]
        # Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)
        loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
        return loss, chosen_rewards.mean().item(), rejected_rewards.mean().item()


# Example training loop.
# preference_dataloader is not defined here; it is assumed to yield dicts with
# "prompts", "chosen", and "rejected" lists of strings.
trainer = RewardModelTrainer()
# The learning rate is illustrative, not a tuned value
optimizer = torch.optim.AdamW(trainer.model.parameters(), lr=1e-5)
trainer.model.train()
for batch in preference_dataloader:
    loss, chosen_r, rejected_r = trainer.train_step(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"Loss: {loss.item():.4f}, Chosen R: {chosen_r:.4f}, Rejected R: {rejected_r:.4f}")
```
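A common sanity check for a trained reward model is pairwise accuracy on held-out comparisons: the fraction of pairs in which the chosen response receives the higher score. The sketch below assumes the rewards have already been computed (for instance with `compute_reward` above); the function name `pairwise_accuracy` and the sample scores are made up for illustration.

```python
from typing import Sequence


def pairwise_accuracy(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Fraction of pairs where the chosen reward exceeds the rejected reward."""
    if len(chosen) != len(rejected):
        raise ValueError("chosen and rejected must have the same length")
    if len(chosen) == 0:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for c, r in zip(chosen, rejected) if c > r)
    return correct / len(chosen)


# Hypothetical held-out scores
chosen_scores = [1.2, 0.3, 2.1, -0.4]
rejected_scores = [0.5, 0.9, 1.0, -1.3]
accuracy = pairwise_accuracy(chosen_scores, rejected_scores)
print(f"Pairwise accuracy: {accuracy:.2f}")  # 0.75
```

An accuracy near 0.5 would suggest the model has learned little beyond chance on these pairs.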
PPO (Proximal Policy Optimization)
PPO is the primary algorithm used for RLHF optimization, balancing reward maximization with policy stability.
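Definition: PPO-Clipped Objective
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$
where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio between the current and old policies, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping parameter (typically 0.2).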
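To see what the clipping does, the sketch below evaluates the per-sample clipped surrogate, assumed here to be the standard PPO form min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), for a few hand-picked ratio and advantage values. The function name and the test cases are illustrative only.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


cases = [
    (1.5, 1.0),   # good action, ratio already large: gain is capped at 1.2
    (0.5, 1.0),   # good action, ratio small: unclipped value 0.5 is kept
    (0.5, -1.0),  # bad action, ratio already small: value is capped at -0.8
    (1.5, -1.0),  # bad action, ratio large: full penalty -1.5 is kept
]
for ratio, advantage in cases:
    value = clipped_surrogate(ratio, advantage)
    print(f"ratio={ratio}, advantage={advantage:+.1f} -> {value:+.2f}")
```

The objective stops rewarding further movement once the ratio leaves the interval [1 - eps, 1 + eps] in the direction the advantage favors, which is what keeps each update close to the old policy.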
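```python
import torch
import torch.nn.functional as F
from dataclasses import dataclass


@dataclass
class PPOConfig:
    kl_coeff: float = 0.1
    clip_range: float = 0.2
    value_loss_coeff: float = 0.5
    gamma: float = 1.0
    gae_lambda: float = 0.95
    learning_rate: float = 1e-5
    ppo_epochs: int = 4
    mini_batch_size: int = 4


class PPOTrainer:
    # generate_response, get_logprobs, get_ref_logprobs, get_reward, and get_value
    # are model-specific hooks that are not defined here. This sketch assumes that
    # get_logprobs, get_ref_logprobs, and get_value return one value per response
    # token (shape [T]), that get_reward returns a scalar score, and that the
    # value head's parameters belong to policy_model.

    def __init__(self, policy_model, ref_model, reward_model, tokenizer, config=None):
        self.config = config or PPOConfig()
        self.policy = policy_model
        self.ref_model = ref_model
        self.reward_model = reward_model
        self.tokenizer = tokenizer
        self.optimizer = torch.optim.AdamW(
            self.policy.parameters(), lr=self.config.learning_rate
        )

    def compute_kl_penalty(self, logprobs, ref_logprobs):
        """Per-token log-ratio, a sample estimate of KL(policy || reference)."""
        return logprobs - ref_logprobs

    def compute_advantages(self, rewards, values):
        """Generalized Advantage Estimation (GAE) over one response's tokens."""
        advantages = torch.zeros_like(rewards)
        last_gae = 0.0
        for t in reversed(range(len(rewards))):
            # The episode ends after the last token, so its next value is zero
            next_value = values[t + 1] if t < len(rewards) - 1 else 0.0
            delta = rewards[t] + self.config.gamma * next_value - values[t]
            last_gae = delta + self.config.gamma * self.config.gae_lambda * last_gae
            advantages[t] = last_gae
        returns = advantages + values
        return advantages, returns

    def ppo_step(self, prompts):
        """Single PPO optimization step."""
        rollouts = []
        # Collect experience with the current policy; no gradients are needed
        with torch.no_grad():
            for prompt in prompts:
                response = self.generate_response(prompt)
                old_logprobs = self.get_logprobs(prompt, response)
                ref_logprobs = self.get_ref_logprobs(prompt, response)
                values = self.get_value(prompt, response)
                score = self.get_reward(prompt, response)
                kl = self.compute_kl_penalty(old_logprobs, ref_logprobs)
                # KL penalty on every token; reward-model score on the last token
                rewards = -self.config.kl_coeff * kl
                rewards[-1] += score
                advantages, returns = self.compute_advantages(rewards, values)
                rollouts.append({
                    "prompt": prompt, "response": response, "kl": kl,
                    "old_logprobs": old_logprobs, "rewards": rewards,
                    "advantages": advantages, "returns": returns,
                })

        # Normalize advantages across all collected tokens
        all_advantages = torch.cat([r["advantages"] for r in rollouts])
        adv_mean, adv_std = all_advantages.mean(), all_advantages.std()

        # PPO updates (one response at a time; mini-batching is omitted for brevity)
        for _ in range(self.config.ppo_epochs):
            for r in rollouts:
                advantages = (r["advantages"] - adv_mean) / (adv_std + 1e-8)
                # Re-evaluate the response under the updated policy, with gradients
                new_logprobs = self.get_logprobs(r["prompt"], r["response"])
                predicted_values = self.get_value(r["prompt"], r["response"])
                # Policy loss
                ratio = torch.exp(new_logprobs - r["old_logprobs"])
                surr1 = ratio * advantages
                surr2 = torch.clamp(
                    ratio, 1 - self.config.clip_range, 1 + self.config.clip_range
                ) * advantages
                policy_loss = -torch.min(surr1, surr2).mean()
                # Value loss
                value_loss = F.mse_loss(predicted_values, r["returns"])
                # Total loss
                total_loss = policy_loss + self.config.value_loss_coeff * value_loss
                total_loss.backward()
                self.optimizer.step()
                self.optimizer.zero_grad()

        # Losses are from the final update; reward and KL are averaged over rollouts
        return {
            "policy_loss": policy_loss.item(),
            "value_loss": value_loss.item(),
            "mean_reward": torch.stack([r["rewards"].sum() for r in rollouts]).mean().item(),
            "mean_kl": torch.cat([r["kl"] for r in rollouts]).mean().item(),
        }
```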
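The backward recursion in `compute_advantages` is easier to follow on concrete numbers. The sketch below re-implements the same GAE recursion on plain Python lists, using the `gamma` and `gae_lambda` defaults from `PPOConfig`; the rewards and values are made up, with a single reward on the final token as is typical when a reward model scores a whole response.

```python
from typing import List, Tuple


def gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Generalized Advantage Estimation for a single trajectory."""
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        # No value is bootstrapped after the final step
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


rewards = [0.0, 0.0, 1.0]  # reward arrives only on the last token
values = [0.5, 0.5, 0.5]   # hypothetical value-head predictions
advantages, returns = gae(rewards, values)
print([round(a, 3) for a in advantages])  # [0.451, 0.475, 0.5]
print([round(r, 3) for r in returns])     # [0.951, 0.975, 1.0]
```

The final-token surprise of 0.5 is propagated backward and decays by `gamma * lam` at each earlier step, so tokens closer to the reward receive more credit.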
DPO (Direct Preference Optimization)
DPO eliminates the need for a separate reward model by directly optimizing the policy from preferences.
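Definition: DPO Loss
DPO reparameterizes the reward as:
$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$
where $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\beta$ controls how strongly the policy is tied to the reference, and $Z(x)$ is a partition function that depends only on the prompt and cancels when two responses to the same prompt are compared.
The DPO loss is:
$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$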
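A quick numeric check can make the DPO loss less abstract. The sketch below assumes the standard form, in which the loss is the negative log-sigmoid of beta times the difference between the chosen and rejected log-ratios (policy log-probability minus reference log-probability). The sequence log-probabilities are invented for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for a single preference pair, from sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# The policy raised the chosen response by 2 nats and lowered the rejected one by 1
loss = dpo_loss(-10.0, -15.0, -12.0, -14.0)
print(f"DPO loss: {loss:.4f}")  # 0.5544

# When the policy equals the reference, the margin is 0 and the loss is log(2)
print(f"Loss at initialization: {dpo_loss(-12.0, -14.0, -12.0, -14.0):.4f}")  # 0.6931
```

Training starts at log(2) because the policy is usually initialized from the reference model, and the loss falls as the policy shifts probability toward chosen responses relative to the reference.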
```python
import torch
import torch.nn.functional as F


class DPOTrainer:
    def __init__(self, policy_model, ref_model, tokenizer, beta=0.1):
        self.policy = policy_model
        self.ref_model = ref_model
        self.tokenizer = tokenizer
        self.beta = beta

    def get_logprobs(self, model, input_ids, labels):
        """Sum the log probabilities of each sequence's non-padding tokens."""
        # Assumes right padding: with causal attention, padded positions do not
        # affect the logits of earlier tokens
        outputs = model(input_ids=input_ids)
        # Logits at position t predict the token at position t + 1
        logits = outputs.logits[:, :-1, :]
        labels = labels[:, 1:]
        log_probs = F.log_softmax(logits, dim=-1)
        per_token_logprobs = torch.gather(
            log_probs, 2, labels.unsqueeze(-1)
        ).squeeze(-1)
        # Mask padding
        mask = (labels != self.tokenizer.pad_token_id).float()
        seq_logprobs = (per_token_logprobs * mask).sum(dim=-1)
        return seq_logprobs

    def dpo_loss(self, batch):
        """Compute DPO loss from preference pairs."""
        # Each sequence is assumed to hold the prompt followed by the response.
        # Prompt tokens are included in the sums, but for a shared prompt their
        # contribution cancels in the chosen-minus-rejected difference.
        chosen_ids = batch["chosen_input_ids"]
        rejected_ids = batch["rejected_input_ids"]
        # The reference model is frozen, so no gradients are tracked for it
        with torch.no_grad():
            ref_chosen_logps = self.get_logprobs(self.ref_model, chosen_ids, chosen_ids)
            ref_rejected_logps = self.get_logprobs(self.ref_model, rejected_ids, rejected_ids)
        policy_chosen_logps = self.get_logprobs(self.policy, chosen_ids, chosen_ids)
        policy_rejected_logps = self.get_logprobs(self.policy, rejected_ids, rejected_ids)
        chosen_logratios = policy_chosen_logps - ref_chosen_logps
        rejected_logratios = policy_rejected_logps - ref_rejected_logps
        logits = self.beta * (chosen_logratios - rejected_logratios)
        loss = -F.logsigmoid(logits).mean()
        # Implicit rewards are reported as metrics only, so detach them
        chosen_rewards = (self.beta * chosen_logratios).detach()
        rejected_rewards = (self.beta * rejected_logratios).detach()
        return loss, {
            "chosen_reward": chosen_rewards.mean().item(),
            "rejected_reward": rejected_rewards.mean().item(),
            "reward_margin": (chosen_rewards - rejected_rewards).mean().item(),
            "accuracy": (logits > 0).float().mean().item(),
        }
```
Comparison of Alignment Methods
| Method | Reward Model | Training Stability | Compute Cost | Sample Efficiency |
|---|---|---|---|---|
| PPO | Required | Moderate | High | Low |
| DPO | Not required | High | Moderate | High |
| RLHF (full) | Required | Low | Very high | Low |
| KTO | Not required | High | Moderate | Moderate |
| IPO | Not required | High | Moderate | High |
Reward Hacking
Models may exploit reward model weaknesses without producing genuinely better outputs.
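Definition: Reward Hacking Problem
The agent learns to maximize apparent reward rather than true quality:
$$\arg\max_{\pi}\ \mathbb{E}_{y \sim \pi}\big[r_\phi(x, y)\big] \;\neq\; \arg\max_{\pi}\ \mathbb{E}_{y \sim \pi}\big[r^{*}(x, y)\big]$$
where $r_\phi$ is the learned reward model and $r^{*}$ is the true (unobserved) quality that human raters care about.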
Mitigation strategies include KL penalties, reward ensembles, length penalties, and periodic reward model retraining.
| Technique | Description | Effectiveness |
|---|---|---|
| KL penalty | Penalize deviation from reference policy | Moderate |
| Reward ensemble | Average multiple reward models | High |
| Length penalty | Penalize unnecessarily long responses | High |
| Periodic retraining | Retrain reward model on new data | High |
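Three of these techniques can be combined into a single shaped reward. The sketch below is one possible combination, not a prescribed recipe: it averages an ensemble of reward scores, subtracts a KL penalty estimated from per-token log-probabilities, and subtracts a penalty for tokens beyond a target length. The function name, `length_coeff`, `target_length`, and all numeric values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def shaped_reward(
    ensemble_scores: Sequence[float],
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    kl_coeff: float = 0.1,
    length_coeff: float = 0.01,
    target_length: int = 200,
) -> float:
    """Combine ensemble reward, KL penalty, and length penalty for one response."""
    # Reward ensemble: average the scores of several reward models
    reward = mean(ensemble_scores)
    # KL penalty: summed per-token log-ratio, a sample estimate of the KL
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    # Length penalty: only tokens beyond the target length are penalized
    excess_tokens = max(0, len(policy_logprobs) - target_length)
    return reward - kl_coeff * kl_estimate - length_coeff * excess_tokens


# Made-up inputs: three reward models, a three-token response, target length 2
value = shaped_reward(
    ensemble_scores=[1.0, 2.0, 3.0],
    policy_logprobs=[-1.0, -2.0, -1.5],
    ref_logprobs=[-1.5, -2.5, -1.5],
    target_length=2,
)
print(f"Shaped reward: {value:.2f}")  # 1.89
```

Here the ensemble mean is 2.0, the KL estimate is 1.0, and one token exceeds the target, giving 2.0 - 0.1 - 0.01 = 1.89.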
Best Practices
- Start with strong SFT - RLHF builds on supervised fine-tuning rather than replacing it
- Collect diverse preferences - Avoid systematic labeling biases
- Monitor KL divergence - A rapidly growing KL from the reference policy is an early sign of policy collapse or reward hacking
- Use a reference model - A frozen copy of the starting policy provides the baseline for the KL penalty and keeps training stable
- Evaluate with humans - Automatic metrics alone are insufficient
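One way to act on the monitored KL is to adjust the penalty coefficient automatically. The sketch below is a simple proportional controller of the kind used in some RLHF implementations: it raises the coefficient when the observed KL is above a target and lowers it when below. The class name, `target_kl`, `horizon`, and the clipping bound of 0.2 are assumptions for illustration, not values taken from this document.

```python
class AdaptiveKLController:
    """Proportional controller that nudges the KL coefficient toward a target KL."""

    def __init__(self, init_coeff: float = 0.1, target_kl: float = 6.0, horizon: int = 10000):
        self.coeff = init_coeff
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Relative error, clipped so a single noisy batch cannot move the coefficient far
        error = max(-0.2, min(observed_kl / self.target_kl - 1.0, 0.2))
        self.coeff *= 1.0 + error * n_steps / self.horizon
        return self.coeff


controller = AdaptiveKLController()
# Observed KL is twice the target, so the coefficient increases slightly
print(f"{controller.update(observed_kl=12.0, n_steps=256):.6f}")  # 0.100512
# Observed KL is half the target, so the coefficient decreases slightly
print(f"{controller.update(observed_kl=3.0, n_steps=256):.6f}")
```

With a fixed coefficient the same monitoring still applies: log the mean KL per batch and investigate if it rises sharply.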
Key Takeaways
- PPO is the most established RLHF algorithm but requires significant compute
- DPO provides a simpler, more stable alternative without a separate reward model
- Reward modeling quality is critical for successful RLHF
- KL regularization keeps the policy close to the reference model, which helps prevent policy collapse and limits reward hacking
- Human evaluation remains essential for measuring true alignment