RLHF and Alignment

Reinforcement Learning from Human Feedback (RLHF) trains language models to produce outputs that humans prefer, going beyond what supervised fine-tuning alone can achieve.

The Alignment Problem

Pre-trained models optimize for next-token prediction, not for being helpful, harmless, or honest. RLHF narrows this gap by steering model behavior toward responses that human raters prefer.

Stage	Objective	Data Required
Pre-training	Next-token prediction	Trillions of tokens
SFT	Task completion	Thousands of demonstrations
Reward Modeling	Predict human preferences	Paired preference comparisons
RL Optimization	Maximize reward with constraints	Prompts

RLHF Pipeline

Reward Modeling

The reward model learns to predict human preferences from pairwise comparisons.

Definition: Bradley-Terry Reward Model

Given a prompt $x$ and a pair of responses $(y_w, y_l)$ where $y_w$ is preferred over $y_l$, the Bradley-Terry model defines the preference probability in terms of a scalar reward $r_\phi$:

P(y_w \succ y_l | x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))

The reward model is trained with:

\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]
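To make the two formulas above concrete, here is a minimal sketch that evaluates them on a few hand-picked reward values using only the standard library. The function names and the example rewards are illustrative assumptions, not part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function sigma(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return sigmoid(r_chosen - r_rejected)

def reward_model_loss(pairs) -> float:
    """Mean negative log-likelihood over (r_chosen, r_rejected) pairs."""
    total = sum(math.log(preference_probability(rw, rl)) for rw, rl in pairs)
    return -total / len(pairs)

# Hypothetical reward values for three preference pairs
pairs = [(2.0, 0.5), (0.1, 0.3), (1.0, 1.0)]
for rw, rl in pairs:
    p = preference_probability(rw, rl)
    print(f"margin={rw - rl:+.2f}  P(chosen preferred)={p:.3f}")
print(f"loss={reward_model_loss(pairs):.4f}")
```

Only the reward difference matters: equal rewards give a probability of 0.5, and a pair ranked in the wrong order (negative margin) contributes the largest loss term.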

import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class RewardModelTrainer:
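    def __init__(self, model_name="meta-llama/Llama-2-7b-hf"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=1
        )
        # Llama tokenizers ship without a pad token; reuse EOS for padding
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        # The classification head needs pad_token_id to find the last real token in padded batches
        self.model.config.pad_token_id = self.tokenizer.pad_token_id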

    def compute_reward(self, prompt, response):
        text = f"[INST] {prompt} [/INST] {response}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            reward = self.model(**inputs).logits[0, 0]
        return reward.item()

    def train_step(self, batch):
        chosen_texts = [
            f"[INST] {p} [/INST] {c}" for p, c in zip(batch["prompts"], batch["chosen"])
        ]
        rejected_texts = [
            f"[INST] {p} [/INST] {r}" for p, r in zip(batch["prompts"], batch["rejected"])
        ]

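        # Use the same max_length as compute_reward so training and scoring see the same truncation
        chosen_inputs = self.tokenizer(
            chosen_texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
        )
        rejected_inputs = self.tokenizer(
            rejected_texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
        )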

        chosen_rewards = self.model(**chosen_inputs).logits[:, 0]
        rejected_rewards = self.model(**rejected_inputs).logits[:, 0]

        loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
        return loss, chosen_rewards.mean().item(), rejected_rewards.mean().item()

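# Example training loop
# Assumes `preference_dataloader` yields dicts with "prompts", "chosen", and "rejected" lists
trainer = RewardModelTrainer()
optimizer = torch.optim.AdamW(trainer.model.parameters(), lr=1e-5)  # learning rate is illustrative
for batch in preference_dataloader:
    loss, chosen_r, rejected_r = trainer.train_step(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"Loss: {loss.item():.4f}, Chosen R: {chosen_r:.4f}, Rejected R: {rejected_r:.4f}")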

PPO (Proximal Policy Optimization)

PPO is the primary algorithm used for RLHF optimization, balancing reward maximization with policy stability.

Definition: PPO-Clipped Objective

\mathcal{L}^{\text{PPO}}(\theta) = \mathbb{E}_t \left[\min\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \hat{A}_t, \ \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]

where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping parameter (typically 0.2).
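The effect of clipping is easiest to see on single samples. The sketch below evaluates the term inside the expectation for one probability ratio and one advantage; the function name is a hypothetical helper, and the inputs are made-up values.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical (ratio, advantage) combinations
for ratio, advantage in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    value = clipped_surrogate(ratio, advantage)
    print(f"ratio={ratio:.1f}  advantage={advantage:+.1f}  objective={value:+.2f}")
```

With a positive advantage the objective stops growing once the ratio exceeds $1+\epsilon$, and with a negative advantage it stops growing once the ratio falls below $1-\epsilon$. In both cases the policy gains nothing from moving further away from the old policy in a single update.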

import torch
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class PPOConfig:
    kl_coeff: float = 0.1
    clip_range: float = 0.2
    value_loss_coeff: float = 0.5
    gamma: float = 1.0
    gae_lambda: float = 0.95
    learning_rate: float = 1e-5
    ppo_epochs: int = 4
    mini_batch_size: int = 4

class PPOTrainer:
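    def __init__(self, policy_model, ref_model, reward_model, tokenizer, config=None):
        self.config = config or PPOConfig()
        self.policy = policy_model
        self.ref_model = ref_model
        self.reward_model = reward_model
        self.tokenizer = tokenizer
        # Optimizer used by ppo_step; assumes any value head shares parameters with the policy model
        self.optimizer = torch.optim.AdamW(
            self.policy.parameters(), lr=self.config.learning_rate
        )

    def compute_kl_penalty(self, logprobs, ref_logprobs):
        """Per-token log-ratio; its mean over sampled tokens estimates KL(policy || reference)."""
        return logprobs - ref_logprobs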

    def compute_advantages(self, rewards, values):
        """Generalized Advantage Estimation (GAE)."""
        advantages = torch.zeros_like(rewards)
        last_gae = 0

        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]

            delta = rewards[t] + self.config.gamma * next_value - values[t]
            advantages[t] = last_gae = delta + (
                self.config.gamma * self.config.gae_lambda * last_gae
            )

        returns = advantages + values
        return advantages, returns

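    def ppo_step(self, prompts):
        """Single PPO optimization step.

        Assumes generate_response, get_logprobs, get_ref_logprobs, get_reward,
        and get_value are implemented elsewhere on this class. The log-prob and
        value helpers are assumed to return 1-D tensors with one entry per
        response token; get_reward is assumed to return a scalar score.
        """
        rollouts = []
        all_rewards = []
        all_kls = []

        # Collect experience without gradients; these log-probs define the "old" policy
        with torch.no_grad():
            for prompt in prompts:
                response = self.generate_response(prompt)
                old_logprobs = self.get_logprobs(prompt, response)
                ref_logprobs = self.get_ref_logprobs(prompt, response)
                values = self.get_value(prompt, response)
                score = float(self.get_reward(prompt, response))

                # KL penalty at every token; the reward model score is added at the last token
                kl = self.compute_kl_penalty(old_logprobs, ref_logprobs)
                rewards = -self.config.kl_coeff * kl
                rewards[-1] += score

                # Advantages are computed per response, over its own token sequence
                advantages, returns = self.compute_advantages(rewards, values)
                rollouts.append((prompt, response, old_logprobs, advantages, returns))
                all_rewards.append(rewards.sum())
                all_kls.append(kl.mean())

        # Normalize advantages across all response tokens in the batch
        flat_advantages = torch.cat([r[3] for r in rollouts])
        adv_mean, adv_std = flat_advantages.mean(), flat_advantages.std()

        # PPO updates: re-evaluate every rollout with gradients enabled
        for _ in range(self.config.ppo_epochs):
            policy_losses = []
            value_losses = []
            for prompt, response, old_logprobs, advantages, returns in rollouts:
                advantages = (advantages - adv_mean) / (adv_std + 1e-8)
                new_logprobs = self.get_logprobs(prompt, response)
                predicted_values = self.get_value(prompt, response)

                # Policy loss (clipped surrogate)
                ratio = torch.exp(new_logprobs - old_logprobs)
                surr1 = ratio * advantages
                surr2 = torch.clamp(ratio, 1 - self.config.clip_range, 1 + self.config.clip_range) * advantages
                policy_losses.append(-torch.min(surr1, surr2).mean())

                # Value loss
                value_losses.append(F.mse_loss(predicted_values, returns))

            policy_loss = torch.stack(policy_losses).mean()
            value_loss = torch.stack(value_losses).mean()

            # Total loss
            total_loss = policy_loss + self.config.value_loss_coeff * value_loss

            total_loss.backward()
            self.optimizer.step()
            self.optimizer.zero_grad()

        return {
            "policy_loss": policy_loss.item(),
            "value_loss": value_loss.item(),
            # Mean KL-penalized reward summed over each response
            "mean_reward": torch.stack(all_rewards).mean().item(),
            "mean_kl": torch.stack(all_kls).mean().item(),
        }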
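The GAE recursion in compute_advantages can be checked by hand on a short trajectory. The following sketch reimplements the same recursion with plain Python lists so the arithmetic is easy to follow; the rewards and value estimates are made-up numbers, and the defaults mirror the `gamma` and `gae_lambda` values in PPOConfig.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory (plain lists)."""
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        # The value after the final step is taken to be zero
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

# Hypothetical three-token response: reward arrives only at the last token
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8]
advantages, returns = gae(rewards, values)
print("advantages:", [round(a, 4) for a in advantages])
print("returns:   ", [round(r, 4) for r in returns])
```

Working backwards, the TD errors are 0.2, 0.2, and 0.1, so the advantages are 0.2, then 0.2 + 0.95 × 0.2 = 0.39, then 0.1 + 0.95 × 0.39 = 0.4705. Setting `lam=1.0` reduces each advantage to the remaining reward minus the value estimate at that step.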

DPO (Direct Preference Optimization)

DPO eliminates the need for a separate reward model by directly optimizing the policy from preferences.

Definition: DPO Loss

DPO reparameterizes the reward as:

r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)
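Here $Z(x)$ is a partition function that depends only on the prompt. As a short worked step, in the notation of the reparameterization above, substituting this reward into the Bradley-Terry reward difference $r(x, y_w) - r(x, y_l)$ shows why $Z(x)$ never has to be computed:

```latex
\begin{aligned}
r(x, y_w) - r(x, y_l)
  &= \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x)
   - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \beta \log Z(x) \\
  &= \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}
   - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}
\end{aligned}
```

The $\beta \log Z(x)$ terms cancel because both responses share the same prompt, leaving an expression that depends only on policy and reference log-probabilities.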

The DPO loss is:

\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]
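For intuition, the loss for a single preference pair can be evaluated from four sequence log-probabilities. This is a minimal standard-library sketch; the function names and the log-probability values are illustrative assumptions.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Hypothetical log-probabilities: the policy starts as a copy of the reference
at_init = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
# The policy has since raised the chosen response and lowered the rejected one
after_update = dpo_pair_loss(-10.0, -18.0, -12.0, -15.0)
print(f"loss at initialization: {at_init:.4f}")
print(f"loss after update:      {after_update:.4f}")
```

When the policy equals the reference, both log-ratios are zero and the loss is $\log 2$ regardless of the raw log-probabilities. The loss only falls as the policy shifts probability toward the chosen response relative to the reference.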

import torch
import torch.nn.functional as F

class DPOTrainer:
    def __init__(self, policy_model, ref_model, tokenizer, beta=0.1):
        self.policy = policy_model
        self.ref_model = ref_model
        self.tokenizer = tokenizer
        self.beta = beta

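    def get_logprobs(self, model, input_ids, labels):
        """Compute the summed log probability of each sequence in the batch."""
        # Assumes right-padded batches; keep padding out of attention
        attention_mask = (input_ids != self.tokenizer.pad_token_id).long()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Logits at position t predict token t + 1, so shift by one
        logits = outputs.logits[:, :-1, :]
        labels = labels[:, 1:]

        log_probs = F.log_softmax(logits, dim=-1)
        per_token_logprobs = torch.gather(
            log_probs, 2, labels.unsqueeze(-1)
        ).squeeze(-1)

        # Mask padding (if pad_token is the EOS token, a trailing EOS is masked too)
        mask = (labels != self.tokenizer.pad_token_id).float()
        # Prompt tokens are included in the sum; they cancel in the loss when chosen and
        # rejected share the same prompt, but they shift the logged per-sequence rewards
        seq_logprobs = (per_token_logprobs * mask).sum(dim=-1)
        return seq_logprobs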

    def dpo_loss(self, batch):
        """Compute DPO loss from preference pairs."""
        chosen_ids = batch["chosen_input_ids"]
        rejected_ids = batch["rejected_input_ids"]

        with torch.no_grad():
            ref_chosen_logps = self.get_logprobs(self.ref_model, chosen_ids, chosen_ids)
            ref_rejected_logps = self.get_logprobs(self.ref_model, rejected_ids, rejected_ids)

        policy_chosen_logps = self.get_logprobs(self.policy, chosen_ids, chosen_ids)
        policy_rejected_logps = self.get_logprobs(self.policy, rejected_ids, rejected_ids)

        chosen_logratios = policy_chosen_logps - ref_chosen_logps
        rejected_logratios = policy_rejected_logps - ref_rejected_logps

        logits = self.beta * (chosen_logratios - rejected_logratios)
        loss = -F.logsigmoid(logits).mean()

        chosen_rewards = self.beta * chosen_logratios
        rejected_rewards = self.beta * rejected_logratios

        return loss, {
            "chosen_reward": chosen_rewards.mean().item(),
            "rejected_reward": rejected_rewards.mean().item(),
            "reward_margin": (chosen_rewards - rejected_rewards).mean().item(),
            "accuracy": (logits > 0).float().mean().item(),
        }

Comparison of Alignment Methods

Method	Reward Model	Training Stability	Compute Cost	Sample Efficiency
PPO	Required	Moderate	High	Low
DPO	Not required	High	Moderate	High
RLHF (full)	Required	Low	Very high	Low
KTO	Not required	High	Moderate	Moderate
IPO	Not required	High	Moderate	High

Reward Hacking

Models may exploit reward model weaknesses without producing genuinely better outputs.

Definition: Reward Hacking Problem

The policy learns to maximize the learned reward $r_\phi$ rather than true quality, so the policy that is optimal under the reward model need not be optimal under the true reward:

\arg\max_\theta \mathbb{E}_{x} [r_\phi(x, \pi_\theta(x))] \neq \arg\max_\theta \mathbb{E}_{x} [r_{\text{true}}(x, \pi_\theta(x))]

Mitigation strategies include KL penalties, reward model ensembles, length penalties, and periodic reward model retraining.

Technique	Description	Effectiveness
KL penalty	Penalize deviation from reference policy	Moderate
Reward ensemble	Average multiple reward models	High
Length penalty	Penalize unnecessarily long responses	High
Periodic retraining	Retrain reward model on new data	High
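Two of the techniques in the table are simple enough to sketch directly. The example below averages scores from several reward models and subtracts a penalty for tokens beyond a length budget. The `pessimism` term (subtracting the ensemble's standard deviation), the token budget, and the per-token penalty are assumptions added for illustration, not values from the text above.

```python
from statistics import mean, pstdev

def ensemble_reward(scores, pessimism=0.0):
    """Average reward across models, optionally penalizing disagreement.

    pessimism=0.0 is plain averaging; larger values subtract a multiple of
    the ensemble's standard deviation (an assumed, conservative variant).
    """
    return mean(scores) - pessimism * pstdev(scores)

def length_penalized(reward, num_tokens, max_tokens=256, penalty_per_token=0.01):
    """Subtract a linear penalty for every token beyond max_tokens."""
    excess = max(0, num_tokens - max_tokens)
    return reward - penalty_per_token * excess

# Hypothetical scores from three reward models for the same response
scores = [1.0, 2.0, 3.0]
print(f"mean reward:        {ensemble_reward(scores):.3f}")
print(f"pessimistic reward: {ensemble_reward(scores, pessimism=1.0):.3f}")
print(f"after length check: {length_penalized(ensemble_reward(scores), num_tokens=300):.3f}")
```

A response that one reward model scores far above the others ends up with a lower pessimistic reward than one the models agree on, which makes a single model's blind spot harder to exploit.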

Best Practices

Start with strong SFT - RLHF improves upon, not replaces, supervised fine-tuning
Collect diverse preferences - Avoid systematic labeling biases
Monitor KL divergence - Prevent policy collapse or reward hacking
Use a reference model - Keep a frozen baseline policy to anchor the KL penalty and stabilize training
Evaluate with humans - Automatic metrics alone are insufficient
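As one way to act on the KL monitoring advice in the list above, the sketch below estimates KL from per-token log-probabilities of sampled tokens and flags when a running average exceeds a budget. The helper names, the window size, and the budget value are hypothetical choices for illustration.

```python
def estimate_kl(policy_logprobs, ref_logprobs):
    """Monte Carlo estimate of KL(policy || reference) from policy-sampled tokens."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must have the same length")
    diffs = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    return sum(diffs) / len(diffs)

def kl_exceeds_budget(history, budget, window=3):
    """True when the mean of the last `window` KL estimates is above `budget`."""
    recent = history[-window:]
    return sum(recent) / len(recent) > budget

# Hypothetical per-step KL estimates logged during training
history = [0.1, 0.2, 0.9, 1.1, 1.3]
print("latest estimate:", estimate_kl([-1.0, -2.0], [-1.5, -2.5]))
print("over budget:", kl_exceeds_budget(history, budget=1.0))
```

A sustained rise in this estimate is a cue to inspect samples for reward hacking, raise the KL coefficient, or stop training, depending on the setup.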

Key Takeaways

PPO is the most established RLHF algorithm but requires significant compute
DPO provides a simpler, more stable alternative without a separate reward model
Reward modeling quality is critical for successful RLHF
KL regularization helps prevent catastrophic policy collapse by keeping the policy close to the reference model
Human evaluation remains essential for measuring true alignment
