RLHF and Alignment

Reinforcement Learning from Human Feedback (RLHF) trains language models to produce outputs that humans prefer, going beyond what supervised fine-tuning alone can achieve.

The Alignment Problem

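Pre-trained models optimize for next-token prediction, not for being helpful, harmless, or honest. RLHF addresses this gap by adding a training signal derived from human preference judgments, which steers model behavior toward what people actually want.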

| Stage | Objective | Data Required |
| --- | --- | --- |
| Pre-training | Next-token prediction | Trillions of tokens |
| SFT | Task completion | Thousands of demonstrations |
| Reward Modeling | Predict human preferences | Paired preference comparisons |
| RL Optimization | Maximize reward with constraints | Prompts |

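RLHF Pipeline

The pipeline runs the stages above in order: pre-training, supervised fine-tuning (SFT), reward modeling, and RL optimization.
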
Reward Modeling

The reward model learns to predict human preferences from pairwise comparisons.

Definition: Bradley-Terry Reward Model

Given pairs of responses $(y_w, y_l)$ where $y_w$ is preferred over $y_l$, the Bradley-Terry model defines:

$$P(y_w \succ y_l \mid x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$

The reward model is trained with:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]$$
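
To make the two formulas concrete, here is a minimal plain-Python sketch that evaluates the Bradley-Terry preference probability and the per-pair loss for scalar rewards. The function names and the reward values are illustrative assumptions, not part of the lesson's trainer.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return sigmoid(r_chosen - r_rejected)

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of a single preference pair."""
    return -math.log(preference_probability(r_chosen, r_rejected))

# Made-up scalar rewards, used only to illustrate the formulas
p_equal = preference_probability(1.0, 1.0)    # equal rewards: no preference
p_margin = preference_probability(2.0, 0.0)   # chosen response ahead by 2
loss_equal = pairwise_loss(1.0, 1.0)
loss_margin = pairwise_loss(2.0, 0.0)
```

Only the reward difference matters: adding a constant to both rewards leaves the probability and the loss unchanged, and a larger margin in favor of the chosen response lowers the loss.
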
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class RewardModelTrainer:
    def __init__(self, model_name="meta-llama/Llama-2-7b-hf"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=1
        )
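        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        # The classification head uses pad_token_id to find the last real token
        # in each padded sequence, so the model config must know it too
        self.model.config.pad_token_id = self.tokenizer.pad_token_id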

    def compute_reward(self, prompt, response):
        text = f"[INST] {prompt} [/INST] {response}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            reward = self.model(**inputs).logits[0, 0]
        return reward.item()

    def train_step(self, batch):
        chosen_texts = [
            f"[INST] {p} [/INST] {c}" for p, c in zip(batch["prompts"], batch["chosen"])
        ]
        rejected_texts = [
            f"[INST] {p} [/INST] {r}" for p, r in zip(batch["prompts"], batch["rejected"])
        ]

        chosen_inputs = self.tokenizer(
            chosen_texts, padding=True, truncation=True, return_tensors="pt"
        )
        rejected_inputs = self.tokenizer(
            rejected_texts, padding=True, truncation=True, return_tensors="pt"
        )

        chosen_rewards = self.model(**chosen_inputs).logits[:, 0]
        rejected_rewards = self.model(**rejected_inputs).logits[:, 0]

        loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
        return loss, chosen_rewards.mean().item(), rejected_rewards.mean().item()

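# Example training loop
# Assumes preference_dataloader yields dicts with "prompts", "chosen", and
# "rejected" lists of strings; the learning rate below is illustrative
trainer = RewardModelTrainer()
trainer.model.train()
optimizer = torch.optim.AdamW(trainer.model.parameters(), lr=1e-5)
for batch in preference_dataloader:
    loss, chosen_r, rejected_r = trainer.train_step(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"Loss: {loss.item():.4f}, Chosen R: {chosen_r:.4f}, Rejected R: {rejected_r:.4f}")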
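
Besides the loss, a quick sanity check on a reward model is how often it ranks the preferred response above the rejected one on held-out pairs. The sketch below assumes the rewards have already been computed (for instance with `compute_reward` above); the helper name and the numbers are hypothetical.

```python
from typing import Sequence

def pairwise_accuracy(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Fraction of pairs where the chosen response receives the higher reward."""
    if len(chosen) != len(rejected):
        raise ValueError("chosen and rejected must have the same length")
    if not chosen:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for c, r in zip(chosen, rejected) if c > r)
    return correct / len(chosen)

# Hypothetical held-out rewards, one entry per preference pair
held_out_chosen = [1.2, 0.3, -0.5, 2.0]
held_out_rejected = [0.4, 0.9, -1.0, 1.5]
accuracy = pairwise_accuracy(held_out_chosen, held_out_rejected)
```

Here the model ranks three of the four pairs correctly. An accuracy near 0.5 would mean the reward model is no better than chance at reproducing the human preferences.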

PPO (Proximal Policy Optimization)

PPO is the primary algorithm used for RLHF optimization, balancing reward maximization with policy stability.

Definition: PPO-Clipped Objective

$$\mathcal{L}^{\text{PPO}}(\theta) = \mathbb{E}_t \left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t, \ \text{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]$$

where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping parameter (typically 0.2).
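
The effect of the clip is easiest to see on single samples. The following plain-Python sketch evaluates the term inside the expectation for one probability ratio and one advantage; the function name and the example values are assumptions chosen for illustration.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage, ratio above 1 + epsilon: the gain is capped by the clipped term
capped = clipped_surrogate(ratio=1.5, advantage=2.0)

# Negative advantage, ratio below 1 - epsilon: the clipped term is the smaller one
floored = clipped_surrogate(ratio=0.5, advantage=-1.0)

# Negative advantage, ratio above 1 + epsilon: the unclipped term is kept,
# so moving further toward a bad action is still fully penalized
penalized = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

In the first two cases the selected term no longer depends on the ratio, so its gradient with respect to the policy is zero. That is how the objective discourages updates that move the policy far from the one that collected the data.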

import torch
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class PPOConfig:
    kl_coeff: float = 0.1
    clip_range: float = 0.2
    value_loss_coeff: float = 0.5
    gamma: float = 1.0
    gae_lambda: float = 0.95
    learning_rate: float = 1e-5
    ppo_epochs: int = 4
    mini_batch_size: int = 4

class PPOTrainer:
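    def __init__(self, policy_model, ref_model, reward_model, tokenizer, config=None):
        self.config = config or PPOConfig()
        self.policy = policy_model
        self.ref_model = ref_model
        self.reward_model = reward_model
        self.tokenizer = tokenizer
        # Assumes the value head shares parameters with (or is part of) the policy model
        self.optimizer = torch.optim.AdamW(
            self.policy.parameters(), lr=self.config.learning_rate
        )

    def compute_kl_penalty(self, logprobs, ref_logprobs):
        """Per-token log-ratio, a sample-based estimate of KL(policy || reference)."""
        return logprobs - ref_logprobs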

    def compute_advantages(self, rewards, values):
        """Generalized Advantage Estimation (GAE)."""
        advantages = torch.zeros_like(rewards)
        last_gae = 0

        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]

            delta = rewards[t] + self.config.gamma * next_value - values[t]
            advantages[t] = last_gae = delta + (
                self.config.gamma * self.config.gae_lambda * last_gae
            )

        returns = advantages + values
        return advantages, returns

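    def ppo_step(self, prompts):
        """Single PPO optimization step over a batch of prompts.

        Assumes generate_response, get_logprobs, get_ref_logprobs, get_reward,
        and get_value are implemented elsewhere on this class. get_logprobs and
        get_value return per-token 1-D tensors and must be differentiable;
        get_reward returns a scalar score for the whole response.
        """
        rollouts = []

        # Collect experience without gradients; these log-probs define the "old" policy
        with torch.no_grad():
            for prompt in prompts:
                response = self.generate_response(prompt)
                old_logprobs = self.get_logprobs(prompt, response)
                ref_logprobs = self.get_ref_logprobs(prompt, response)
                values = self.get_value(prompt, response)

                # Per-token KL penalty, with the reward model score added on the
                # final token (a common convention, since the score is sequence-level)
                kl = self.compute_kl_penalty(old_logprobs, ref_logprobs)
                rewards = -self.config.kl_coeff * kl
                rewards[-1] += self.get_reward(prompt, response)

                advantages, returns = self.compute_advantages(rewards, values)
                rollouts.append({
                    "prompt": prompt, "response": response, "old_logprobs": old_logprobs,
                    "advantages": advantages, "returns": returns, "rewards": rewards, "kl": kl,
                })

        # Normalize advantages once, using statistics from the whole batch
        all_advantages = torch.cat([r["advantages"] for r in rollouts])
        adv_mean, adv_std = all_advantages.mean(), all_advantages.std()

        # PPO updates: re-evaluate the current policy on the stored rollouts
        for _ in range(self.config.ppo_epochs):
            for r in rollouts:
                advantages = (r["advantages"] - adv_mean) / (adv_std + 1e-8)
                new_logprobs = self.get_logprobs(r["prompt"], r["response"])
                predicted_values = self.get_value(r["prompt"], r["response"])

                # Policy loss (clipped surrogate)
                ratio = torch.exp(new_logprobs - r["old_logprobs"])
                surr1 = ratio * advantages
                surr2 = torch.clamp(ratio, 1 - self.config.clip_range, 1 + self.config.clip_range) * advantages
                policy_loss = -torch.min(surr1, surr2).mean()

                # Value loss
                value_loss = F.mse_loss(predicted_values, r["returns"])

                # Total loss
                total_loss = policy_loss + self.config.value_loss_coeff * value_loss

                self.optimizer.zero_grad()
                total_loss.backward()
                self.optimizer.step()

        # Losses are reported from the last update; reward and KL are batch averages
        return {
            "policy_loss": policy_loss.item(),
            "value_loss": value_loss.item(),
            "mean_reward": torch.stack([r["rewards"].sum() for r in rollouts]).mean().item(),
            "mean_kl": torch.cat([r["kl"] for r in rollouts]).mean().item(),
        }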
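
The GAE recursion in `compute_advantages` can be checked by hand on a toy trajectory. This is a plain-Python sketch of the same recursion (lists instead of tensors); the function name `gae` and the three-token example are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple

def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Generalized Advantage Estimation over one trajectory."""
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        # The value after the final step is taken to be zero (episode ends)
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

# Toy 3-token response: reward arrives only on the final token
toy_rewards = [0.0, 0.0, 1.0]
toy_values = [0.5, 0.5, 0.5]

# lam = 1 reduces to Monte Carlo: advantage = total return - value
mc_adv, mc_ret = gae(toy_rewards, toy_values, gamma=1.0, lam=1.0)

# lam = 0 reduces to the one-step TD error at each position
td_adv, td_ret = gae(toy_rewards, toy_values, gamma=1.0, lam=0.0)
```

With `lam=1.0` every token gets advantage 0.5 (the return of 1.0 minus the value 0.5), while with `lam=0.0` only the final token does. Intermediate values of `gae_lambda` trade bias against variance between these two extremes.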

DPO (Direct Preference Optimization)

DPO eliminates the need for a separate reward model by directly optimizing the policy from preferences.

Definition: DPO Loss

DPO reparameterizes the reward as:

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
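
A short worked step shows why the partition function $Z(x)$ never has to be computed. Substituting the reparameterized reward into the Bradley-Terry model from the reward modeling section, the $\beta \log Z(x)$ term appears once for each response and cancels, because it depends only on the prompt:

```latex
\begin{aligned}
P(y_w \succ y_l \mid x)
  &= \sigma\big(r(x, y_w) - r(x, y_l)\big) \\
  &= \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x)
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log Z(x)\Big) \\
  &= \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)
\end{aligned}
```

Taking the negative log-likelihood of this probability over the preference data gives a loss that depends only on the policy and the reference model.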

The DPO loss is:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
import torch
import torch.nn.functional as F

class DPOTrainer:
    def __init__(self, policy_model, ref_model, tokenizer, beta=0.1):
        self.policy = policy_model
        self.ref_model = ref_model
        self.tokenizer = tokenizer
        self.beta = beta

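    def get_logprobs(self, model, input_ids, labels):
        """Sum per-token log probabilities into one log probability per sequence."""
        outputs = model(input_ids=input_ids)
        # Position t predicts token t + 1, so drop the last logit and the first label
        logits = outputs.logits[:, :-1, :]
        labels = labels[:, 1:]

        log_probs = F.log_softmax(logits, dim=-1)
        per_token_logprobs = torch.gather(
            log_probs, 2, labels.unsqueeze(-1)
        ).squeeze(-1)

        # Mask padding. Prompt tokens are included in the sum; when chosen and
        # rejected share the same prompt, their contribution cancels in the
        # difference used by the loss (the reported rewards still include it).
        mask = (labels != self.tokenizer.pad_token_id).float()
        seq_logprobs = (per_token_logprobs * mask).sum(dim=-1)
        return seq_logprobs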

    def dpo_loss(self, batch):
        """Compute DPO loss from preference pairs."""
        chosen_ids = batch["chosen_input_ids"]
        rejected_ids = batch["rejected_input_ids"]

        with torch.no_grad():
            ref_chosen_logps = self.get_logprobs(self.ref_model, chosen_ids, chosen_ids)
            ref_rejected_logps = self.get_logprobs(self.ref_model, rejected_ids, rejected_ids)

        policy_chosen_logps = self.get_logprobs(self.policy, chosen_ids, chosen_ids)
        policy_rejected_logps = self.get_logprobs(self.policy, rejected_ids, rejected_ids)

        chosen_logratios = policy_chosen_logps - ref_chosen_logps
        rejected_logratios = policy_rejected_logps - ref_rejected_logps

        logits = self.beta * (chosen_logratios - rejected_logratios)
        loss = -F.logsigmoid(logits).mean()

        chosen_rewards = self.beta * chosen_logratios
        rejected_rewards = self.beta * rejected_logratios

        return loss, {
            "chosen_reward": chosen_rewards.mean().item(),
            "rejected_reward": rejected_rewards.mean().item(),
            "reward_margin": (chosen_rewards - rejected_rewards).mean().item(),
            "accuracy": (logits > 0).float().mean().item(),
        }
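
A useful check on any DPO implementation is the value of the loss at initialization. When the policy equals the reference model, both log-ratios are zero, so the loss is $\log 2 \approx 0.693$ regardless of the data. The plain-Python sketch below evaluates the per-pair loss from sequence log-probabilities; the function name and the log-probability values are illustrative assumptions.

```python
import math

def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))

# Policy identical to the reference: both log-ratios are zero
init_loss = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raised the chosen response and lowered the rejected one
improved_loss = dpo_pair_loss(-10.0, -17.0, -12.0, -15.0)
```

If a training run does not start near 0.693, the policy and reference model were probably not initialized from the same checkpoint, or the log-probabilities are being computed inconsistently.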

Comparison of Alignment Methods

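| Method | Reward Model | Training Stability | Compute Cost | Sample Efficiency |
| --- | --- | --- | --- | --- |
| PPO | Required | Moderate | High | Low |
| DPO | Not required | High | Moderate | High |
| RLHF (full) | Required | Low | Very high | Low |
| KTO | Not required | High | Moderate | Moderate |
| IPO | Not required | High | Moderate | High |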

Reward Hacking

Models may exploit reward model weaknesses without producing genuinely better outputs.

Definition: Reward Hacking Problem

The policy learns to maximize the learned reward $r_\phi$ rather than true quality, so the two objectives can have different maximizers:

$$\arg\max_\theta \mathbb{E}_{x} [r_\phi(x, \pi_\theta(x))] \neq \arg\max_\theta \mathbb{E}_{x} [r_{\text{true}}(x, \pi_\theta(x))]$$

Mitigation strategies include KL penalties, reward ensembles, length penalties, and periodic reward model retraining.

| Technique | Description | Effectiveness |
| --- | --- | --- |
| KL penalty | Penalize deviation from reference policy | Moderate |
| Reward ensemble | Average multiple reward models | High |
| Length penalty | Penalize unnecessarily long responses | High |
| Periodic retraining | Retrain reward model on new data | High |
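
Two of the techniques in the table, reward ensembles and length penalties, can be combined in a few lines. This is a minimal sketch, assuming the scores from several reward models are already available; `shaped_reward`, `target_length`, and `length_coeff` are hypothetical names and values, not settings from the lesson.

```python
from statistics import mean
from typing import Sequence

def shaped_reward(
    ensemble_scores: Sequence[float],
    response_length: int,
    target_length: int = 200,
    length_coeff: float = 0.001,
) -> float:
    """Average the ensemble scores, then penalize tokens beyond target_length."""
    if not ensemble_scores:
        raise ValueError("need at least one reward model score")
    excess_tokens = max(0, response_length - target_length)
    return mean(ensemble_scores) - length_coeff * excess_tokens

# Same ensemble scores, different response lengths (made-up numbers)
short_reward = shaped_reward([1.0, 2.0, 3.0], response_length=150)
long_reward = shaped_reward([1.0, 2.0, 3.0], response_length=700)
```

Averaging makes it harder for the policy to exploit a quirk of any single reward model, and the length term removes the incentive to pad responses when the reward models happen to favor longer text.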

Best Practices

  1. Start with a strong SFT model - RLHF improves upon supervised fine-tuning; it does not replace it
  2. Collect diverse preferences - Avoid systematic labeling biases
  3. Monitor KL divergence - Catch policy collapse or reward hacking early
  4. Use a reference model - Maintain a baseline for stable training
  5. Evaluate with humans - Automatic metrics alone are insufficient
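
Monitoring KL divergence in practice usually means tracking a sample-based estimate during training. The sketch below averages the per-token log-ratio between the policy and the reference model over tokens sampled from the policy, then compares it to a budget. The function names, the example log-probabilities, and the `max_kl` threshold are all hypothetical; a sensible budget depends on the model and task.

```python
from typing import Sequence

def mean_kl_estimate(
    policy_logprobs: Sequence[float], ref_logprobs: Sequence[float]
) -> float:
    """Mean of log pi(a) - log pi_ref(a) over tokens sampled from the policy."""
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    return sum(diffs) / len(diffs)

def kl_within_budget(mean_kl: float, max_kl: float = 0.5) -> bool:
    """True while the policy stays within the allowed drift from the reference."""
    return mean_kl <= max_kl

# Made-up per-token log-probabilities for one sampled response
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.5, -0.7, -2.3]
kl_value = mean_kl_estimate(policy_lp, ref_lp)
within_budget = kl_within_budget(kl_value)
```

A steadily rising estimate alongside a rising reward is a typical warning sign that the policy is drifting toward outputs the reward model scores highly for the wrong reasons.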

Key Takeaways

  • PPO is the most established RLHF algorithm but requires significant compute
  • DPO provides a simpler, more stable alternative without a separate reward model
  • Reward modeling quality is critical for successful RLHF
  • KL regularization prevents catastrophic policy collapse
  • Human evaluation remains essential for measuring true alignment
