🎯 The Interview Question
"Explain the difference between value-based and policy-based reinforcement learning methods. How does Q-Learning work, and what is the role of the Bellman equation? What is the policy gradient theorem, and how does PPO improve upon it? What are the challenges of training RL agents, and how does RLHF work?"
This question is crucial for DeepMind (AlphaGo, Atari) and OpenAI (RLHF for ChatGPT, robotics).
📚 Detailed Answer
RL Framework
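Markov Decision Process (MDP): a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where
- $\mathcal{S}$: state space
- $\mathcal{A}$: action space
- $P(s' \mid s, a)$: transition probability
- $R(s, a)$: reward function
- $\gamma \in [0, 1]$: discount factor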
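Objective: Maximize expected cumulative reward:
$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
where $r_t$ is the reward received at step $t$.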
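As a small illustration of the discounted sum in the objective, the sketch below computes the return of one finite reward sequence. The function name `discounted_return` and the example rewards are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Work backwards using the recursion G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The backward recursion avoids computing powers of the discount factor explicitly and is the same pattern usually used to compute returns for every step of an episode.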
Value-Based Methods
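State-Value Function
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$
Action-Value Function (Q-Function)
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a\right]$$
Bellman Equation
Expectation form (for a fixed policy $\pi$):
$$V^{\pi}(s) = \sum_{a} \pi(a \mid s)\left[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s')\right]$$
Optimality form (for the optimal Q-function):
$$Q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^{*}(s', a')$$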
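To make the role of the Bellman equation concrete, here is a minimal sketch that applies the Bellman optimality backup repeatedly to a tabular Q-function when the model is known. The 2-state MDP, the array layout `P[s, a, s']` and the function name `q_value_iteration` are assumptions chosen for illustration.

```python
import numpy as np

def q_value_iteration(P, R, gamma, iters=100):
    """Repeatedly apply the Bellman optimality backup to a tabular Q."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        V = Q.max(axis=1)          # V(s') = max_a' Q(s', a')
        Q = R + gamma * (P @ V)    # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    return Q

# Hypothetical MDP: from state 0, action 0 stays put (reward 0) and
# action 1 moves to the absorbing state 1 (reward 1).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])                  # R[s, a]
Q_star = q_value_iteration(P, R, gamma=0.9)
print(Q_star)
```

Q-Learning performs the same backup, but from sampled transitions instead of a known transition matrix.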
Q-Learning Algorithm
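Update rule:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
where $\alpha$ is the learning rate.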
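Key insight: Q-Learning is off-policy. Because the target takes the max over next actions, it learns the optimal Q-values regardless of the behavior policy, provided every state-action pair continues to be visited.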
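The off-policy property can be seen in a small tabular sketch: the agent below acts uniformly at random, yet the learned Q-values approach those of the optimal policy. The 3-state chain environment and the names `step` and `q_learning` are hypothetical and exist only for this example.

```python
import numpy as np

def step(state, action):
    """Hypothetical chain 0 <-> 1 -> 2 (terminal). Action 0 = left, 1 = right."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward, next_state == 2

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((3, 2))  # row 2 is the terminal state and stays at zero
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = int(rng.integers(2))  # uniformly random behavior policy
            next_state, reward, done = step(state, action)
            # The target uses the greedy value of the next state,
            # not the action the behavior policy will actually take there
            target = reward + (0.0 if done else gamma * Q[next_state].max())
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q

Q = q_learning()
print(Q[:2].round(2))
print(Q[:2].argmax(axis=1))  # greedy action in each non-terminal state
```

In this deterministic chain the optimal values are 0.81 and 0.9 for state 0 (left, right) and 0.81 and 1.0 for state 1, which the estimates should approach.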
Deep Q-Network (DQN)
Approximates the Q-function with a neural network with parameters $\theta$:
$$Q(s, a; \theta) \approx Q^{*}(s, a)$$
Key innovations:
- Experience replay: Store transitions $(s, a, r, s')$ in a buffer and sample mini-batches from it
- Target network: Separate network for target computation, updated periodically
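Loss function:
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^2\right]$$
where $\theta^{-}$ are the target network parameters and $\mathcal{D}$ is the replay buffer.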
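The two DQN ingredients, a replay buffer and a loss computed against a separate target network, can be sketched as follows. This is an illustrative NumPy version, not a training loop: `ReplayBuffer` and `dqn_loss` are hypothetical names, the networks are passed in as plain callables, and the `done` flag (which stops bootstrapping at terminal states) is an addition not mentioned in the text.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng):
        # Uniform sampling breaks the correlation between consecutive steps
        batch = rng.sample(list(self.storage), batch_size)
        s, a, r, s_next, done = (np.array(x) for x in zip(*batch))
        return s, a, r, s_next, done

def dqn_loss(q_online, q_target, batch, gamma=0.99):
    """Mean squared TD error; both callables map states to (batch, n_actions)."""
    s, a, r, s_next, done = batch
    q_sa = q_online(s)[np.arange(len(a)), a]      # Q(s, a; theta)
    next_max = q_target(s_next).max(axis=1)       # max_a' Q(s', a'; theta^-)
    # Assumption: terminal transitions do not bootstrap from the next state
    targets = r + gamma * (1.0 - done.astype(float)) * next_max
    return float(np.mean((targets - q_sa) ** 2))
```

In a real implementation the target is treated as a constant when differentiating the loss, and the target parameters are copied from the online network every fixed number of steps.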
Policy-Based Methods
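Policy Gradient Theorem
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s, a)\right]$$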
Intuition: Increase probability of actions that lead to higher returns.
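REINFORCE Algorithm
$$\theta \leftarrow \theta + \alpha\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t$$
where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from step $t$ and $\alpha$ is the step size.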
Problem: High variance, slow convergence.
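A minimal sketch of REINFORCE, assuming the simplest possible setting: a two-armed bandit with a softmax policy, where each episode is a single step and the return equals the reward. The names `softmax` and `reinforce_bandit` and the reward values are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return z / z.sum()

def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(arm_rewards))  # one logit per arm
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choice(len(theta), p=probs)
        G = arm_rewards[action]         # one-step episode: return = reward
        # For a softmax policy, grad of log pi(action) w.r.t. logits
        # is onehot(action) - probs
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0
        theta += lr * G * grad_log_pi   # gradient ascent on expected return
    return softmax(theta)

print(reinforce_bandit([0.0, 1.0]))  # probability mass should shift to arm 1
```

Each update relies on a single sampled action, which is where the high variance noted above comes from.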
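Variance Reduction: Baseline
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, \left(Q^{\pi_{\theta}}(s, a) - b(s)\right)\right]$$
Subtracting a baseline $b(s)$ that does not depend on the action leaves the gradient unbiased but can reduce its variance.
Common baseline: $b(s) = V^{\pi_{\theta}}(s)$, which turns the weighting term into the advantage function $A^{\pi_{\theta}}(s, a) = Q^{\pi_{\theta}}(s, a) - V^{\pi_{\theta}}(s)$.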
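The effect of a baseline can be checked exactly on a toy case by enumerating the actions instead of sampling. The sketch below computes the mean and variance of the single-sample gradient estimator for one logit of a softmax policy; the function name and the numbers are assumptions for illustration.

```python
import numpy as np

def grad_estimator_stats(probs, rewards, baseline):
    """Exact mean and variance of the one-sample estimator for logit 0's gradient."""
    probs = np.asarray(probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # For a softmax policy: d log pi(a) / d theta_0 = 1[a == 0] - probs[0]
    score = (np.arange(len(probs)) == 0).astype(float) - probs[0]
    samples = score * (rewards - baseline)  # estimator value for each action
    mean = float(np.sum(probs * samples))
    var = float(np.sum(probs * (samples - mean) ** 2))
    return mean, var

probs, rewards = [0.5, 0.5], [10.0, 11.0]
print(grad_estimator_stats(probs, rewards, baseline=0.0))   # (-0.25, 27.5625)
print(grad_estimator_stats(probs, rewards, baseline=10.5))  # (-0.25, 0.0)
```

The mean is identical in both cases, so the baseline introduces no bias, while the variance drops because the large constant offset in the rewards no longer multiplies the score.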
Proximal Policy Optimization (PPO)
PPO constrains policy updates to prevent catastrophic changes:
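Clipped objective:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t\right)\right]$$
where:
- $r_t(\theta) = \dfrac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ (probability ratio)
- $\hat{A}_t$ is the advantage estimate
- $\epsilon$ is the clipping parameter (typically 0.2)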
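The clipped objective is short enough to write out directly. The following NumPy sketch evaluates it for a batch of log-probabilities and advantage estimates; `ppo_clip_objective` is a hypothetical name, and a real implementation would compute this inside an autodiff framework and maximize it.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective, averaged over the batch (to be maximized)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # pi_new / pi_old
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise minimum makes the objective a pessimistic bound
    return float(np.mean(np.minimum(unclipped, clipped)))

logp_old = np.log([0.5])
print(ppo_clip_objective(np.log([0.75]), logp_old, [1.0]))   # ratio 1.5, capped at 1.2
print(ppo_clip_objective(np.log([0.25]), logp_old, [-1.0]))  # ratio 0.5, floored: -0.8
```

With a positive advantage, raising the ratio above 1 + ε earns no extra objective; with a negative advantage, lowering it below 1 − ε earns none either, so the update has no incentive to move far from the old policy.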
Why PPO works:
- Simple to implement
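- Stable training: clipping removes the incentive to push the probability ratio outside $[1 - \epsilon, 1 + \epsilon]$
- Reasonable sample efficiency for an on-policy method: each batch of rollouts is reused for several epochs of mini-batch updates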
- Used in RLHF, robotics, game AI
RLHF (Reinforcement Learning from Human Feedback)
Used in ChatGPT and Claude to align models with human preferences:
Step 1: Supervised Fine-Tuning
Train base model on human-written responses.
Step 2: Reward Model Training
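Train a reward model $r_{\phi}$ on human preference comparisons:
$$L(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(r_{\phi}(x, y_w) - r_{\phi}(x, y_l)\right)\right]$$
where $y_w$ is the preferred response, $y_l$ is the dispreferred response, $x$ is the prompt, and $\sigma$ is the sigmoid function.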
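The pairwise preference loss for the reward model can be sketched in a few lines, given scalar reward scores for the preferred and dispreferred responses. The function name `preference_loss` and the example scores are assumptions for illustration.

```python
import numpy as np

def preference_loss(r_preferred, r_dispreferred):
    """Mean of -log sigmoid(r_preferred - r_dispreferred) over a batch of pairs."""
    margin = np.asarray(r_preferred, dtype=float) - np.asarray(r_dispreferred, dtype=float)
    # -log sigmoid(m) = log(1 + exp(-m)); logaddexp evaluates this stably
    return float(np.mean(np.logaddexp(0.0, -margin)))

print(preference_loss([0.0], [0.0]))   # log(2): the model cannot tell the pair apart
print(preference_loss([5.0], [-5.0]))  # close to 0: preferred response scores much higher
```

Only the difference between the two scores matters, so the reward model is determined up to an additive constant per prompt.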
Step 3: PPO Optimization
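Maximize reward while staying close to the SFT model:
$$\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_{\theta}(\cdot \mid x)}\left[r_{\phi}(x, y)\right] - \beta\, D_{\text{KL}}\left(\pi_{\theta}(\cdot \mid x)\ \|\ \pi_{\text{SFT}}(\cdot \mid x)\right)$$
where $\beta$ controls the strength of the KL penalty.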
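One common way to realize the "stay close to the SFT model" term is to subtract a sample-based KL estimate from the reward-model score of each sampled response. The sketch below assumes per-token log-probabilities are available from both models; `kl_penalized_reward` and the value of `beta` are illustrative, not taken from any particular system.

```python
import numpy as np

def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus beta times a sample-based KL estimate."""
    # For a response sampled from the policy, the sum over tokens of
    # log pi_theta - log pi_SFT is a single-sample estimate of the KL divergence
    kl_estimate = np.sum(np.asarray(logp_policy) - np.asarray(logp_sft))
    return float(reward - beta * kl_estimate)

# The policy assigns higher probability to its own sample than the SFT model does
print(kl_penalized_reward(1.0, [-1.0, -1.0], [-2.0, -2.0], beta=0.1))  # 1.0 - 0.1 * 2 = 0.8
```

This shaped reward is then what PPO optimizes, so the policy is discouraged from drifting into regions where the reward model was never trained.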
Challenges in RL
| Challenge | Description | Solution |
|---|---|---|
| Sample inefficiency | Needs millions of samples | Model-based RL, offline RL |
| Sparse rewards | Reward signal is rare or delayed | Reward shaping, curriculum learning, curiosity |
| Exploration | Stuck in local optima | ε-greedy, entropy bonus |
| Non-stationarity | Environment changes | Robust algorithms |
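The ε-greedy strategy listed in the table as an exploration remedy can be sketched as follows. The function name `epsilon_greedy` and the Q-values are made up for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))  # always the greedy action: 1
```

In practice ε is often annealed from a large value toward a small one, so that the agent explores broadly early on and exploits more as its estimates improve.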
Follow-Up Questions
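Q: What is the difference between on-policy and off-policy learning? A: On-policy methods learn from data collected by the current policy (REINFORCE). Off-policy methods can learn from data collected by a different behavior policy (Q-Learning). Because off-policy methods can reuse old data, they are usually more sample-efficient, but they are often less stable to train, especially when combined with function approximation and bootstrapping.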
Q: How does RLHF differ from standard RL? A: RLHF learns a reward model from human preferences rather than using a handcrafted reward function. This enables alignment with complex human values.
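Q: What is the role of entropy in RL? A: Entropy encourages exploration by preventing the policy from becoming too deterministic too early. It is typically added as a bonus term to the training objective (as in A3C and PPO) or to the reward itself (as in maximum-entropy methods such as SAC).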
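To make the entropy answer concrete, the sketch below computes the entropy of a discrete action distribution. The name `policy_entropy` is hypothetical, and the coefficient that weights the bonus in a training objective is left out.

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(nonzero * np.log(nonzero)))

print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # log(4): maximal for 4 actions
print(policy_entropy([0.97, 0.01, 0.01, 0.01]))  # much lower: nearly deterministic
```

Adding a small multiple of this quantity to the objective rewards the policy for keeping probability mass spread over several actions.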