Reinforcement Learning: Q-Learning, Policy Gradient, PPO — Asked at DeepMind & OpenAI

🎯 The Interview Question

"Explain the difference between value-based and policy-based reinforcement learning methods. How does Q-Learning work, and what is the role of the Bellman equation? What is the policy gradient theorem, and how does PPO improve upon it? What are the challenges of training RL agents, and how does RLHF work?"

This question is crucial for DeepMind (AlphaGo, Atari) and OpenAI (RLHF for ChatGPT, robotics).

📚 Detailed Answer

RL Framework

Markov Decision Process (MDP): $(S, A, P, R, \gamma)$

$S$ : state space
$A$ : action space
$P(s'|s,a)$ : transition probability
$R(s,a)$ : reward function
$\gamma \in [0,1]$ : discount factor ($\gamma < 1$ is needed for the infinite-horizon sum to converge in continuing tasks)

Objective: Find a policy $\pi$ that maximizes the expected discounted cumulative reward:

J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]
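
As a small illustration of the quantity inside the expectation, the sketch below computes the discounted return of one finite reward sequence. The function name `discounted_return` is hypothetical, and truncating the infinite sum to a finite episode is an assumption made for the example.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

$J(\pi)$ is the average of this quantity over trajectories sampled by following $\pi$.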

Value-Based Methods

State-Value Function

V^\pi(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t | s_0 = s\right]

Action-Value Function (Q-Function)

Q^\pi(s,a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t | s_0 = s, a_0 = a\right]

Bellman Equation

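Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')

Substituting $V^\pi(s') = \sum_{a'} \pi(a'|s') Q^\pi(s',a')$ gives the recursive form in terms of $Q^\pi$ alone:

Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s') Q^\pi(s',a')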
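
The Bellman equation can be used directly as an update rule. The sketch below repeatedly applies it to a tabular $Q$ until it stops changing (iterative policy evaluation). The two-state MDP, the uniform policy, and the name `evaluate_policy` are illustrative assumptions, not part of the original question.

```python
def evaluate_policy(P, R, pi, gamma, sweeps=500):
    """Iterate the Bellman backup for Q^pi on a tabular MDP.

    P[s][a] is a list of next-state probabilities, R[s][a] the reward,
    and pi[s][a] the probability of taking action a in state s.
    """
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        # V^pi(s) = sum_a pi(a|s) Q^pi(s, a)
        V = [sum(pi[s][a] * Q[s][a] for a in range(n_a)) for s in range(n_s)]
        # Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V^pi(s')
        Q = [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q


# Hypothetical MDP: in state 0, action 0 stays (reward 0) and action 1
# moves to the absorbing state 1 (reward 1). State 1 yields no reward.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 1.0], [0.0, 0.0]]
pi = [[0.5, 0.5], [0.5, 0.5]]
Q = evaluate_policy(P, R, pi, gamma=0.9)
```

For this example the fixed point can be checked by hand: $Q^\pi(0,1) = 1$ and $Q^\pi(0,0) = 0.9 \cdot (0.5\, Q^\pi(0,0) + 0.5)$, so $Q^\pi(0,0) = 0.45 / 0.55$.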

Q-Learning Algorithm

Update rule:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]

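where $\alpha$ is the learning rate and $(s, a, r, s')$ is a single observed transition. The bracketed term is the temporal-difference (TD) error.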

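Key insight: Q-Learning is off-policy. Because the target uses $\max_{a'}$ rather than the action the agent actually takes next, it learns the optimal Q-values under any behavior policy, provided every state-action pair keeps being visited.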
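
A minimal tabular sketch of the update rule, assuming a hypothetical 4-state chain environment (`step`) in which only reaching the last state pays a reward. The behavior policy is uniformly random, yet the learned values approach those of the optimal "always move right" policy, which illustrates the off-policy property.

```python
import random


def step(s, a):
    """Hypothetical chain: action 1 moves right, 0 moves left; state 3 is terminal."""
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == 3 else 0.0
    return s2, reward, s2 == 3


def q_learning(episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(4)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(2)  # behavior policy: uniformly random
            s2, r, done = step(s, a)
            # Bootstrap from the greedy action in s2, not from the action
            # the behavior policy will actually take there.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


Q = q_learning()
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(3)]
```

With $\gamma = 0.9$, the values for moving right should approach 0.81, 0.9, and 1.0 in states 0, 1, and 2.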

Deep Q-Network (DQN)

Approximates the optimal Q-function with a neural network:

Q(s,a; \theta) \approx Q^*(s,a)

Key innovations:

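Experience replay: Store transitions $(s,a,r,s')$ in a buffer and sample mini-batches from it, which breaks the correlation between consecutive samples and lets each transition be reused
Target network: A separate copy of the network computes the targets and is synced with the online network only periodically, so the regression target does not shift at every gradient step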

Loss function:

\mathcal{L}(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\right)^2\right]

where $\theta^-$ are the target network parameters, held fixed between periodic copies from $\theta$.
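
To make the loss concrete, the sketch below evaluates it on a tiny replay buffer. The "networks" are stand-in lookup tables rather than neural networks, and the `done` flag (which drops the bootstrap term on terminal transitions) is an assumption added for the example. Neither appears in the formula above.

```python
import random
from collections import deque


def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over a minibatch of (s, a, r, s2, done) tuples."""
    total = 0.0
    for s, a, r, s2, done in batch:
        # The target uses the frozen target network (theta^-), not the online one.
        bootstrap = 0.0 if done else gamma * max(q_target(s2))
        td_error = r + bootstrap - q_online(s)[a]
        total += td_error ** 2
    return total / len(batch)


# Replay buffer: a bounded FIFO queue sampled uniformly at random.
buffer = deque(maxlen=10_000)
buffer.append((0, 1, 1.0, 1, False))
buffer.append((1, 0, 0.0, 2, True))

# Stand-in Q-functions: action values indexed by an integer state.
online = {0: [0.0, 0.5], 1: [0.2, 0.0], 2: [0.0, 0.0]}
target = {0: [0.0, 0.4], 1: [0.1, 0.3], 2: [0.0, 0.0]}

batch = random.sample(list(buffer), k=2)
loss = dqn_loss(batch, lambda s: online[s], lambda s: target[s], gamma=0.9)
```

In a real DQN the loss would be differentiated with respect to $\theta$ only, and `target` would be overwritten with a copy of `online` every fixed number of steps.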

Policy-Based Methods

Policy Gradient Theorem

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]

Intuition: Increase probability of actions that lead to higher returns.

REINFORCE Algorithm

\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return.

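Problem: $G_t$ is a single-sample Monte Carlo estimate of $Q^\pi(s_t,a_t)$, so the gradient estimate has high variance and convergence is slow.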
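
A minimal sketch of the REINFORCE update, assuming a softmax policy over the arms of a hypothetical one-step bandit, so that each episode has a single action and $G_t$ is just the noisy reward. For softmax logits, $\nabla_\theta \log \pi_\theta(a)$ is the one-hot vector of $a$ minus the action probabilities.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(mean_rewards, steps=5000, alpha=0.1, noise=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)  # one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        G = mean_rewards[a] + rng.gauss(0.0, noise)  # noisy one-step return
        for i in range(len(theta)):
            # Gradient of log softmax w.r.t. logit i: 1[i == a] - pi(i)
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += alpha * grad_log * G
    return softmax(theta)


probs = reinforce_bandit([0.0, 1.0])
```

After training, most of the probability mass should sit on the second arm, whose mean reward is higher. Raising `noise` makes individual updates noisier, which is the variance problem described above.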

Variance Reduction: Baseline

\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot (Q^\pi(s,a) - b(s))\right]

Common baseline: $b(s) = V^\pi(s)$, which turns the weighting term into the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$.
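
Subtracting a baseline changes the variance but not the expected gradient. A short derivation, assuming a discrete action space and a baseline that depends only on the state:

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\right]
&= b(s) \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s) \\
&= b(s) \sum_a \nabla_\theta \pi_\theta(a|s) \\
&= b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) = b(s)\, \nabla_\theta 1 = 0
\end{aligned}
```

The second line uses $\pi_\theta \nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta$. The argument breaks if $b$ depends on the action, which is why the baseline is a function of $s$ only.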

Proximal Policy Optimization (PPO)

PPO constrains policy updates to prevent catastrophic changes:

Clipped objective:

\mathcal{L}^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]

where:

$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio between the new and old policies
$\hat{A}_t$ is the advantage estimate
$\epsilon$ is the clipping parameter (typically 0.2)
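
The clipped objective is easiest to understand by evaluating it on a few numbers. The sketch below computes it for a batch, assuming the log-probabilities and advantage estimates are already available. The function name `ppo_clip_objective` is hypothetical.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Average clipped surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                          # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)    # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * A.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage: no clipping, the full penalty applies.
penalized = ppo_clip_objective([math.log(1.5)], [0.0], [-1.0])
```

Taking the minimum makes the objective a pessimistic bound: the policy gains nothing by pushing the ratio further in the direction that already helps, but it is still fully penalized when a move in the wrong direction makes things worse.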

Why PPO works:

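Simple to implement: a first-order method that needs no second-order trust-region machinery
Stable training: clipping removes the incentive to move $r_t(\theta)$ outside $[1-\epsilon, 1+\epsilon]$
Better sample efficiency than vanilla policy gradient: each batch is reused for several epochs of minibatch updates, although as an on-policy method it still needs fresh data after every policy update
Widely used in RLHF, robotics, and game AI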

RLHF (Reinforcement Learning from Human Feedback)

Used to align language models such as ChatGPT and Claude with human preferences, in three steps:

Step 1: Supervised Fine-Tuning

Fine-tune the pretrained base model on human-written responses to obtain the SFT model.

Step 2: Reward Model Training

Train reward model on human preference comparisons:

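\mathcal{L}_{RM} = -\mathbb{E}\left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]

where $x$ is the prompt, $y_w$ is the preferred response, $y_l$ is the dispreferred response, $r_\phi$ is the reward model, and $\sigma$ is the logistic sigmoid.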
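
A minimal sketch of this pairwise loss, assuming the reward model's scalar scores for each pair have already been computed. The name `preference_loss` is hypothetical, and the loss is written in a numerically stable form of $-\log \sigma(m)$.

```python
import math


def preference_loss(r_chosen, r_rejected):
    """Mean of -log sigmoid(r_w - r_l) over paired reward scores."""
    losses = []
    for rw, rl in zip(r_chosen, r_rejected):
        margin = rw - rl
        # -log sigmoid(m) = log(1 + exp(-m)), computed without overflow.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


# Equal scores give log(2): the model cannot tell the responses apart.
tie = preference_loss([0.0], [0.0])
```

The loss falls toward 0 as the preferred response is scored increasingly higher than the dispreferred one, and grows roughly linearly when the ranking is wrong.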

Step 3: PPO Optimization

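Maximize the learned reward while penalizing divergence from the SFT model, which serves as the reference policy $\pi_{ref}$. The coefficient $\beta$ sets the strength of the KL penalty, and the expression below is an objective to maximize: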

\mathcal{L}_{RLHF} = \mathbb{E}\left[r_\phi(x, y) - \beta D_{KL}(\pi_\theta \| \pi_{ref})\right]
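
In practice the KL term is often estimated from sampled responses. The sketch below assumes a single-sample estimate, $\log \pi_\theta(y|x) - \log \pi_{ref}(y|x)$, for one response. Both that estimator and the name `kl_penalized_reward` are illustrative assumptions rather than part of the formula above.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """reward - beta * (log pi_theta(y|x) - log pi_ref(y|x)) for one response."""
    kl_estimate = logp_policy - logp_ref  # single-sample KL estimate
    return reward - beta * kl_estimate


# The policy assigns this response a higher log-probability than the
# reference does, so part of the reward is taken away: 1.0 - 0.1 * 1.0
shaped = kl_penalized_reward(reward=1.0, logp_policy=-2.0, logp_ref=-3.0, beta=0.1)
```

A larger `beta` keeps the policy closer to the reference model, which limits reward hacking at the cost of smaller improvements in reward.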

Challenges in RL

Challenge	Description	Solution
Sample inefficiency	Needs millions of environment interactions	Model-based RL, offline RL
Reward design	Sparse or misspecified rewards	Reward shaping, curriculum learning, curiosity
Exploration	Stuck in local optima	ε-greedy, entropy bonus
Non-stationarity	Data distribution shifts as the policy or environment changes	Target networks, replay buffers, clipped updates

Follow-Up Questions

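Q: What is the difference between on-policy and off-policy learning? A: On-policy methods learn from data collected by the current policy (REINFORCE, PPO). Off-policy methods learn about a target policy from data generated by a different behavior policy (Q-Learning, DQN). Off-policy methods can reuse old data, so they are usually more sample-efficient, but combining them with bootstrapping and function approximation can make training unstable.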

Q: How does RLHF differ from standard RL? A: RLHF learns a reward model from human preferences rather than using a handcrafted reward function. This enables alignment with complex human values.

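Q: What is the role of entropy in RL? A: Entropy measures how spread out the policy's action distribution is. Adding an entropy bonus to the objective (or reward), scaled by a small coefficient, discourages the policy from becoming deterministic too early and so encourages exploration.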
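
For a discrete policy the entropy is $H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \log \pi(a|s)$. The sketch below computes it for two extreme cases. The function name `entropy` is hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = entropy([0.25, 0.25, 0.25, 0.25])   # log(4), the maximum for 4 actions
greedy = entropy([1.0, 0.0, 0.0, 0.0])        # 0, a deterministic policy
```

An entropy bonus therefore rewards policies that sit closer to the uniform case, and its coefficient controls how strongly the agent is pushed to keep exploring.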