Reinforcement Learning: Q-Learning, Policy Gradient, PPO — Asked at DeepMind & OpenAI

🎯 The Interview Question

"Explain the difference between value-based and policy-based reinforcement learning methods. How does Q-Learning work, and what is the role of the Bellman equation? What is the policy gradient theorem, and how does PPO improve upon it? What are the challenges of training RL agents, and how does RLHF work?"

This question is crucial for DeepMind (AlphaGo, Atari) and OpenAI (RLHF for ChatGPT, robotics).


📚 Detailed Answer

RL Framework

Markov Decision Process (MDP): $(S, A, P, R, \gamma)$

  • $S$: state space
  • $A$: action space
  • $P(s'|s,a)$: transition probability
  • $R(s,a)$: reward function
  • $\gamma \in [0,1]$: discount factor

Objective: Maximize expected cumulative reward:

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
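For a single finite trajectory, the sum inside the expectation can be computed directly. The sketch below is a minimal illustration; the function name `discounted_return` and the example rewards are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Work backwards so each step applies G = r + gamma * G.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

The objective $J(\pi)$ is the average of this quantity over trajectories sampled from the policy.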

Value-Based Methods

State-Value Function

$$V^\pi(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$

Action-Value Function (Q-Function)

$$Q^\pi(s,a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]$$

Bellman Equation

$$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')$$

$$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s') Q^\pi(s',a')$$
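When the model $P$ and $R$ are known, the Bellman equation can be applied repeatedly as an update until $Q^\pi$ stops changing (iterative policy evaluation). The sketch below does this on a made-up 2-state, 2-action MDP. The tables `P`, `R`, `PI` and the function `evaluate_policy` are assumptions for illustration only.

```python
GAMMA = 0.9
# Hypothetical MDP: P[s][a] maps next state -> probability.
# Action 0 always leads to state 0, action 1 always leads to state 1.
P = {0: {0: {0: 1.0}, 1: {1: 1.0}},
     1: {0: {0: 1.0}, 1: {1: 1.0}}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
# Uniform random policy pi(a|s).
PI = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

def evaluate_policy(P, R, pi, gamma, sweeps=500):
    """Apply the Bellman equation for Q^pi as a fixed-point iteration."""
    q = {s: {a: 0.0 for a in R[s]} for s in R}
    for _ in range(sweeps):
        # V(s') = sum_a' pi(a'|s') Q(s', a')
        v = {s: sum(pi[s][a] * q[s][a] for a in q[s]) for s in q}
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        q = {s: {a: R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a].items())
                 for a in R[s]} for s in R}
    return q

q = evaluate_policy(P, R, PI, GAMMA)
print(q[0][1])  # approximately 7.975
```

Solving the two linear equations by hand gives $V^\pi(0) = 7.25$ and $V^\pi(1) = 7.75$, so $Q^\pi(0,1) = 1 + 0.9 \cdot 7.75 = 7.975$, which matches the iteration.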

Q-Learning Algorithm

Update rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

where $\alpha$ is the learning rate.

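Key insight: Q-Learning is off-policy. Because the target uses $\max_{a'}$ rather than the action the agent actually takes next, it learns the optimal Q-values regardless of the behavior policy, provided every state-action pair keeps being visited.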
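A minimal tabular sketch of the update rule, assuming a made-up 4-state chain environment (the `step` function, the constants, and the hyperparameters are all hypothetical). The behavior policy is uniformly random, yet the greedy policy read off the learned table still heads for the goal, which illustrates the off-policy property.

```python
import random

N_STATES, GOAL = 4, 3          # states 0..3; state 3 is terminal
ACTIONS = (0, 1)               # 0 = left, 1 = right

def step(state, action):
    """Hypothetical chain environment: reward 1 only on reaching GOAL."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)                # behavior policy: uniform random
            s2, r, done = step(s, a)
            # Terminal transitions have no bootstrap term.
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])  # Q-Learning update
            s = s2
    return q

Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy)  # greedy action per non-terminal state
```

In this deterministic chain the values converge to $Q(2,\text{right}) = 1$, $Q(1,\text{right}) = 0.9$, and $Q(0,\text{right}) = 0.81$, each larger than the corresponding "left" value.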

Deep Q-Network (DQN)

DQN approximates the optimal action-value function $Q^*$ with a neural network:

$$Q(s,a; \theta) \approx Q^*(s,a)$$

Key innovations:

  1. Experience replay: Store transitions $(s,a,r,s')$, sample mini-batches
  2. Target network: Separate network for target computation, updated periodically

Loss function:

$$\mathcal{L}(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\right)^2\right]$$

where $\theta^-$ are the target network parameters.
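The loss can be sketched without a deep learning framework by treating the two networks as plain callables that map a state to a list of Q-values. The lookup tables standing in for the networks, and the `done` flag that switches off bootstrapping at terminal states, are assumptions added for this example and are not part of the formula above.

```python
def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over (s, a, r, s_next, done) tuples."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # The target is computed with the frozen parameters theta^-
        # and treated as a constant when differentiating the loss.
        bootstrap = 0.0 if done else gamma * max(q_target(s_next))
        td_error = (r + bootstrap) - q_online(s)[a]
        total += td_error ** 2
    return total / len(batch)

# Hypothetical stand-ins for Q(.; theta) and Q(.; theta^-).
online = {0: [0.0, 0.5], 1: [1.0, 0.0]}
target = {0: [0.0, 0.4], 1: [2.0, 0.0]}
batch = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 0, True)]

loss = dqn_loss(batch, online.__getitem__, target.__getitem__, gamma=0.5)
print(loss)  # ((1 + 0.5*2 - 0.5)^2 + (0 - 1)^2) / 2 = 1.625
```

In a real DQN, `batch` would be sampled from the replay buffer and `target` would be refreshed by copying the online parameters every fixed number of steps.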

Policy-Based Methods

Policy Gradient Theorem

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]$$

Intuition: Increase probability of actions that lead to higher returns.

REINFORCE Algorithm

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from step $t$ onward.
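A minimal sketch of this update, assuming the simplest possible policy: a softmax over logits $\theta$ that ignores the state, for which $\nabla_\theta \log \pi_\theta(a) = \text{onehot}(a) - \pi_\theta$. The function names and the example episode are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def returns_to_go(rewards, gamma):
    """G_t = sum_{k>=t} gamma^(k-t) r_k for every step t."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def reinforce_update(logits, actions, rewards, gamma, lr):
    """Apply the REINFORCE update once per step of one episode."""
    logits = list(logits)
    for a_t, g_t in zip(actions, returns_to_go(rewards, gamma)):
        probs = softmax(logits)
        for i in range(len(logits)):
            # Gradient of log softmax w.r.t. logit i: onehot(a_t)[i] - pi[i]
            grad_log = (1.0 if i == a_t else 0.0) - probs[i]
            logits[i] += lr * grad_log * g_t
    return logits

print(returns_to_go([1.0, 0.0, 2.0], 0.5))   # [1.5, 1.0, 2.0]
print(reinforce_update([0.0, 0.0], [1], [1.0], 1.0, 0.1))
```

After one rewarded step with action 1, the logit of action 1 rises and the logit of action 0 falls by the same amount, so the action becomes more likely.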

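Problem: $G_t$ is a single-sample Monte Carlo estimate of $Q^\pi(s_t,a_t)$, so the gradient estimate has high variance and convergence is slow.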

Variance Reduction: Baseline

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot (Q^\pi(s,a) - b(s))\right]$$

Common baseline: $b(s) = V^\pi(s)$, which turns the weighting term into the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$.
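Subtracting a baseline does not bias the gradient as long as $b(s)$ does not depend on the action. A short derivation, written here for a discrete action space (an assumption made to keep the sums simple):

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\right]
&= b(s) \sum_{a} \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s) \\
&= b(s) \sum_{a} \nabla_\theta \pi_\theta(a|s) \\
&= b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a|s) = b(s)\, \nabla_\theta 1 = 0
\end{aligned}
```

The baseline term therefore leaves the expected gradient unchanged and only affects its variance.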

Proximal Policy Optimization (PPO)

PPO constrains policy updates to prevent catastrophic changes:

Clipped objective:

$$\mathcal{L}^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where:

  • $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio
  • $\hat{A}_t$ is the advantage estimate
  • $\epsilon$ is the clipping parameter (typically 0.2)
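The effect of the clip is easiest to see on individual samples. The sketch below evaluates the clipped objective for given ratios and advantages. The function name `ppo_clip_objective` is an assumption for this example, and a real implementation would compute the ratios from log-probabilities and differentiate through them.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Average of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        total += min(r * adv, clipped * adv)
    return total / len(ratios)

# Positive advantage, ratio above 1 + eps: the gain is capped at 1.2 * A.
print(ppo_clip_objective([1.5], [1.0]))    # 1.2
# Negative advantage, ratio below 1 - eps: the term is held at 0.8 * A.
print(ppo_clip_objective([0.5], [-1.0]))   # -0.8
# Negative advantage, ratio above 1 + eps: no clipping, full penalty applies.
print(ppo_clip_objective([1.5], [-1.0]))   # -1.5
```

In the first two cases the objective is flat in the ratio, so the gradient gives no incentive to move the policy further from $\pi_{\theta_{old}}$. In the third case the unclipped term is kept, so moves that make things worse are still penalized in full.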

Why PPO works:

  • Simple to implement
  • Stable training
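  • Better sample efficiency than vanilla policy gradient, because each batch is reused for several epochs of updates (it is still on-policy, so off-policy methods typically need fewer samples)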
  • Used in RLHF, robotics, game AI

RLHF (Reinforcement Learning from Human Feedback)

RLHF is used to align language models such as ChatGPT and Claude with human preferences. It has three steps:

Step 1: Supervised Fine-Tuning

Fine-tune the pretrained base model on human-written responses.

Step 2: Reward Model Training

Train a reward model $r_\phi$ on human preference comparisons:

$$\mathcal{L}_{RM} = -\mathbb{E}\left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]$$

where $y_w$ is the preferred response, $y_l$ is the dispreferred response, and $\sigma$ is the sigmoid function.
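The reward model loss is a logistic loss on the reward margin between the two responses. A minimal sketch, assuming the scalar rewards for each pair have already been computed (the function name and inputs are hypothetical):

```python
import math

def preference_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_w - r_l)) over preference pairs."""
    total = 0.0
    for r_w, r_l in zip(chosen_rewards, rejected_rewards):
        margin = r_w - r_l
        # -log(sigmoid(m)) = log(1 + exp(-m)), written in a form
        # that does not overflow for large |m|.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)

print(preference_loss([0.0], [0.0]))   # log(2): the model cannot tell the pair apart
print(preference_loss([2.0], [0.0]))   # small loss: preferred response scores higher
print(preference_loss([0.0], [2.0]))   # large loss: ranking is reversed
```

Only the difference between the two rewards matters, so the loss fixes the ranking of responses but not the absolute scale or offset of the reward.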

Step 3: PPO Optimization

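Maximize the learned reward while keeping the policy $\pi_\theta$ close to the SFT model $\pi_{ref}$:

$$\mathcal{L}_{RLHF} = \mathbb{E}\left[r_\phi(x, y) - \beta D_{KL}(\pi_\theta \| \pi_{ref})\right]$$

where $\beta$ controls the strength of the KL penalty. This objective is maximized, not minimized.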
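For a single sampled response, the KL term is commonly estimated from the log-probabilities that the policy and the reference model assign to the sampled tokens. The sketch below assumes those per-token log-probabilities are already available; the function name and the numbers are hypothetical, and real implementations often apply the penalty per token instead of once per sequence.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta):
    """Reward-model score minus beta times a sample-based KL estimate."""
    # Sum over tokens of log pi_theta(y_t | ...) - log pi_ref(y_t | ...).
    kl_estimate = sum(lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

# The policy assigns higher probability to its own sample than the
# reference does, so part of the reward is taken away.
print(kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1))
```

With a log-ratio of 0.5 on each of the two tokens, the estimate is 1.0 and the penalized reward is $1.0 - 0.1 \cdot 1.0 = 0.9$ up to floating-point rounding.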

Challenges in RL

| Challenge | Description | Solution |
|---|---|---|
| Sample inefficiency | Needs millions of samples | Model-based RL, offline RL |
| Reward shaping | Sparse rewards | Curriculum learning, curiosity |
| Exploration | Stuck in local optima | ε-greedy, entropy bonus |
| Non-stationarity | Environment changes | Robust algorithms |
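Of the exploration remedies listed, ε-greedy is the simplest: with probability ε take a random action, otherwise take the action with the highest Q-value. A minimal sketch (the function name and the `rng` parameter are assumptions for this example):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 1 (always greedy)
```

In practice ε is usually decayed over training so that the agent explores early and exploits later.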

Follow-Up Questions

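Q: What is the difference between on-policy and off-policy learning?
A: On-policy methods learn from data collected by the current policy (REINFORCE, PPO). Off-policy methods can learn from data collected by a different behavior policy (Q-Learning, DQN). Off-policy methods are usually more sample-efficient because they can reuse old data, but they can be less stable when combined with function approximation.

Q: How does RLHF differ from standard RL?
A: RLHF learns a reward model from human preferences rather than using a handcrafted reward function. This enables alignment with complex human values that are hard to write down as a reward.

Q: What is the role of entropy in RL?
A: An entropy term encourages exploration by discouraging the policy from becoming deterministic too early. It is typically added as a bonus to the policy objective (as in PPO) or to the reward (as in maximum-entropy methods such as Soft Actor-Critic).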
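To make the entropy answer concrete, the sketch below computes the entropy of a categorical policy over actions. The function name is an assumption for this example.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    # Terms with p = 0 contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

print(entropy([0.5, 0.5]))   # log(2), the maximum for two actions
print(entropy([0.9, 0.1]))   # lower: the policy is nearly deterministic
```

An entropy bonus adds a small multiple of this quantity to the objective, so a policy that collapses onto one action too early pays a cost.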
