Reinforcement Learning - Teaching Agents to Make Decisions

Learn how agents discover optimal strategies through trial and error in interactive environments.

Reward-based learning - optimize cumulative rewards over time
Exploration vs exploitation - balance trying new actions and using known good ones
Policy optimization - learn the best action for each state

The best way to predict the future is to invent it.

Reinforcement Learning — Complete Guide

Reinforcement learning trains agents to make decisions by maximizing cumulative reward through trial and error.

RL Framework

Definition: RL Framework

The agent interacts with the environment:

$$\text{State (S)} \rightarrow \text{Action (A)} \rightarrow \text{Reward (R)} \rightarrow \text{New State (S')}$$

Goal: Learn a policy $\pi$ that maximizes cumulative reward.

Key concepts:

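State: The current situation the agent observes
Action: What the agent can do
Reward: Feedback signal from the environment
Policy: Strategy mapping states to actions (state -> action)
Value function: Expected cumulative reward from a state when following the policy
Q-value: Expected cumulative reward from taking a given action in a given state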

Agent-Environment Interaction Loop

How the RL interaction loop works: This diagram shows the fundamental cycle of reinforcement learning. The Agent (blue, left) maintains a policy π(a|s) — a strategy mapping states to actions — along with value functions V(s) and Q(s,a) that estimate expected rewards. At each time step, the agent observes the current state s_t, chooses an action a_t based on its policy, and sends it to the Environment (green, right). The environment responds with a new state s_{t+1} (based on transition dynamics P(s'|s,a)) and a reward signal r_t (feedback on how good the action was). The reward flows back to the agent (golden dashed lines) to update its policy. The agent's goal: learn a policy that maximizes cumulative reward over time. The bottom text summarizes the loop: observe → act → receive reward → observe new state → repeat. This trial-and-error process is how RL agents learn to play games, control robots, and make decisions.
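The observe, act, receive reward, repeat cycle described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `CoinFlipEnv`, its `reset`/`step` methods, and the `always_heads` policy are hypothetical names invented for this example.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.7, horizon=10, seed=0):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the current time step

    def step(self, action):
        # action: 0 = guess tails, 1 = guess heads
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done

def run_episode(env, policy):
    """Run one episode: observe state, act, receive reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # agent chooses a_t from s_t
        state, reward, done = env.step(action)  # environment returns s_{t+1}, r_t
        total_reward += reward
    return total_reward

def always_heads(state):
    return 1

episode_return = run_episode(CoinFlipEnv(), always_heads)
print(episode_return)
```

A learning agent would replace `always_heads` with a policy that is updated from the rewards it receives.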

MDP (Markov Decision Process)

A formal framework for RL:

$$\text{MDP} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

Where:

$\mathcal{S}$: State space
$\mathcal{A}$: Action space
$P(s'|s,a)$: Transition probability
$R(s,a)$: Reward function
$\gamma \in [0,1)$: Discount factor
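One way to make the tuple concrete is to write a small MDP as plain data. The sketch below is an assumption-laden illustration: the state and action names, the rewards, and the 0.2 "stay in place" outcome are invented, and `sample_transition` and `check_probabilities` are hypothetical helpers.

```python
import random

# transitions[state][action] -> list of (probability, next_state, reward)
# Together these entries encode both P(s'|s,a) and R(s,a).
transitions = {
    "s1": {"a1": [(0.8, "s2", 1.0), (0.2, "s1", 0.0)]},
    "s2": {"a2": [(1.0, "terminal", 5.0)]},
    "terminal": {},  # no actions available: the episode ends here
}
gamma = 0.9  # discount factor, must lie in [0, 1)

def sample_transition(state, action, rng):
    """Draw (next_state, reward) according to P(s'|s,a)."""
    outcomes = transitions[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

def check_probabilities():
    """Each (state, action) pair must have outgoing probabilities summing to 1."""
    for state, actions in transitions.items():
        for action, outcomes in actions.items():
            total = sum(p for p, _, _ in outcomes)
            assert abs(total - 1.0) < 1e-9, (state, action)
    return True

rng = random.Random(0)
print(check_probabilities())
print(sample_transition("s2", "a2", rng))
```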

MDP State Transition Diagram

How an MDP models decision-making: This state transition diagram shows the formal framework underlying RL. Each circle represents a state (s₁ through s₄), a specific situation the agent can be in. The arrows represent actions (a₁ through a₄) that transition between states, each labeled with its transition probability P and reward R. For example, from state s₁, taking action a₁ leads to s₂ with probability 0.8 and reward +1. The agent's goal is to find the optimal policy, that is, which action to take in each state to maximize cumulative reward. The terminal state (red) indicates where episodes end. The Markov property means the next state and reward depend only on the current state and action, not on the earlier history. The discount factor γ controls how much the agent values future rewards relative to immediate ones: with γ=0 only immediate rewards matter, and as γ→1 the agent plans far ahead.
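The effect of the discount factor is easy to check numerically. The sketch below computes the discounted sum of a made-up reward sequence for two values of γ; `discounted_return` and the reward values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... by folding from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A small immediate reward followed by a large delayed one (made-up numbers).
rewards = [1.0, 0.0, 0.0, 10.0]

myopic = discounted_return(rewards, gamma=0.0)      # only the first reward counts
farsighted = discounted_return(rewards, gamma=0.9)  # the delayed reward dominates
print(myopic, farsighted)
```

With γ=0 the return equals the first reward alone, while with γ=0.9 the delayed reward of 10 still contributes 0.9³ · 10 = 7.29.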

Q-Learning

Q-Learning Update Rule

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Here,

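$\alpha$ = Learning rate
$\gamma$ = Discount factor (future vs immediate reward)
$Q(s,a)$ = Q-value for state $s$ and action $a$
$r$ = Reward received after taking action $a$ in state $s$
$s'$ = Next state; $\max_{a'}$ is taken over the actions available in $s'$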
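As a worked example of a single update, assume (purely for illustration) $Q(s,a) = 0.5$, $\alpha = 0.1$, $\gamma = 0.9$, $r = 1$, and $\max_{a'} Q(s',a') = 2$:

$$r + \gamma \max_{a'} Q(s',a') = 1 + 0.9 \cdot 2 = 2.8$$

$$Q(s,a) \leftarrow 0.5 + 0.1 \left[ 2.8 - 0.5 \right] = 0.5 + 0.23 = 0.73$$

The estimate moves 10% of the way from its old value toward the target of 2.8.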

Definition: Q-Learning Algorithm

Initialize Q-table with zeros
Observe state $s$
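Choose action $a$ ($\varepsilon$-greedy: a random action with probability $\varepsilon$, otherwise the action with the highest Q-value)
Take action, observe reward $r$ and new state $s'$
Update: $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
Set $s \leftarrow s'$ and repeat from the action-selection step until the episode ends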
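The steps above can be sketched as tabular Q-learning on a tiny problem. Everything specific here is an assumption made for illustration: the five-state corridor environment, the `step` function, and the hyperparameter values are not from the original text.

```python
import random

N_STATES = 5      # states 0..4; reaching state 4 ends the episode
ACTIONS = [0, 1]  # 0 = move left, 1 = move right

def step(state, action):
    """Hypothetical corridor: reward 1 only when the rightmost state is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def epsilon_greedy(q_row, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])  # exploit, random ties

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]  # Q-table of zeros
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = step(state, action)
            # A terminal state has no future value to bootstrap from.
            future = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q

q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

In this corridor the learned greedy policy should prefer moving right in every non-terminal state, since that is the shortest route to the only reward.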

Q-Learning Convergence Visualization

Deep Q-Network (DQN)

Definition: DQN

Replace the Q-table with a neural network (Q-network):

Input: State
Output: Q-value for each action

Key features:

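Experience replay: Store past transitions in a buffer and train on random samples from it, which breaks the correlation between consecutive steps
Target network: Compute targets with a periodically updated copy of the network to stabilize training
Double DQN (an extension): Reduce overestimation of Q-values by selecting the next action with the online network and evaluating it with the target network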
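Experience replay in particular is simple to sketch. The `ReplayBuffer` class below is a hypothetical minimal version using only the standard library; practical implementations usually store arrays and sample in batches suited to the training framework.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = buffer.sample(batch_size=2)
print(len(buffer), batch)
```

With a capacity of 3, only the three most recent of the five pushed transitions remain available for sampling.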

DQN Architecture

How DQN replaces the Q-table with a neural network: The Q-learning approach above uses a lookup table, which becomes impractical when states are high-dimensional (e.g., game screens with millions of pixel combinations). DQN solves this by using a neural network to approximate Q-values. The input is the state s (e.g., an 84×84×4 image stack: 4 consecutive frames to capture motion). The network has two convolutional layers (32 filters 8×8, then 64 filters 4×4) to extract visual features, a fully connected layer (512 units), and outputs Q(s,a) for each possible action simultaneously. The agent picks the action with the highest Q-value. The loss function at the bottom shows how training works: the target is r + γ·max Q(s',a') computed with the target network θ⁻ (the right-hand side of the Bellman optimality equation), and the online network is trained to predict this target. Because θ⁻ is only updated periodically, the targets stay consistent between updates, which stabilizes training.
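The target and loss computation can be illustrated without a neural network by using plain lists as stand-ins for network outputs. In this sketch, `q_online` and `next_q_target` are made-up numbers playing the role of the online network's and target network's outputs, and `dqn_targets` and `mse_loss` are hypothetical helper names.

```python
def dqn_targets(rewards, dones, next_q_target, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at episode end."""
    targets = []
    for r, done, q_next in zip(rewards, dones, next_q_target):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets

def mse_loss(q_online, actions, targets):
    """Mean squared error between Q_online(s, a) for the taken action and the target."""
    errors = [(q[a] - y) ** 2 for q, a, y in zip(q_online, actions, targets)]
    return sum(errors) / len(errors)

# A batch of two transitions (all values invented for illustration).
rewards = [1.0, 0.0]
dones = [False, True]
next_q_target = [[0.5, 2.0], [3.0, 1.0]]  # target-network outputs for s'
q_online = [[1.0, 0.2], [0.4, 0.0]]       # online-network outputs for s
actions = [0, 1]                          # actions actually taken

targets = dqn_targets(rewards, dones, next_q_target, gamma=0.5)
loss = mse_loss(q_online, actions, targets)
print(targets, loss)
```

In a real DQN these lists would be tensors produced by the two networks, and the loss would be minimized by gradient descent on the online network's parameters only.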

Policy Gradient

Policy Gradient Objective

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \gamma^t r_t\right]$$

Here,

$\theta$ = Policy parameters
$\tau$ = Trajectory sampled by following the policy $\pi_\theta$
$\gamma$ = Discount factor
$r_t$ = Reward at time step $t$

Definition: REINFORCE Algorithm

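Collect a trajectory using the current policy $\pi_\theta$
Compute returns $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ for each time step $t$, where $T$ is the final step of the trajectory
Update: $\theta \leftarrow \theta + \alpha \cdot G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)$ for each time step $t$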
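A minimal sketch of these three steps, assuming a stateless two-action problem with a softmax policy over two logits. The payout probabilities, hyperparameters, and function names are invented for illustration. For a softmax policy, the gradient of $\log \pi(a)$ with respect to logit $i$ is $1 - \pi(i)$ when $i = a$ and $-\pi(i)$ otherwise.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def returns_to_go(rewards, gamma):
    """G_t = sum over k >= t of gamma^(k-t) * r_k, computed backwards."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def reinforce(episodes=3000, horizon=5, alpha=0.05, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]        # policy parameters: one logit per action
    reward_prob = [0.1, 0.9]  # hypothetical: action 1 pays off more often
    for _ in range(episodes):
        # 1. Collect a trajectory with the current policy.
        probs = softmax(theta)
        actions, rewards = [], []
        for _ in range(horizon):
            a = 0 if rng.random() < probs[0] else 1
            r = 1.0 if rng.random() < reward_prob[a] else 0.0
            actions.append(a)
            rewards.append(r)
        # 2. Compute returns, 3. apply the policy-gradient update.
        for a, g in zip(actions, returns_to_go(rewards, gamma)):
            for i in range(len(theta)):
                grad_log = (1.0 if i == a else 0.0) - probs[i]
                theta[i] += alpha * g * grad_log
    return theta

theta = reinforce()
print(softmax(theta))
```

Under these assumptions the policy should shift most of its probability onto action 1, the one with the higher expected reward.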

Advantages:

Can handle continuous action spaces
Learns stochastic policies
Works with high-dimensional states

Actor-Critic

Definition: Actor-Critic

Combines value-based and policy-based methods:

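Actor: Learns the policy $\pi(a|s)$, i.e. what to do
Critic: Learns the value $V(s)$, i.e. how good a state is

A2C (Advantage Actor-Critic): The actor is updated to favor actions with positive advantage, while the critic estimates the state value that serves as the baseline. Advantage = actual return - baseline, i.e. $A(s_t, a_t) = G_t - V(s_t)$.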
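A single actor-critic update for a one-state problem can be sketched as follows. The softmax actor, the scalar critic `value`, the learning rates, and the name `actor_critic_step` are assumptions for illustration; a full A2C implementation would use neural networks and batches of transitions.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, value, action, ret, actor_lr=0.1, critic_lr=0.5):
    """One update for a single-state problem; theta holds one logit per action."""
    advantage = ret - value  # actual return - baseline
    probs = softmax(theta)
    # Actor: raise the logit of the taken action if the advantage is positive.
    new_theta = [
        t + actor_lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (t, p) in enumerate(zip(theta, probs))
    ]
    # Critic: move the value estimate toward the observed return.
    new_value = value + critic_lr * advantage
    return new_theta, new_value, advantage

theta, value, adv = actor_critic_step([0.0, 0.0], value=0.5, action=1, ret=1.0)
print(theta, value, adv)
```

Here the return of 1.0 exceeds the baseline of 0.5, so the advantage is positive: the logit for action 1 rises, the logit for action 0 falls, and the critic's estimate moves up toward the return.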

Actor-Critic Architecture

Key Takeaways

Summary: Reinforcement Learning

RL trains agents through trial and error
Q-learning learns action values
DQN scales Q-learning with neural networks
Policy gradients directly optimize the policy
Actor-critic combines both approaches
Exploration vs exploitation is the key tradeoff
RL requires careful reward design
Sim-to-real transfer is a key challenge when applying RL to robotics

What to Learn Next

-> Neural Networks: Deep RL combines neural nets with RL.

-> Model Evaluation: Measure and compare model performance.

-> Time Series: Apply RL to sequential decision making.

-> Causal Inference: Understand cause and effect in data.

-> ML Ethics: Consider responsible AI development.

-> ML System Design: Build end-to-end ML systems.
