Reinforcement Learning - Teaching Agents to Make Decisions
Learn how agents learn optimal strategies through trial and error in interactive environments.
- Reward-based learning - optimize cumulative rewards over time
- Exploration vs exploitation - balance trying new actions and using known good ones
- Policy optimization - learn the best action for each state
The best way to predict the future is to invent it.
Reinforcement Learning — Complete Guide
Reinforcement learning trains agents to make decisions by maximizing cumulative reward through trial and error.
RL Framework
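Definition: RL Framework
The agent interacts with the environment: at each time step t it observes state s_t, takes action a_t, and receives reward r_t and the next state s_{t+1}.
Goal: Learn a policy that maximizes cumulative reward.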
Key concepts:
- State: Current situation of the agent
- Action: What the agent can do
- Reward: Feedback signal from the environment
- Policy: Strategy (state -> action)
- Value function: Expected cumulative reward from a state, V(s)
- Q-value: Expected cumulative reward from taking action a in state s, Q(s,a)
Agent-Environment Interaction Loop
How the RL interaction loop works: This diagram shows the fundamental cycle of reinforcement learning. The Agent (blue, left) maintains a policy π(a|s) — a strategy mapping states to actions — along with value functions V(s) and Q(s,a) that estimate expected rewards. At each time step, the agent observes the current state s_t, chooses an action a_t based on its policy, and sends it to the Environment (green, right). The environment responds with a new state s_{t+1} (based on transition dynamics P(s'|s,a)) and a reward signal r_t (feedback on how good the action was). The reward flows back to the agent (golden dashed lines) to update its policy. The agent's goal: learn a policy that maximizes cumulative reward over time. The bottom text summarizes the loop: observe → act → receive reward → observe new state → repeat. This trial-and-error process is how RL agents learn to play games, control robots, and make decisions.
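The observe → act → receive reward → observe new state cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `LineWorld`, its `reset`/`step` methods, and the two policies are hypothetical names invented for this example, not part of any library.

```python
import random


class LineWorld:
    """Hypothetical toy environment: positions 0..4, goal at position 4."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is 0 (move left) or 1 (move right); walls clamp the position
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.size - 1)
        done = self.state == self.size - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode: observe, act, receive reward, observe new state, repeat."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state)                   # agent chooses a_t from s_t
        state, reward, done = env.step(action)   # environment returns s_{t+1}, r_t
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


rng = random.Random(0)
random_policy = lambda state: rng.choice([0, 1])   # pure exploration
always_right = lambda state: 1                     # a fixed, hand-written policy

print(run_episode(LineWorld(), always_right))      # reaches the goal in 4 steps
print(run_episode(LineWorld(), random_policy))
```

Neither policy learns anything here; the sketch only shows the data that flows between agent and environment on each step.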
MDP (Markov Decision Process)
A formal framework for RL, defined as the tuple:
(S, A, P, R, γ)
Where:
- S: State space
- A: Action space
- P(s'|s,a): Transition probability
- R(s,a): Reward function
- γ: Discount factor, with 0 ≤ γ ≤ 1
MDP State Transition Diagram
How an MDP models decision-making: This state transition diagram shows the formal framework underlying RL. Each circle represents a state (s₁ through s₄) — a specific situation the agent can be in. The arrows represent actions (a₁ through a₄) that transition between states, each labeled with its transition probability P and reward R. For example, from state s₁, taking action a₁ leads to s₂ with probability 0.8 and reward +1. The agent's goal is to find the optimal policy — which action to take in each state — that maximizes cumulative reward. The terminal state (red) indicates where episodes end. The Markov property means transitions depend ONLY on the current state, not the history. The discount factor γ controls how much the agent values future rewards vs immediate ones: γ=0 means only immediate rewards matter, γ→1 means the agent plans far ahead.
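To make the transition probabilities and the discount factor concrete, here is a minimal sketch. Only the s1/a1 → s2 entry (probability 0.8, reward +1) comes from the description above; every other entry in the table, the reward sequence, and the function names are assumptions made for illustration.

```python
import random

# (state, action) -> list of (probability, next_state, reward).
# Only the first outcome of ("s1", "a1") is taken from the text; the rest is made up.
P = {
    ("s1", "a1"): [(0.8, "s2", 1.0), (0.2, "s1", 0.0)],
    ("s2", "a2"): [(1.0, "s4", 5.0)],
}


def sample_transition(state, action, rng):
    """Draw (next_state, reward) from the current state and action only (Markov property)."""
    outcomes = P[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rng = random.Random(0)
print(sample_transition("s1", "a1", rng))

rewards = [1.0, 0.0, 5.0]                 # hypothetical rewards r_0, r_1, r_2
print(discounted_return(rewards, 0.0))    # gamma = 0: only the immediate reward counts
print(discounted_return(rewards, 0.9))    # gamma near 1: later rewards still contribute
```

With γ = 0 the large reward at step 2 is ignored entirely, which is the "only immediate rewards matter" case.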
Q-Learning
Q-Learning Update Rule
Q(s,a) ← Q(s,a) + α·[r + γ·max_a' Q(s',a') − Q(s,a)]
Here,
- α = Learning rate
- γ = Discount factor (future vs immediate reward)
- Q(s,a) = Q-value for state s and action a
Definition: Q-Learning Algorithm
- Initialize Q-table with zeros
- Observe state s
- Choose action a (ε-greedy)
- Take action, observe reward r and new state s'
- Update: Q(s,a) ← Q(s,a) + α·[r + γ·max_a' Q(s',a') − Q(s,a)]
- Repeat
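The steps above can be sketched as tabular Q-learning on a tiny chain of five states. The environment, the hyperparameter values, and the function names are assumptions chosen for this example; skipping the bootstrap term at terminal states is a common convention that the list above does not spell out.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical chain: states 0..4, goal at state 4
ACTIONS = [0, 1]               # 0 = left, 1 = right


def step(state, action):
    """Toy deterministic dynamics: reward 1 only when the goal is reached."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon explore; otherwise exploit (ties broken at random)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]    # Q-table initialized with zeros
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Bootstrapped target; no future value after a terminal state
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy_policy)           # greedy action for each non-terminal state
```

A table works here because there are only five states; the same loop stops being practical once the state space is too large to enumerate.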
Q-Learning Convergence Visualization
Deep Q-Network (DQN)
Definition: DQN
Replace the Q-table with a neural network (Q-network):
- Input: State
- Output: Q-value for each action
Key features:
- Experience replay: Store and sample transitions
- Target network: Stabilize training
- Double DQN (an extension of DQN): Reduces overestimation of Q-values
DQN Architecture
How DQN replaces the Q-table with a neural network: The Q-learning approach above uses a lookup table, which becomes impossible when states are high-dimensional (e.g., game screens with millions of pixel combinations). DQN solves this by using a neural network to approximate Q-values. The input is the state s (e.g., an 84×84×4 image stack — 4 consecutive frames for motion). The network has two convolutional layers (32 filters 8×8, then 64 filters 4×4) to extract visual features, a fully connected layer (512 units), and outputs Q(s,a) for EACH possible action simultaneously. The agent picks the action with highest Q-value. The loss function at the bottom shows how training works: the target is r + γ·max Q(s',a') (the Bellman equation), and the network learns to predict this target. The target network θ⁻ (updated periodically) stabilizes training by providing consistent targets.
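Two of the pieces described above, experience replay and the target r + γ·max Q(s',a'), can be illustrated without a deep learning library. In this sketch `ReplayBuffer`, `td_targets`, and `fake_target_q` are hypothetical names, and `fake_target_q` is a plain function standing in for the target network. Dropping the bootstrap term for terminal transitions is a common convention assumed here, not something stated in the text.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling reduces the correlation between consecutive steps
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma=0.99):
    """Compute r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states."""
    targets = []
    for _, _, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state))
        targets.append(reward + bootstrap)
    return targets


def fake_target_q(state):
    """Stand-in for the target network: maps a state to one Q-value per action."""
    return [0.5 * state, 1.0 * state]


buffer = ReplayBuffer(capacity=3)
for s in range(5):
    buffer.push(s, 0, 1.0, s + 1, s == 4)     # the last transition is terminal

print(len(buffer))                            # capacity caps the buffer size
targets = td_targets(list(buffer.buffer), fake_target_q, gamma=0.9)
print(targets)
```

In a full DQN the Q-network would be trained to match these targets on each sampled batch, and the target network's weights would be refreshed from the Q-network only periodically.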
Policy Gradient
Policy Gradient Objective
J(θ) = E_{τ∼π_θ}[ Σ_t γ^t·r_t ]
Here,
- θ = Policy parameters
- γ = Discount factor
- r_t = Reward at time step t
Definition: REINFORCE Algorithm
- Collect trajectory using current policy π_θ
- Compute returns G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
- Update: θ ← θ + α·Σ_t ∇_θ log π_θ(a_t|s_t)·G_t
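As a rough illustration of these three steps, the sketch below computes discounted returns and then applies the REINFORCE update to a softmax policy. The two-armed bandit, its payouts, the learning rate, and all function names are assumptions made for the example; each episode is a single step, so the return equals the reward.

```python
import math
import random


def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def softmax(prefs):
    m = max(prefs)                              # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(episodes=2000, lr=0.1, seed=0):
    """REINFORCE on a hypothetical 2-armed bandit: arm 1 pays 1.0, arm 0 pays 0.0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]                          # one preference per action
    for _ in range(episodes):
        probs = softmax(theta)
        action = 0 if rng.random() < probs[0] else 1   # sample from the policy
        G = 1.0 if action == 1 else 0.0         # one-step episode: return = reward
        # For a softmax policy, d/d theta_k log pi(a) = 1[k == a] - pi(k)
        for k in range(2):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * G * grad_log_pi    # gradient ascent on expected return
    return softmax(theta)


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))
final_probs = reinforce_bandit()
print(final_probs)                              # probability mass shifts toward arm 1
```

The policy stays stochastic throughout: it never picks an action by argmax, it only raises the probability of actions that were followed by high returns.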
Advantages:
- Can handle continuous action spaces
- Learns stochastic policies
- Works with high-dimensional states
Actor-Critic
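Definition: Actor-Critic
Combines value-based and policy-based methods:
- Actor: Learns the policy π(a|s), i.e. what to do
- Critic: Learns the value function V(s), i.e. how good a state is
A2C (Advantage Actor-Critic): The actor is updated to favor actions with positive advantage, while the critic estimates the state value used as the baseline. Advantage = actual return - baseline, i.e. A(s_t, a_t) = G_t - V(s_t).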
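The "advantage = actual return - baseline" idea can be shown with a short sketch. The reward sequence, the critic's value estimates, the critic learning rate, and the function names are all made-up illustrations, not the exact A2C update rules.

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def advantages(rewards, values, gamma):
    """Advantage = actual return - baseline, using the critic's V(s_t) as baseline."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


def critic_update(values, returns, lr):
    """Move each value estimate a small step toward the observed return."""
    return [v + lr * (g - v) for v, g in zip(values, returns)]


rewards = [1.0, 0.0, 2.0]      # hypothetical rewards along one trajectory
values = [1.0, 1.0, 1.0]       # hypothetical critic estimates V(s_0), V(s_1), V(s_2)

adv = advantages(rewards, values, gamma=0.5)
print(adv)                     # positive where the outcome beat the critic's estimate

new_values = critic_update(values, discounted_returns(rewards, 0.5), lr=0.5)
print(new_values)
```

The actor would weight each log-probability gradient by the corresponding advantage instead of the raw return, which keeps the same update direction on average while typically lowering its variance.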
Actor-Critic Architecture
Key Takeaways
Summary: Reinforcement Learning
- RL trains agents through trial and error
- Q-learning learns action values
- DQN scales Q-learning with neural networks
- Policy gradients directly optimize the policy
- Actor-critic combines both approaches
- Exploration vs exploitation is the key tradeoff
- RL requires careful reward design
- Sim-to-real transfer is a key challenge when applying RL to robotics
What to Learn Next
-> Neural Networks: Deep RL combines neural nets with RL.
-> Model Evaluation: Measure and compare model performance.
-> Time Series: Apply RL to sequential decision making.
-> Causal Inference: Understand cause and effect in data.
-> ML Ethics: Consider responsible AI development.
-> ML System Design: Build end-to-end ML systems.