
Reinforcement Learning — Complete Guide


Reinforcement Learning - Teaching Agents to Make Decisions

See how agents learn optimal strategies through trial and error in interactive environments.

  • Reward-based learning - optimize cumulative rewards over time
  • Exploration vs exploitation - balance trying new actions and using known good ones
  • Policy optimization - learn the best action for each state

The best way to predict the future is to invent it.


Reinforcement learning trains agents to make decisions by maximizing cumulative reward through trial and error.


RL Framework


Agent interacts with Environment:

$$\text{State } (S) \rightarrow \text{Action } (A) \rightarrow \text{Reward } (R) \rightarrow \text{New State } (S')$$

Goal: Learn a policy $\pi$ that maximizes cumulative reward.

Key concepts:

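  • State: The current situation the agent observes
  • Action: A choice available to the agent in a state
  • Reward: Scalar feedback signal received after an action
  • Policy: Strategy mapping states to actions (state -> action)
  • Value function V(s): Expected cumulative (discounted) reward starting from state s and following the policy
  • Q-value Q(s,a): Expected cumulative (discounted) reward after taking action a in state s and following the policy afterwards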

Agent-Environment Interaction Loop

[Figure: Agent-Environment Interaction Loop. The agent (policy π(a|s), value V(s), Q(s,a)) sends action a_t to the environment (state transitions, reward function, dynamics P(s'|s,a)), which returns state s_{t+1} and reward r_t. At each time step t: observe s_t → take a_t → receive r_t → observe s_{t+1}.]

How the RL interaction loop works: This diagram shows the fundamental cycle of reinforcement learning. The Agent (blue, left) maintains a policy π(a|s) — a strategy mapping states to actions — along with value functions V(s) and Q(s,a) that estimate expected rewards. At each time step, the agent observes the current state s_t, chooses an action a_t based on its policy, and sends it to the Environment (green, right). The environment responds with a new state s_{t+1} (based on transition dynamics P(s'|s,a)) and a reward signal r_t (feedback on how good the action was). The reward flows back to the agent (golden dashed lines) to update its policy. The agent's goal: learn a policy that maximizes cumulative reward over time. The bottom text summarizes the loop: observe → act → receive reward → observe new state → repeat. This trial-and-error process is how RL agents learn to play games, control robots, and make decisions.
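The observe, act, receive reward, observe cycle described above can be sketched in a few lines of Python. Everything here is a hypothetical illustration rather than part of the lesson: `CoinFlipEnv`, its `reset`/`step` methods, and `random_policy` are invented names, and the environment (guessing a biased coin) is chosen only because it is small.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin.

    The state is the previous coin outcome (0 or 1). The reward is 1.0 for a
    correct guess and 0.0 otherwise. An episode lasts `horizon` steps.
    """

    def __init__(self, horizon=10, p_heads=0.7, seed=0):
        self.horizon = horizon
        self.p_heads = p_heads
        self.rng = random.Random(seed)
        self.t = 0
        self.state = 0

    def reset(self):
        self.t = 0
        self.state = 0
        return self.state

    def step(self, action):
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        self.state = outcome              # s_{t+1}
        done = self.t >= self.horizon
        return self.state, reward, done


def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice([0, 1])


def run_episode(env, policy, seed=0):
    rng = random.Random(seed)
    state = env.reset()                   # observe s_0
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = policy(state, rng)       # choose a_t from the policy
        state, reward, done = env.step(action)  # receive r_t, observe s_{t+1}
        total_reward += reward
        steps += 1
    return total_reward, steps


total, steps = run_episode(CoinFlipEnv(horizon=10), random_policy)
print(f"episode length={steps}, cumulative reward={total}")
```

A learning agent would replace `random_policy` with a policy that it updates from the observed rewards. The loop itself stays the same.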


MDP (Markov Decision Process)

A formal framework for RL:

$$\text{MDP} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

Where:

  • $\mathcal{S}$: State space
  • $\mathcal{A}$: Action space
  • $P(s'|s,a)$: Transition probability
  • $R(s,a)$: Reward function
  • $\gamma \in [0,1)$: Discount factor
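One way to make the five-element tuple concrete is to store it as a small data structure. The sketch below is a minimal illustration under assumptions: the `MDP` class, its field layout, and the two-state example are hypothetical and not taken from the lesson.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list        # state space S
    actions: list       # action space A
    transitions: dict   # P: (s, a) -> list of (next_state, probability)
    rewards: dict       # R: (s, a) -> float
    gamma: float        # discount factor in [0, 1)

    def validate(self):
        """Check that gamma is in range and each P(.|s,a) sums to 1."""
        if not 0.0 <= self.gamma < 1.0:
            raise ValueError("gamma must be in [0, 1)")
        for key, outcomes in self.transitions.items():
            total = sum(p for _, p in outcomes)
            if abs(total - 1.0) > 1e-9:
                raise ValueError(f"probabilities for {key} sum to {total}")

    def sample_step(self, s, a, rng):
        """Sample s' ~ P(.|s,a) and return (s', R(s,a))."""
        next_states, probs = zip(*self.transitions[(s, a)])
        s_next = rng.choices(next_states, weights=probs, k=1)[0]
        return s_next, self.rewards[(s, a)]


# Hypothetical two-state example (all numbers are made up for illustration).
toy = MDP(
    states=["s1", "s2"],
    actions=["stay", "go"],
    transitions={
        ("s1", "stay"): [("s1", 1.0)],
        ("s1", "go"): [("s2", 0.8), ("s1", 0.2)],
        ("s2", "stay"): [("s2", 1.0)],
        ("s2", "go"): [("s1", 1.0)],
    },
    rewards={("s1", "stay"): 0.0, ("s1", "go"): 1.0,
             ("s2", "stay"): 0.5, ("s2", "go"): 0.0},
    gamma=0.9,
)
toy.validate()
print(toy.sample_step("s1", "go", random.Random(0)))
```

Storing `P` as explicit outcome lists keeps the check that probabilities sum to 1 trivial. For large state spaces a sampling function is usually more practical than a table.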

MDP State Transition Diagram

[Figure: Markov Decision Process (MDP). Four states s₁ to s₄, with one state marked Terminal. Action labels: a₁ (P=0.8, R=+1), a₂ (P=0.6, R=+2), a₃ (P=0.9, R=+3), a₄ (P=0.7, R=+1).]

How an MDP models decision-making: This state transition diagram shows the formal framework underlying RL. Each circle represents a state (s₁ through s₄), a specific situation the agent can be in. The arrows represent actions (a₁ through a₄) that transition between states, each labeled with its transition probability P and reward R. For example, from state s₁, taking action a₁ leads to s₂ with probability 0.8 and reward +1. The agent's goal is to find the optimal policy, that is, which action to take in each state to maximize cumulative reward. The terminal state (red) indicates where episodes end. The Markov property means the next state and reward depend only on the current state and the chosen action, not on the earlier history. The discount factor γ controls how much the agent values future rewards relative to immediate ones: γ=0 means only immediate rewards matter, while γ close to 1 means the agent plans far ahead.
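The effect of the discount factor can be checked numerically. The sketch below computes the discounted sum of a reward sequence for a few values of γ. The function name `discounted_return` is hypothetical, and the reward sequence is an assumption that reuses the reward labels +1, +2, +3, +1 purely for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * rewards[t], accumulated from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 2.0, 3.0, 1.0]  # hypothetical reward sequence

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 0.5))  # 2.875 = 1 + 0.5*2 + 0.25*3 + 0.125*1
print(discounted_return(rewards, 0.9))  # close to the undiscounted sum of 7
```

With γ=0 the later rewards vanish entirely. As γ approaches 1 the return approaches the plain sum of rewards, which is what "planning far ahead" means in practice.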


Q-Learning

Q-Learning Update Rule

$$Q(s,a) = Q(s,a) + \alpha \left[ r + \gamma \cdot \max_{a'} Q(s',a') - Q(s,a) \right]$$

Here,

  • $\alpha$ = Learning rate
  • $\gamma$ = Discount factor (future vs immediate reward)
  • $Q(s,a)$ = Q-value for state $s$ and action $a$
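As a worked example of a single update, take some illustrative numbers. They are assumptions, not values from the lesson: $Q(s,a) = 0.5$, $\alpha = 0.1$, $r = 1$, $\gamma = 0.9$, and $\max_{a'} Q(s',a') = 2.0$.

```latex
\begin{aligned}
\text{TD target} &= r + \gamma \cdot \max_{a'} Q(s',a') = 1 + 0.9 \cdot 2.0 = 2.8 \\
\text{TD error}  &= 2.8 - Q(s,a) = 2.8 - 0.5 = 2.3 \\
Q(s,a)           &= 0.5 + 0.1 \cdot 2.3 = 0.73
\end{aligned}
```

The estimate moves a fraction $\alpha$ of the way from its old value toward the target, so a smaller learning rate gives slower but smoother updates.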

Q-Learning Algorithm

  1. Initialize Q-table with zeros
  2. Observe state $s$
  3. Choose action $a$ ($\varepsilon$-greedy)
  4. Take action, observe reward $r$ and new state $s'$
  5. Update: $Q(s,a) \mathrel{+}= \alpha \left[ r + \gamma \cdot \max_{a'} Q(s',a') - Q(s,a) \right]$
  6. Repeat from step 2 with $s'$ as the new current state
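The steps above can be sketched as tabular Q-learning on a tiny problem. The environment is a hypothetical five-state corridor invented for this example: the agent starts at state 0, moves left or right, and receives reward +1 on reaching the rightmost state, which ends the episode. The hyperparameters are likewise illustrative assumptions.

```python
import random

N_STATES = 5
ACTIONS = [0, 1]            # 0 = left, 1 = right
GOAL = N_STATES - 1         # reaching this state ends the episode


def step(state, action):
    """Deterministic corridor dynamics; reward 1.0 only on reaching GOAL."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def epsilon_greedy(q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily (random ties)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]   # step 1
    for _ in range(episodes):
        state, done = 0, False                              # step 2
        while not done:
            action = epsilon_greedy(q, state, epsilon, rng)  # step 3
            next_state, reward, done = step(state, action)   # step 4
            # Step 5: no bootstrapping from a terminal state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state                               # step 6
    return q


q = q_learning()
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy)  # greedy action per non-terminal state
```

In this corridor the learned greedy action is "right" in every non-terminal state, and the Q-values for moving right approach 0.9 raised to the number of remaining steps minus one.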

Q-Learning Convergence Visualization

[Figure: Q-Learning Convergence Over Episodes. x-axis: Episodes; y-axis: Average Reward. Exploration phase: ε-greedy with high ε, random actions. Convergence phase: ε decays to 0 and Q-values converge to Q* (the optimal Q-values).]

Deep Q-Network (DQN)

DQN

Replace the Q-table with a neural network (Q-network):

  • Input: State
  • Output: Q-value for each action

Key features:

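  • Experience replay: Store transitions in a buffer and sample random minibatches from it for training
  • Target network: A periodically updated copy of the Q-network that provides stable training targets
  • Double DQN (an extension): Reduce overestimation of Q-values by selecting the next action with the online network and evaluating it with the target network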
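Experience replay and the periodic target-network update are mostly bookkeeping, and a library-free sketch shows the idea. The names `ReplayBuffer` and `maybe_sync_target` are hypothetical, the sync interval is an assumption, and a plain dict of numbers stands in for real network parameters.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniform sample without replacement, which breaks up consecutive steps."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def maybe_sync_target(step, online_params, target_params, sync_every=1000):
    """Copy online parameters into the target every `sync_every` steps.

    Both arguments are dicts standing in for network weights (an assumption;
    a real implementation would copy tensors).
    """
    if step % sync_every == 0:
        target_params.clear()
        target_params.update(online_params)
        return True
    return False


buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
print(len(buf))        # 3: only the most recent transitions are kept
print(buf.sample(2))   # two randomly chosen stored transitions
```

Sampling at random from the buffer means a minibatch mixes transitions from different parts of training, which is the reason the lesson lists replay as a stabilizing feature.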

DQN Architecture

[Figure: Deep Q-Network (DQN) Architecture. Input state s [84×84×4] → Conv (32 filters, 8×8, stride 4, ReLU) → Conv (64 filters, 4×4, stride 2, ReLU) → FC (512 units, ReLU) → Output Q(s, a₁), Q(s, a₂), Q(s, a₃), Q(s, a₄). DQN loss: L = E[(r + γ·max Q(s',a';θ⁻) - Q(s,a;θ))²], where θ⁻ are the target network parameters (updated periodically).]

How DQN replaces the Q-table with a neural network: The Q-learning approach above uses a lookup table, which becomes impractical when states are high-dimensional (e.g., game screens with far too many pixel combinations to enumerate). DQN addresses this by using a neural network to approximate Q-values. The input is the state s (e.g., an 84×84×4 image stack of 4 consecutive frames, which captures motion). The network has two convolutional layers (32 filters of 8×8, then 64 filters of 4×4) to extract visual features, followed by a fully connected layer (512 units), and it outputs Q(s,a) for every possible action in a single forward pass. The agent picks the action with the highest Q-value. The loss function shows how training works: the target is r + γ·max Q(s',a';θ⁻), taken from the Bellman optimality equation, and the network is trained to minimize the squared difference between its prediction Q(s,a;θ) and this target. The target network θ⁻ is a copy of the Q-network that is updated only periodically, which stabilizes training by keeping the targets consistent between updates.
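The loss described above can be computed by hand for a small batch. The sketch below uses plain Python lists in place of network outputs. The function names are hypothetical, and zeroing the bootstrap term for terminal transitions is a common convention assumed here, not something the lesson states.

```python
def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """TD targets r + gamma * max_a' Q(s', a'; theta^-), one per transition.

    `next_q_target[i]` holds the target network's Q-values for every action
    in the next state of transition i.
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_target, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


def dqn_loss(q_online, actions, targets):
    """Mean squared error between Q(s, a; theta) and the TD targets."""
    errors = [(t - q[a]) ** 2 for q, a, t in zip(q_online, actions, targets)]
    return sum(errors) / len(errors)


# Two hypothetical transitions; the second one ends its episode.
targets = dqn_targets(
    rewards=[1.0, 0.0],
    next_q_target=[[0.5, 2.0], [1.0, 3.0]],
    dones=[False, True],
    gamma=0.5,
)
print(targets)  # [2.0, 0.0]

loss = dqn_loss(q_online=[[1.0, 0.0], [0.0, 1.0]], actions=[0, 1], targets=targets)
print(loss)     # 1.0
```

Only the Q-value of the action actually taken enters the loss, even though the network outputs a value for every action.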


Policy Gradient

Policy Gradient Objective

$$J(\theta) = \mathbb{E}\left[\sum_t \gamma^t r_t\right]$$

Here,

  • $\theta$ = Policy parameters
  • $\gamma$ = Discount factor
  • $r_t$ = Reward at time step $t$

REINFORCE Algorithm

  1. Collect trajectory using current policy
  2. Compute returns $G_t = \sum_{k \ge t} \gamma^{k-t} r_k$ (the discounted reward from step $t$ onward)
  3. Update: $\theta = \theta + \alpha \cdot G_t \cdot \nabla_\theta \log \pi(a_t|s_t)$ for each time step $t$
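The three steps above can be sketched for the simplest possible case: a softmax policy over two actions in a hypothetical two-armed bandit, where each episode is a single step. The payout probabilities, learning rate, and function names are assumptions made for illustration. For a softmax policy the gradient of log π(a) with respect to the logits is the one-hot vector for a minus the action probabilities.

```python
import math
import random


def returns_to_go(rewards, gamma):
    """G_t = sum over k >= t of gamma^(k-t) * r_k, for every t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_bandit(steps=2000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]                    # policy parameters (logits)
    win_prob = [0.2, 0.8]                 # assumed payout probability per arm
    for _ in range(steps):
        # 1. Collect a (one-step) trajectory with the current policy.
        action = rng.choices([0, 1], weights=softmax(theta), k=1)[0]
        reward = 1.0 if rng.random() < win_prob[action] else 0.0
        # 2. Compute the return for the only time step.
        g = returns_to_go([reward], gamma)[0]
        # 3. Gradient ascent on G_t * grad log pi(a_t | s_t).
        grad = grad_log_softmax(theta, action)
        theta = [t + alpha * g * d for t, d in zip(theta, grad)]
    return softmax(theta)


print(returns_to_go([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
print(reinforce_bandit())  # probability mass shifts toward the better arm
```

This version uses no baseline, so the updates are noisy. Subtracting a baseline from the return is the usual remedy and leads naturally to actor-critic methods.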

Advantages:

  • Can handle continuous action spaces
  • Learns stochastic policies
  • Works with high-dimensional states

Actor-Critic

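Actor-Critic

Combines value-based and policy-based methods:

  • Actor: Learns policy $\pi(a|s)$ (what to do)
  • Critic: Learns value $V(s)$ (how good a state is)

A2C (Advantage Actor-Critic): The actor increases the probability of actions with positive advantage, and the critic estimates the state value used as a baseline. Advantage = actual return - baseline; with a one-step estimate, $A(s_t,a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$.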

Actor-Critic Architecture

[Figure: Actor-Critic Architecture. State s_t → shared feature extractor → two heads: Actor (policy π(a_t | s_t; θ), outputs action a_t) and Critic (value V(s_t; φ), outputs a value estimate). Advantage: A(s_t,a_t) = r_t + γV(s_{t+1}) - V(s_t).]
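A single actor-critic update can be written out in tabular form to show how the advantage drives both parts. This is a minimal sketch under assumptions: the function `actor_critic_step` is hypothetical, the actor is a softmax over per-state action preferences, and the critic is a table of state values rather than a neural network with a shared feature extractor.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(prefs, values, s, a, r, s_next, done,
                      gamma=0.9, actor_lr=0.1, critic_lr=0.1):
    """Apply one update in place and return the one-step advantage.

    prefs:  dict state -> list of action preferences (actor parameters)
    values: dict state -> estimated V(state) (critic parameters)
    """
    # One-step advantage: A = r + gamma * V(s') - V(s), with no bootstrap at the end.
    bootstrap = 0.0 if done else gamma * values[s_next]
    advantage = r + bootstrap - values[s]

    # Critic: move V(s) toward the bootstrapped target.
    values[s] += critic_lr * advantage

    # Actor: gradient of log softmax is one_hot(a) - pi(.|s), scaled by A.
    probs = softmax(prefs[s])
    for i in range(len(prefs[s])):
        indicator = 1.0 if i == a else 0.0
        prefs[s][i] += actor_lr * advantage * (indicator - probs[i])
    return advantage


# Hypothetical numbers: two states, two actions.
prefs = {0: [0.0, 0.0], 1: [0.0, 0.0]}
values = {0: 0.0, 1: 2.0}
adv = actor_critic_step(prefs, values, s=0, a=1, r=1.0, s_next=1,
                        done=False, gamma=0.5)
print(adv)        # 2.0 = 1.0 + 0.5 * 2.0 - 0.0
print(values[0])  # the critic's estimate for state 0 moves up
print(prefs[0])   # preference for action 1 rises, action 0 falls
```

A positive advantage means the action turned out better than the critic expected, so its probability is increased. A negative advantage pushes the probability down.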

Key Takeaways

Summary: Reinforcement Learning

  • RL trains agents through trial and error
  • Q-learning learns action values
  • DQN scales Q-learning with neural networks
  • Policy gradients directly optimize the policy
  • Actor-critic combines both approaches
  • Exploration vs exploitation is the key tradeoff
  • RL requires careful reward design
  • Sim-to-real transfer matters when applying RL to robotics

What to Learn Next

-> Neural Networks Deep RL combines neural nets with RL.

-> Model Evaluation Measure and compare model performance.

-> Time Series Apply RL to sequential decision making.

-> Causal Inference Understand cause and effect in data.

-> ML Ethics Consider responsible AI development.

-> ML System Design Build end-to-end ML systems.
