Taxonomy of RL Algorithms¶
The reinforcement learning (RL) algorithm landscape is vast. This page provides a structured map of it: knowing where an algorithm sits in the taxonomy helps you choose one for a given problem and anticipate the trade-offs involved.
The Big Picture¶
graph TD
RL[RL Algorithms]
RL --> MF[Model-Free]
RL --> MB[Model-Based]
MF --> PG[Policy Optimization]
MF --> VB[Value-Based]
MF --> AC[Actor-Critic]
PG --> REINFORCE
PG --> TRPO
PG --> PPO
VB --> DQN
VB --> Rainbow
AC --> A2C/A3C
AC --> DDPG
AC --> TD3
AC --> SAC
MB --> Dyna
MB --> MBPO
MB --> Dreamer
MB --> MuZero
RL --> Offline[Offline RL]
Offline --> CQL
Offline --> IQL
Offline --> DT[Decision Transformer]
Key Axes of Classification¶
1. Model-Free vs. Model-Based¶
| | Model-Free | Model-Based |
|---|---|---|
| Learns dynamics? | No | Yes — learns \(\hat{P}(s'\|s,a)\) |
| Sample efficiency | Low (needs many interactions) | High (can plan with learned model) |
| Asymptotic performance | Often better | Can be limited by model accuracy |
| Examples | DQN, PPO, SAC | Dyna, MBPO, Dreamer, MuZero |
Model-free methods learn a policy or value function directly from experience without explicitly modeling environment dynamics.
Model-based methods learn (or are given) a model of the environment and use it for planning or generating synthetic experience.
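To make the distinction concrete, here is a minimal tabular sketch in the spirit of Dyna-Q. The toy chain environment, the `step` function, and all hyperparameters are assumptions made for illustration; none of them come from a library. The direct `q_update` on real transitions is the model-free part, while the learned `model` dictionary and the replayed synthetic transitions are the model-based part.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)   # hypothetical chain: action 1 moves right, 0 moves left
GAMMA, ALPHA = 0.9, 0.5         # illustrative discount factor and step size


def step(s, a):
    """Toy deterministic environment: reward 1 on reaching the right end."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done


def q_update(Q, s, a, r, s2, done):
    """Standard one-step Q-learning backup, applied in place."""
    best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])


def dyna_q(episodes=100, planning_steps=10, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned dynamics: (s, a) -> (r, s2, done)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)               # random behaviour policy
            s2, r, done = step(s, a)
            q_update(Q, s, a, r, s2, done)        # model-free: learn from real experience
            model[(s, a)] = (r, s2, done)         # model-based: remember the dynamics
            for _ in range(planning_steps):       # planning with synthetic experience
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                q_update(Q, ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```

Setting `planning_steps=0` reduces this to plain model-free Q-learning, which makes it easy to compare how many real environment steps each variant needs.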
2. On-Policy vs. Off-Policy¶
| | On-Policy | Off-Policy |
|---|---|---|
| Data source | Current policy \(\pi\) only | Any policy (replay buffer) |
| Sample efficiency | Low (discard data after update) | High (reuse old data) |
| Stability | Generally more stable | Can be unstable (the "deadly triad" of function approximation, bootstrapping, and off-policy data) |
| Examples | REINFORCE, PPO, A2C | DQN, DDPG, TD3, SAC |
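On-policy algorithms use only data collected by the current policy. After each policy update, the old data no longer matches the policy being improved, so it is discarded.
Off-policy algorithms can learn from data collected by any policy, typically stored in a replay buffer. Because each transition can be reused across many updates, they usually need fewer environment interactions than on-policy methods.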
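The difference in data handling can be sketched with two small containers. The class names `ReplayBuffer` and `RolloutBuffer` and their methods are illustrative assumptions, not the API of any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Off-policy storage: keeps transitions from older policies for reuse."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)   # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from everything stored so far.
        return self.rng.sample(list(self.data), min(batch_size, len(self.data)))


class RolloutBuffer:
    """On-policy storage: data is consumed by one update, then discarded."""

    def __init__(self):
        self.data = []

    def add(self, transition):
        self.data.append(transition)

    def drain(self):
        # Hand over the whole batch and start empty for the next policy.
        batch, self.data = self.data, []
        return batch
```

An off-policy learner would call `sample` many times against the same stored transitions, whereas an on-policy learner calls `drain` once per policy update.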
3. Value-Based vs. Policy-Based vs. Actor-Critic¶
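Value-based methods learn \(Q^*(s,a)\) and derive the policy implicitly by acting greedily, \(\pi(s) = \arg\max_a Q^*(s,a)\):
- Works well for discrete action spaces
- Cannot handle continuous actions directly (the argmax over a continuous action set requires solving an optimization problem at every step)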
- Examples: DQN, Rainbow
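For a discrete action space, deriving the policy from the value function is a single argmax over a list of Q-values. The sketch below assumes the Q-values for one state are given as a plain Python list; the helper names are hypothetical.

```python
import random


def greedy_action(q_values):
    """Index of the largest Q-value; ties go to the lowest index."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore uniformly, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)
```

With continuous actions there is no finite list to enumerate, which is why the argmax step does not carry over directly.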
Policy-based methods directly parameterize and optimize the policy \(\pi_\theta(a|s)\):
- Works for both discrete and continuous actions
- Can represent stochastic policies
- Gradient estimates have high variance, particularly when based on Monte Carlo returns
- Examples: REINFORCE, TRPO, PPO
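As a rough illustration of optimizing a parameterized policy directly, the sketch below runs a REINFORCE-style update on a two-armed bandit with a softmax policy. The bandit, its reward means, the learning rate, and the function names are all assumptions chosen for the example; no baseline is used, so the updates are noisy.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(mean_rewards=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)          # policy parameters (action preferences)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(probs)), weights=probs)[0]
        r = rng.gauss(mean_rewards[a], 0.1)    # noisy reward for the chosen arm
        # For a softmax policy, d log pi(a) / d theta_k = 1[k == a] - pi(k).
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * r * grad_log_pi   # gradient ascent on expected reward
    return softmax(theta)
```

The returned probabilities should concentrate on the arm with the higher mean reward as training proceeds.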
Actor-Critic methods combine both:
- Actor: policy \(\pi_\theta(a|s)\) (decides actions)
- Critic: value function \(V_\phi(s)\) or \(Q_\phi(s,a)\) (evaluates actions)
- The critic reduces the variance of policy gradient estimates
- Examples: A2C, DDPG, TD3, SAC
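The interplay between actor and critic can be sketched as a single tabular one-step update, where the critic's TD error stands in for the advantage. The function name, the list-based tables, and the step sizes are assumptions for illustration only.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, V, s, a, r, s2, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """One tabular one-step actor-critic update, applied in place.

    theta[s] holds the action preferences for state s (the actor),
    V[s] holds the state-value estimate (the critic).
    """
    target = r + (0.0 if done else gamma * V[s2])
    delta = target - V[s]                 # TD error, used as the advantage estimate
    probs = softmax(theta[s])             # policy before the update
    V[s] += lr_critic * delta             # critic: move V(s) toward the target
    for k in range(len(theta[s])):        # actor: policy gradient scaled by delta
        grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
        theta[s][k] += lr_actor * delta * grad_log_pi
    return delta
```

Compared with scaling the gradient by the raw return, scaling by `delta` subtracts the critic's estimate of what was expected, which is where the variance reduction comes from.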
4. Stochastic vs. Deterministic Policy¶
Stochastic policies \(\pi_\theta(a|s)\) output a probability distribution over actions:
- Natural exploration through sampling
- Used in: PPO, SAC, A2C
Deterministic policies \(\mu_\theta(s)\) output a single action:
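- Need explicit exploration noise (e.g., Gaussian noise or an Ornstein-Uhlenbeck (OU) process)
- Can be more sample-efficient for continuous actions, because the deterministic policy gradient averages over states only rather than over states and actions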
- Used in: DDPG, TD3
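The two styles of action selection can be contrasted in a few lines for a scalar action. This is a sketch under assumed names and defaults: `sample_stochastic`, `OUNoise`, `act_deterministic`, the noise parameters, and the action bounds are all illustrative choices.

```python
import math
import random


def sample_stochastic(mean, std, rng):
    """Stochastic policy: the action is drawn from N(mean, std^2)."""
    return rng.gauss(mean, std)


class OUNoise:
    """Discretised Ornstein-Uhlenbeck process for temporally correlated noise."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu   # start at the long-run mean

    def sample(self):
        # Mean-reverting drift plus a scaled Gaussian increment.
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def act_deterministic(mu_s, noise, low=-1.0, high=1.0):
    """Deterministic policy output plus exploration noise, clipped to the bounds."""
    return max(low, min(high, mu_s + noise.sample()))
```

In the stochastic case exploration is part of the policy itself; in the deterministic case it is added from outside and is usually switched off at evaluation time.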
5. Online vs. Offline¶
Online RL: The agent interacts with the environment during training.
Offline RL (Batch RL): The agent learns entirely from a fixed dataset with no further interaction. This is critical for:
- Safety-critical domains (healthcare, autonomous driving)
- Domains where real-world interaction is expensive
- Settings where large existing datasets can be leveraged
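As a minimal sketch of the offline setting, the function below runs tabular Q-learning over a fixed list of transitions and never calls an environment. The function name, the tuple layout `(s, a, r, s2, done)`, and the hyperparameters are assumptions for illustration. This naive version takes the max over all actions, including ones that never appear in the dataset; methods such as CQL and IQL are designed to handle that problem more carefully.

```python
def offline_q_learning(dataset, n_actions, gamma=0.9, lr=0.5, sweeps=200):
    """Tabular Q-learning over a fixed dataset of (s, a, r, s2, done) tuples."""
    Q = {}  # (state, action) -> value; unseen pairs default to 0.0
    for _ in range(sweeps):
        for s, a, r, s2, done in dataset:
            # Naive backup: max over every action, seen in the data or not.
            best_next = 0.0 if done else max(
                Q.get((s2, b), 0.0) for b in range(n_actions)
            )
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + lr * (r + gamma * best_next - old)
    return Q
```

A usage example with three logged transitions:

```python
logged = [(0, 1, 0.0, 1, False), (1, 1, 1.0, 2, True), (0, 0, 0.0, 0, False)]
Q = offline_q_learning(logged, n_actions=2)
```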
Algorithm Summary Table¶
| Algorithm | Type | On/Off-Policy | Action Space | Key Idea |
|---|---|---|---|---|
| REINFORCE | Policy Gradient | On | Both | Monte Carlo policy gradient |
| DQN | Value-Based | Off | Discrete | Q-learning + neural nets + replay |
| A2C/A3C | Actor-Critic | On | Both | Parallel actors, advantage estimation |
| TRPO | Policy Gradient | On | Both | Trust region constraint |
| PPO | Policy Gradient | On | Both | Clipped surrogate objective |
| DDPG | Actor-Critic | Off | Continuous | Deterministic policy gradient + replay |
| TD3 | Actor-Critic | Off | Continuous | Twin critics, delayed updates |
| SAC | Actor-Critic | Off | Continuous | Maximum entropy framework |
| Dreamer | Model-Based | Off | Both | Learned world model + imagination |
| MuZero | Model-Based | Off | Discrete | Learned model + MCTS |
| CQL | Offline | Off | Both | Conservative Q-learning |
| IQL | Offline | Off | Both | Implicit Q-learning |
| DT | Offline | — | Both | RL as sequence modeling |
Choosing an Algorithm¶
For practical guidance:
- Continuous control (robotics, locomotion): Start with PPO (simple, robust) or SAC (better sample efficiency)
- Discrete actions (games, combinatorial): Start with DQN or PPO
- Sample efficiency matters: Use off-policy methods (SAC, TD3) or model-based (Dreamer, MBPO)
- Fixed dataset: Use offline RL (IQL, CQL)
- Sim-to-real: PPO with domain randomization is a common starting point
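The guidance above can be written down as a small lookup function. This is only a restatement of the bullets as code; the function name, its arguments, and the order in which conditions are checked are assumptions, and the discrete branch for sample efficiency is inferred from the summary table rather than stated in the list.

```python
def suggest_algorithms(action_space, fixed_dataset=False,
                       sample_efficiency_matters=False, sim_to_real=False):
    """Return starting-point algorithms for a problem description."""
    if action_space not in ("continuous", "discrete"):
        raise ValueError("action_space must be 'continuous' or 'discrete'")
    if fixed_dataset:
        return ["IQL", "CQL"]                  # offline RL
    if sim_to_real:
        return ["PPO"]                         # combined with domain randomization
    if sample_efficiency_matters:
        if action_space == "continuous":
            return ["SAC", "TD3", "Dreamer", "MBPO"]
        return ["DQN", "Dreamer"]              # inferred from the summary table
    if action_space == "continuous":
        return ["PPO", "SAC"]
    return ["DQN", "PPO"]
```

Treat the output as a shortlist to try first, not as a ranking.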
What's Next¶
- Intro to Policy Optimization — Understand the policy gradient theorem before diving into specific algorithms
- Individual algorithm pages for deep-dives