Taxonomy of RL Algorithms

The reinforcement learning (RL) algorithm landscape is vast. This page maps it along a few axes so that you can choose an algorithm for a given problem and understand the trade-offs involved.

The Big Picture

graph TD
    RL[RL Algorithms]
    RL --> MF[Model-Free]
    RL --> MB[Model-Based]

    MF --> PG[Policy Optimization]
    MF --> VB[Value-Based]
    MF --> AC[Actor-Critic]

    PG --> REINFORCE
    PG --> TRPO
    PG --> PPO

    VB --> DQN
    VB --> Rainbow

    AC --> A2C["A2C/A3C"]
    AC --> DDPG
    AC --> TD3
    AC --> SAC

    MB --> Dyna
    MB --> MBPO
    MB --> Dreamer
    MB --> MuZero

    RL --> Offline[Offline RL]
    Offline --> CQL
    Offline --> IQL
    Offline --> DT[Decision Transformer]

Key Axes of Classification

1. Model-Free vs. Model-Based

	Model-Free	Model-Based
Learns dynamics?	No	Yes, learns \(\hat{P}(s'|s,a)\)
Sample efficiency	Typically lower (needs many environment interactions)	Typically higher (can plan with the learned model)
Asymptotic performance	Often higher	Can be limited by model error
Examples	DQN, PPO, SAC	Dyna, MBPO, Dreamer, MuZero

Model-free methods learn a policy or value function directly from experience without explicitly modeling environment dynamics.

Model-based methods learn (or are given) a model of the environment and use it for planning or generating synthetic experience.
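
To make "generating synthetic experience" concrete, here is a minimal tabular sketch in the spirit of Dyna. It assumes deterministic dynamics, so the learned model is just a dictionary of remembered transitions. The names `q_update` and `dyna_step`, the 3-state chain, and the hyperparameter values are illustrative. Classic Dyna-Q samples remembered state-action pairs at random; this sketch sweeps over all of them so that the result is reproducible.

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor (illustrative value)
ALPHA = 0.5  # learning rate (illustrative value)

def q_update(Q, s, a, r, s_next, actions):
    """One tabular Q-learning backup on a single transition."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def dyna_step(Q, model, transition, actions, n_sweeps):
    """Learn from one real transition, then replay the learned model."""
    s, a, r, s_next = transition
    q_update(Q, s, a, r, s_next, actions)   # direct (model-free) update
    model[(s, a)] = (r, s_next)             # model learning (deterministic assumption)
    for _ in range(n_sweeps):               # planning: updates from simulated experience
        for (ps, pa), (pr, ps_next) in model.items():
            q_update(Q, ps, pa, pr, ps_next, actions)

Q = defaultdict(float)
model = {}
actions = ["left", "right"]
# Two real transitions in a hypothetical 3-state chain (0 -> 1 -> 2).
dyna_step(Q, model, (0, "right", 0.0, 1), actions, n_sweeps=5)
dyna_step(Q, model, (1, "right", 1.0, 2), actions, n_sweeps=5)
```

After the second call, the value of taking `"right"` in state 0 is positive even though state 0 was visited only once and before any reward was seen: the planning sweeps propagated the reward backwards through the model without further environment interaction.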

2. On-Policy vs. Off-Policy

	On-Policy	Off-Policy
Data source	Current policy \(\pi\) only	Any policy (replay buffer)
Sample efficiency	Low (discard data after update)	High (reuse old data)
Stability	Generally more stable	Can be unstable (deadly triad)
Examples	REINFORCE, PPO, A2C	DQN, DDPG, TD3, SAC

On-policy algorithms use only data collected by the current policy. After each policy update, that data is discarded and fresh rollouts are collected.

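Off-policy algorithms can learn from data collected by any policy, typically stored in a replay buffer. Because each transition can be reused across many updates, they are usually more sample-efficient. The cost is stability: combining off-policy data with bootstrapping and function approximation (the "deadly triad") can cause value estimates to diverge.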
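
A replay buffer can be sketched in a few lines. The class below is a hypothetical minimal version (uniform sampling, fixed capacity, first-in-first-out eviction); the transition layout `(s, a, r, s_next, done)` is an assumption, and real implementations usually store arrays rather than tuples.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)  # oldest transitions drop out automatically
        self.rng = random.Random(seed)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement, regardless of which
        # policy produced each transition.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add((step, 0, 1.0, step + 1, False))
batch = buffer.sample(2)
```

With a capacity of 3, only the three most recent transitions survive. An on-policy method would instead clear its rollout storage after every policy update.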

3. Value-Based vs. Policy-Based vs. Actor-Critic

Value-based methods learn an estimate of the optimal action-value function \(Q^*(s,a)\) and derive the policy implicitly by acting greedily:

\[ \pi(s) = \arg\max_a Q^*(s,a) \]

Works well for discrete action spaces
Cannot handle continuous actions directly (the argmax over a continuous action set generally has no closed form and requires an inner optimization)
Examples: DQN, Rainbow
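
For a discrete action space, the greedy policy above reduces to an argmax over a short list. The sketch below assumes a hypothetical Q-table stored as a dictionary from state to a list of Q-values; the states and numbers are made up for illustration.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-table: state -> list of Q-values, one per discrete action.
q_table = {
    "s0": [0.1, 0.7, 0.2],
    "s1": [1.5, -0.3, 1.5],
}
policy = {s: greedy_action(q) for s, q in q_table.items()}
```

Here `policy` maps `"s0"` to action 1 and `"s1"` to action 0 (the tie is broken toward the lower index). Enumerating actions like this is exactly what becomes impossible when the action space is continuous.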

Policy-based methods directly parameterize and optimize the policy \(\pi_\theta(a|s)\):

\[ \theta^* = \arg\max_\theta \mathbb{E}_{\pi_\theta} \left[ \sum_t \gamma^t r_t \right] \]

Works for both discrete and continuous actions
Can represent stochastic policies
Gradient estimates tend to have high variance
Examples: REINFORCE, TRPO, PPO
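
As a rough illustration of a Monte Carlo policy gradient, the sketch below computes a REINFORCE-style estimate for one episode. It makes two simplifying assumptions: the policy is a softmax over logits that does not depend on the state (a bandit-like setting), and each log-probability gradient is weighted by the discounted return from that step onward, a common variant of the estimator. All function names are illustrative.

```python
import math

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def reinforce_gradient(logits, actions, rewards, gamma):
    """Sum over steps of G_t * grad log pi(a_t) for a state-independent policy."""
    grad = [0.0] * len(logits)
    for a, g in zip(actions, discounted_returns(rewards, gamma)):
        for i, d in enumerate(grad_log_softmax(logits, a)):
            grad[i] += g * d
    return grad

returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
grad = reinforce_gradient([0.0, 0.0], actions=[1, 0, 1],
                          rewards=[1.0, 0.0, 2.0], gamma=0.5)
```

The returns are `[1.5, 1.0, 2.0]`, and the gradient pushes probability toward action 1, which was taken on the steps with the larger returns. Because the estimate depends on a single sampled episode, it varies a lot from episode to episode.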

Actor-Critic methods combine both:

Actor: policy \(\pi_\theta(a|s)\) (decides actions)
Critic: value function \(V_\phi(s)\) or \(Q_\phi(s,a)\) (evaluates actions)
The critic reduces the variance of policy gradient estimates
Examples: A2C, DDPG, TD3, SAC
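
One common way the critic enters the update is through a one-step advantage estimate, \(r + \gamma V(s') - V(s)\), which replaces the full Monte Carlo return as the weight on the actor's log-probability gradient. The sketch below shows only that scalar computation; the function names and the plain scalar critic are assumptions made for illustration.

```python
def td_advantage(reward, value_s, value_next, gamma, done):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * value_next  # no bootstrap past a terminal state
    return reward + bootstrap - value_s

def critic_update(value_s, advantage, lr):
    """Move V(s) toward the one-step target by a step of size lr."""
    return value_s + lr * advantage

adv = td_advantage(reward=1.0, value_s=0.5, value_next=2.0, gamma=0.9, done=False)
new_value = critic_update(0.5, adv, lr=0.1)
terminal_adv = td_advantage(reward=1.0, value_s=0.5, value_next=2.0, gamma=0.9, done=True)
```

The actor would scale its log-probability gradient by `adv`. Bootstrapping from \(V(s')\) lowers variance relative to a full-episode return, at the price of bias when the critic is inaccurate.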

4. Stochastic vs. Deterministic Policy

Stochastic policies \(\pi_\theta(a|s)\) output a probability distribution over actions:

Natural exploration through sampling
Used in: PPO, SAC, A2C

Deterministic policies \(\mu_\theta(s)\) output a single action:

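Need explicit exploration noise (e.g., Gaussian noise or an Ornstein-Uhlenbeck (OU) process)
The policy gradient does not require integrating over actions, which can make it cheaper to estimate in continuous action spaces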
Used in: DDPG, TD3
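
The difference in how actions are produced can be sketched for a one-dimensional continuous action. The Gaussian policy, the action bounds of \([-1, 1]\), and the noise scales below are assumptions chosen for illustration.

```python
import random

def stochastic_action(mean, std, rng):
    """Sample from a Gaussian policy pi(a|s) = N(mean, std^2)."""
    return rng.gauss(mean, std)

def deterministic_action(mu, noise_std, rng, low=-1.0, high=1.0):
    """Deterministic output mu(s) plus Gaussian exploration noise, clipped to bounds."""
    noisy = mu + rng.gauss(0.0, noise_std)
    return max(low, min(high, noisy))

rng = random.Random(0)
a_stochastic = stochastic_action(mean=0.2, std=0.5, rng=rng)
a_explore = deterministic_action(mu=0.2, noise_std=0.1, rng=rng)
a_eval = deterministic_action(mu=0.2, noise_std=0.0, rng=rng)  # no noise at evaluation
```

In the stochastic case the randomness is part of the policy itself; in the deterministic case it is added from outside during training and switched off at evaluation time.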

5. Online vs. Offline

Online RL: The agent interacts with the environment during training.

Offline RL (also called batch RL): The agent learns entirely from a fixed dataset with no further interaction. This is especially useful for:

Safety-critical domains (healthcare, autonomous driving)
Settings where real-world interaction is expensive
Leveraging existing large-scale datasets
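
The defining constraint, learning with no environment calls, can be illustrated with plain tabular Q-learning run over a fixed list of logged transitions. This is a naive baseline sketched under assumptions (hypothetical two-step task, made-up hyperparameters), not one of the dedicated offline algorithms.

```python
from collections import defaultdict

def batch_q_learning(dataset, actions, gamma=0.9, lr=0.5, epochs=50):
    """Tabular Q-learning over a fixed list of (s, a, r, s_next, done) tuples.

    No environment is queried: the dataset is the only source of experience.
    """
    Q = defaultdict(float)
    for _ in range(epochs):
        for s, a, r, s_next, done in dataset:
            if done:
                target = r
            else:
                target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += lr * (target - Q[(s, a)])
    return Q

# Hypothetical logged data from a 2-step task; action 1 is never tried in state "B".
dataset = [
    ("A", 0, 0.0, "B", False),
    ("B", 0, 1.0, "end", True),
]
Q = batch_q_learning(dataset, actions=[0, 1])
```

The value for the unseen pair `("B", 1)` stays at its initial value of 0 because no data ever corrects it. In a table that is harmless, but with function approximation the estimate for an unseen action can be arbitrarily wrong and still be picked by the max, which is the problem that methods such as CQL and IQL are designed to address.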

Algorithm Summary Table

Algorithm	Type	On/Off-Policy	Action Space	Key Idea
REINFORCE	Policy Gradient	On	Both	Monte Carlo policy gradient
DQN	Value-Based	Off	Discrete	Q-learning + neural nets + replay
A2C/A3C	Actor-Critic	On	Both	Parallel actors, advantage estimation
TRPO	Policy Gradient	On	Both	Trust region constraint
PPO	Policy Gradient	On	Both	Clipped surrogate objective
DDPG	Actor-Critic	Off	Continuous	Deterministic policy gradient + replay
TD3	Actor-Critic	Off	Continuous	Twin critics, delayed updates
SAC	Actor-Critic	Off	Continuous	Maximum entropy framework
Dreamer	Model-Based	Off	Both	Learned world model + imagination
MuZero	Model-Based	Off	Discrete	Learned model + MCTS
CQL	Offline	Off	Both	Conservative Q-learning: penalizes Q-values of out-of-distribution actions
IQL	Offline	Off	Both	Implicit Q-learning: avoids querying actions outside the dataset
DT	Offline	—	Both	RL as sequence modeling

Choosing an Algorithm

For practical guidance:

Continuous control (robotics, locomotion): Start with PPO (simple, robust) or SAC (better sample efficiency)
Discrete actions (games, combinatorial): Start with DQN or PPO
When sample efficiency matters: Use off-policy methods (SAC, TD3) or model-based methods (Dreamer, MBPO)
Fixed dataset: Use offline RL (IQL, CQL)
Sim-to-real: PPO with domain randomization is a common starting point
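
The guidance above can be written down as a small lookup function. This is only a sketch: the function name, its parameters, and the order in which the conditions are checked are assumptions, and the discrete-action branch for sample efficiency is inferred from the summary table (DQN is off-policy and discrete, Dreamer handles both) rather than stated in the list.

```python
def suggest_algorithms(action_space, fixed_dataset=False,
                       sample_efficient=False, sim_to_real=False):
    """Map a coarse problem description to suggested starting points.

    action_space is "continuous" or "discrete". The priority order of the
    flags is an assumption of this sketch.
    """
    if action_space not in ("continuous", "discrete"):
        raise ValueError("action_space must be 'continuous' or 'discrete'")
    if fixed_dataset:
        return ["IQL", "CQL"]
    if sim_to_real:
        return ["PPO + domain randomization"]
    if sample_efficient:
        if action_space == "continuous":
            return ["SAC", "TD3", "Dreamer", "MBPO"]
        return ["DQN", "Dreamer"]  # inferred from the summary table
    if action_space == "continuous":
        return ["PPO", "SAC"]
    return ["DQN", "PPO"]

robot_arm = suggest_algorithms("continuous")
board_game = suggest_algorithms("discrete")
logged_data = suggest_algorithms("continuous", fixed_dataset=True)
```

Treat the output as a starting point for experiments, not a ranking.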

What's Next

Intro to Policy Optimization — Understand the policy gradient theorem before diving into specific algorithms
Individual algorithm pages for deep-dives