Model-Based Reinforcement Learning
Model-based RL learns a model of the environment's dynamics and uses it for planning, for generating synthetic training data, or both. Because the learned model can stand in for some of the real interaction, these methods can be far more sample-efficient than model-free approaches, with reported reductions in environment steps often in the 10-100x range.
Why Model-Based?
| Aspect | Model-Free | Model-Based |
|---|---|---|
| Sample efficiency | Low (10⁶+ steps) | High (10³-10⁵ steps) |
| Asymptotic performance | Often better | Limited by model accuracy |
| Computation per step | Low | Higher (planning) |
| Generalization | Limited | Can generalize via model |
The core trade-off: model-based methods are more sample-efficient but introduce model bias — errors in the learned model can compound during planning.
The Dyna Architecture
Dyna (Sutton, 1991) is the foundational framework for model-based RL. The key idea: supplement real experience with simulated experience from a learned model.
    graph LR
        E[Real Environment] -->|real experience| A[Agent / Policy]
        A -->|actions| E
        E -->|transitions| M[Learned Model]
        M -->|simulated experience| A
The Dyna loop:
- Act in the real environment, collect \((s, a, r, s')\)
- Update the model \(\hat{P}(s'|s,a)\), \(\hat{R}(s,a)\) with real data
- Generate simulated transitions from the model
- Update the policy/value function with both real and simulated data
Dyna-Q
The simplest Dyna variant applies Q-learning to both real and simulated transitions:
Pseudocode: Dyna-Q

    Initialize Q(s,a), Model(s,a)
    for each step do
        s ← current state
        a ← ε-greedy from Q(s,·)
        Execute a, observe r, s'
        Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
        Model(s,a) ← (r, s')                    // update model
        for k = 1, ..., K do                    // K planning steps
            s̃, ã ← random previously visited (s,a)
            r̃, s̃' ← Model(s̃, ã)                  // simulate
            Q(s̃,ã) ← Q(s̃,ã) + α[r̃ + γ max_a' Q(s̃',a') - Q(s̃,ã)]
        end for
    end for
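
The pseudocode maps onto a short tabular implementation. The following is a minimal sketch, assuming a deterministic environment; the five-state corridor (`step`, `N_STATES`, `GOAL`), the `done` flag stored in the model, and all hyperparameter values are illustrative assumptions rather than part of Dyna-Q itself.

```python
import random
from collections import defaultdict

N_STATES, GOAL = 5, 4          # hypothetical corridor: states 0..4, reward at state 4
ACTIONS = (-1, +1)             # move left / move right

def step(s, a):
    """Toy deterministic environment (an assumption for this sketch)."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return reward, s_next, s_next == GOAL

def dyna_q(episodes=50, K=10, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)     # Q[(s, a)], defaults to 0
    model = {}                 # model[(s, a)] = (r, s', done)

    def greedy(s):
        best = max(Q[(s, a)] for a in ACTIONS)
        return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])

    def q_update(s, a, r, s_next, done):
        # One Q-learning backup; terminal transitions do not bootstrap.
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS) if rng.random() < eps else greedy(s)
            r, s_next, done = step(s, a)
            q_update(s, a, r, s_next, done)           # learn from real experience
            model[(s, a)] = (r, s_next, done)         # update the (deterministic) model
            for _ in range(K):                        # K planning steps
                ps, pa = rng.choice(list(model))      # previously visited (s, a)
                pr, ps_next, pdone = model[(ps, pa)]  # simulate
                q_update(ps, pa, pr, ps_next, pdone)  # learn from simulated experience
            s = s_next
    return Q

Q = dyna_q()
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(policy)                  # greedy action for each non-terminal state
```

Setting `K=0` turns this into plain Q-learning, which makes it easy to compare how many real steps each variant needs on the same task.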
MBPO (Model-Based Policy Optimization)
MBPO (Janner et al., 2019) provides theoretical grounding for when and how to use a learned model with policy optimization.
Key Insight
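MBPO shows that monotonic policy improvement is possible under a learned model if rollouts are kept short (to limit compounding model error). For rollouts of length \(k\) branched from real states, the true return \(\eta[\pi]\) is bounded below by the return under the branched rollouts, \(\eta^{\text{branch}}[\pi]\), minus a penalty (Theorem 4.2 in Janner et al., 2019):

\[
\eta[\pi] \geq \eta^{\text{branch}}[\pi] - 2 r_{\max} \left[ \frac{\gamma^{k+1} \epsilon_\pi}{(1-\gamma)^2} + \frac{(\gamma^k + 2)\, \epsilon_\pi}{1-\gamma} + \frac{k}{1-\gamma} (\epsilon_m + 2\epsilon_\pi) \right]
\]

where \(k\) is the rollout length, \(\epsilon_m\) is the model error, \(\epsilon_\pi\) is the policy shift, \(r_{\max}\) is the maximum reward magnitude, and \(\gamma\) is the discount factor. The model-error term grows linearly in \(k\), which is why short rollouts limit the damage.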
Algorithm
- Train an ensemble of dynamics models \(\{f_{\theta_i}\}_{i=1}^{N}\) on real data
- Use the models to generate short rollouts (branched from real states)
- Add synthetic data to a model replay buffer
- Train the policy with SAC using both real and model data
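
The rollout and buffer steps above can be sketched as follows. This is a minimal illustration, not the reference implementation: the `models` are hand-written stand-ins for learned ensemble members, `policy` is a fixed function, and choosing a random ensemble member at each step is one simple way to use the ensemble. All names are hypothetical.

```python
import random

def branched_rollouts(real_states, models, policy, k, rng):
    """Generate k-step model rollouts that start from real states.

    models: list of callables (s, a) -> (r, s_next); hypothetical stand-ins
            for learned ensemble members.
    policy: callable s -> a.
    """
    synthetic = []
    for s in real_states:                 # branch from each sampled real state
        for _ in range(k):
            a = policy(s)
            model = rng.choice(models)    # pick a random ensemble member per step
            r, s_next = model(s, a)
            synthetic.append((s, a, r, s_next))
            s = s_next
    return synthetic

# Toy usage with two slightly different hand-written "models" of s' = s + a.
models = [lambda s, a: (-abs(s + a), s + a),
          lambda s, a: (-abs(s + 1.1 * a), s + 1.1 * a)]
policy = lambda s: -0.5 * s               # drive the state toward zero
real_buffer_states = [1.0, -2.0, 0.5]

model_buffer = branched_rollouts(real_buffer_states, models, policy, k=3,
                                 rng=random.Random(0))
print(len(model_buffer))                  # 3 start states x 3 steps = 9 transitions
```

Because every rollout starts from a state the agent actually visited, model error can only accumulate over `k` steps before the next rollout is re-anchored to real data.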
The model ensemble:
- Captures epistemic uncertainty — disagreement between ensemble members indicates uncertain predictions
- Reduces overfitting to model errors
- Typical ensemble size: 5-7 models
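
To make the disagreement idea concrete, here is a small sketch under toy assumptions: each ensemble member is a hypothetical scalar predictor (`make_model`) that agrees with the others near \(s = 0\) and drifts apart further away, mimicking models that were all fit on data near the origin. The standard deviation across members serves as the uncertainty signal.

```python
import statistics

def make_model(c):
    # Hypothetical ensemble member: shared linear part plus a member-specific
    # quadratic term that only matters far from s = 0.
    return lambda s, a: s + a + c * s * s

models = [make_model(c) for c in (-0.02, -0.01, 0.0, 0.01, 0.03)]

def ensemble_predict(models, s, a):
    """Return the mean next-state prediction and the ensemble disagreement."""
    preds = [m(s, a) for m in models]
    return statistics.mean(preds), statistics.pstdev(preds)

for s in (0.0, 1.0, 10.0):
    mean, std = ensemble_predict(models, s, a=0.5)
    print(f"s={s:5.1f}  mean={mean:7.3f}  disagreement={std:.3f}")
```

The disagreement is zero at the origin and grows with distance from it, which is the pattern one would use to flag predictions that should not be trusted.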
Adaptive Rollout Length
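MBPO adapts the rollout horizon to model accuracy. In the paper's experiments this is done with a schedule that lengthens rollouts over the course of training, as the model sees more data and improves: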
- Accurate model → longer rollouts (more synthetic data)
- Inaccurate model → shorter rollouts (stay close to real data)
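
A linear, epoch-based schedule is one simple way to implement this. The sketch below assumes that form; the function name and the endpoint values (`k_min`, `k_max`, `start_epoch`, `end_epoch`) are illustrative placeholders, not recommended settings.

```python
def rollout_length(epoch, k_min=1, k_max=15, start_epoch=20, end_epoch=100):
    """Linearly increase the rollout horizon from k_min to k_max.

    Before start_epoch the horizon stays at k_min; after end_epoch it stays
    at k_max. All default values are placeholders for illustration.
    """
    frac = (epoch - start_epoch) / (end_epoch - start_epoch)
    frac = min(max(frac, 0.0), 1.0)       # clamp to [0, 1]
    return int(k_min + frac * (k_max - k_min))

for epoch in (0, 20, 60, 100, 150):
    print(epoch, rollout_length(epoch))
```

Early in training, when the model has seen little data, rollouts stay at a single step; the horizon only grows once more real transitions have been collected.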
Dreamer (Dream to Control)
The Dreamer family (Hafner et al., 2020, 2021, 2023) learns a world model and trains the policy entirely in "imagination" — latent-space rollouts from the learned model.
World Model Architecture
Dreamer uses a Recurrent State-Space Model (RSSM):
- Representation model: \(q(z_t | z_{t-1}, a_{t-1}, o_t)\) — encodes observations into latent states
- Transition model: \(p(z_t | z_{t-1}, a_{t-1})\) — predicts next latent state (without observation)
- Observation model: \(p(o_t | z_t)\) — decodes latent states back to observations
- Reward model: \(p(r_t | z_t)\) — predicts rewards from latent states
Imagination-Based Training
Once the world model is learned:
- Imagine trajectories by rolling out the transition model in latent space
- Predict rewards and values along imagined trajectories
- Backpropagate through the entire imagined trajectory to update the actor
This is efficient because latent-space rollouts are much cheaper than real environment interactions.
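
The first two steps can be sketched without any deep-learning library. The example below is a toy under stated assumptions: the latent state is a single float, and `transition`, `policy`, `reward`, and `value` are hand-written stand-ins for the learned networks. Predicted rewards and values are combined into λ-returns, a common choice of learning target for imagined trajectories. The gradient step through the trajectory needs an autodiff framework and is omitted here.

```python
def imagine(z0, transition, policy, reward, value, horizon):
    """Roll out the transition model in latent space, with no environment calls."""
    z, rewards, values = z0, [], [value(z0)]
    for _ in range(horizon):
        a = policy(z)
        z = transition(z, a)              # predicted next latent state
        rewards.append(reward(z))         # predicted reward at that state
        values.append(value(z))           # predicted value at that state
    return rewards, values                # lengths: horizon, horizon + 1

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Mix n-step returns along an imagined trajectory.

    returns[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1]
                                       + lam * returns[t + 1]),
    bootstrapping from the final value estimate.
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[horizon]
    for t in reversed(range(horizon)):
        returns[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1]
                                           + lam * next_return)
        next_return = returns[t]
    return returns

# Hypothetical stand-ins for the learned components.
transition = lambda z, a: 0.9 * z + a
policy = lambda z: -0.1 * z
reward = lambda z: -z * z
value = lambda z: -5.0 * z * z

rewards, values = imagine(1.0, transition, policy, reward, value, horizon=5)
print(lambda_returns(rewards, values))
```

With `lam=0` the targets reduce to one-step bootstrapped estimates, and with `lam=1` they become the discounted sum of imagined rewards plus the final value.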
Dreamer Versions
| Version | Key Improvement |
|---|---|
| Dreamer (v1) | RSSM + imagination-based actor-critic |
| DreamerV2 | Discrete latent representations (categorical) |
| DreamerV3 | Symlog predictions, scales across domains without hyperparameter tuning |
DreamerV3 is notable for using a single set of hyperparameters across very different domains: Atari, DMControl, Minecraft, and more.
MuZero
MuZero (Schrittwieser et al., 2020) combines learned models with Monte Carlo Tree Search (MCTS). Unlike Dreamer, MuZero learns a model that predicts only what's needed for planning: reward, value, and policy — not observations.
Three Learned Functions
- Representation: \(h(o_1, \ldots, o_t) \to s^0\) — maps observations to latent state
- Dynamics: \(g(s^k, a^{k+1}) \to s^{k+1}, r^{k+1}\) — predicts next latent state and reward
- Prediction: \(f(s^k) \to p^k, v^k\) — predicts policy and value from latent state
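
The way these three functions compose can be shown with a small sketch. Everything here is a toy assumption: `h`, `g`, and `f` are hand-written functions on a scalar latent state rather than neural networks, and `unroll` is a hypothetical helper name. The point is only the data flow: observations enter once through `h`, and every later step uses the latent state alone.

```python
def unroll(h, g, f, observations, actions):
    """Apply representation once, then alternate dynamics and prediction."""
    s = h(observations)                   # s^0 from the observation history
    p, v = f(s)
    outputs = [(None, p, v)]              # no reward is predicted at the root
    for a in actions:
        s, r = g(s, a)                    # next latent state and reward
        p, v = f(s)                       # policy and value at that state
        outputs.append((r, p, v))
    return outputs

# Hypothetical stand-ins for the learned functions.
h = lambda obs: sum(obs) / len(obs)                       # representation
g = lambda s, a: (s + a, -abs(s + a))                     # dynamics
f = lambda s: ((1.0, 0.0) if s > 0 else (0.0, 1.0),       # policy over (-1, +1)
               -abs(s))                                   # value

for reward, policy, value in unroll(h, g, f, [1.0, 3.0], actions=[-1, -1]):
    print(reward, policy, value)
```

Note that nothing in `unroll` ever reconstructs an observation; the latent state only has to support accurate reward, policy, and value predictions.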
Planning with MCTS
At each real step, MuZero performs MCTS using the learned model:
- Start from the current latent state \(s^0 = h(o_1, \ldots, o_t)\)
- Simulate forward using dynamics \(g\) and select actions via PUCT (like AlphaZero)
- Use the resulting search policy for action selection
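
The PUCT selection rule in the second step can be sketched as a scoring function over a node's children. This is a simplified illustration: the dictionary layout and function names are assumptions, the constants `c1` and `c2` follow the values reported in the MuZero paper, and the value normalization MuZero applies to `q` is omitted.

```python
import math

def puct_score(q, prior, n_parent, n_child, c1=1.25, c2=19652):
    """Mean value plus a prior-weighted exploration bonus."""
    exploration = c1 + math.log((n_parent + c2 + 1) / c2)
    return q + prior * math.sqrt(n_parent) / (1 + n_child) * exploration

def select_action(children):
    """children: dict action -> {"q": mean value, "p": prior, "n": visit count}."""
    n_parent = sum(c["n"] for c in children.values())
    return max(children,
               key=lambda a: puct_score(children[a]["q"], children[a]["p"],
                                        n_parent, children[a]["n"]))

children = {
    "left":  {"q": 0.5, "p": 0.2, "n": 10},   # well explored, slightly better value
    "right": {"q": 0.4, "p": 0.8, "n": 1},    # high prior, barely explored
}
print(select_action(children))
```

Here the barely visited action with the high prior wins despite its lower mean value, because the exploration bonus shrinks as a child's visit count grows.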
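MuZero matched AlphaZero's superhuman performance in Go, chess, and shogi and set a new state of the art on the Atari benchmark, all with a single algorithm and without being given the rules of any game.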
When to Use Model-Based Methods
Use model-based when:
- Sample efficiency is critical (real robot, expensive simulation)
- The environment has learnable dynamics (smooth, deterministic-ish)
- You need planning or lookahead capabilities
Prefer model-free when:
- Unlimited environment interactions are available
- Dynamics are too complex to model accurately
- You need maximum asymptotic performance
Connection to World Models
Model-based RL is closely related to World Models (Part II). The distinction:
- Model-based RL: model is a tool for policy optimization
- World Models: model is a core cognitive capability (representation, prediction, imagination)
The Dreamer line of work sits at the intersection, and foundation world models further blur this boundary.
Key References
- Sutton, R.S. (1991). "Dyna, an Integrated Architecture for Learning, Planning, and Reacting." SIGART Bulletin.
- Janner, M., Fu, J., Zhang, M., Levine, S. (2019). "When to Trust Your Model: Model-Based Policy Optimization." NeurIPS.
- Hafner, D., Lillicrap, T., Ba, J., Norouzi, M. (2020). "Dream to Control: Learning Behaviors by Latent Imagination." ICLR.
- Hafner, D., Lillicrap, T., Norouzi, M., Ba, J. (2021). "Mastering Atari with Discrete World Models." ICLR.
- Hafner, D., et al. (2023). "Mastering Diverse Domains through World Models." arXiv:2301.04104.
- Schrittwieser, J., et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model." Nature.