Planning with World Models
Given a learned world model, how do we use it for decision-making? This page covers planning algorithms that use world models, from simple shooting methods to tree search and imagination-based policy learning.
The Planning Problem
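Given a world model \(f_\theta\) (transition dynamics) and a reward model \(r_\theta\), find an action sequence over a planning horizon \(H\) that maximizes the expected cumulative predicted reward:
\[a^*_{t:t+H-1} = \arg\max_{a_{t:t+H-1}} \mathbb{E}\left[\sum_{k=0}^{H-1} \hat{r}_{t+k}\right]\]
where \(\hat{s}_t = s_t\) is the current state, \(\hat{s}_{t+k+1} = f_\theta(\hat{s}_{t+k}, a_{t+k})\), and \(\hat{r}_{t+k} = r_\theta(\hat{s}_{t+k}, a_{t+k})\).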
Random Shooting
The simplest approach: sample many random action sequences and pick the best one.
- Sample \(N\) action sequences: \(\{a_{t:t+H-1}^{(i)}\}_{i=1}^{N}\)
- Roll out each through the model to get predicted returns
- Execute the first action of the best sequence
Pros: Simple, parallelizable
Cons: Inefficient in high-dimensional action spaces; the number of samples needed grows rapidly with the horizon
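The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: `toy_model` is a hypothetical stand-in for \(f_\theta\) and \(r_\theta\) (a 1-D point that should reach \(x = 1\)), and the action bounds \([-1, 1]\), horizon, and sample count are arbitrary assumptions.

```python
import numpy as np

def toy_model(states, actions):
    """Hypothetical stand-in for f_theta and r_theta (not a learned model)."""
    next_states = states + 0.1 * actions
    rewards = -(next_states - 1.0) ** 2
    return next_states, rewards

def random_shooting(state, model, horizon=10, n_samples=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # 1. Sample N action sequences uniformly within the assumed bounds [-1, 1].
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    # 2. Roll out every sequence through the model in parallel.
    states = np.full(n_samples, state, dtype=float)
    returns = np.zeros(n_samples)
    for k in range(horizon):
        states, rewards = model(states, actions[:, k])
        returns += rewards
    # 3. Return the first action of the best sequence (and its predicted return).
    best = np.argmax(returns)
    return actions[best, 0], returns[best]

first_action, best_return = random_shooting(
    0.0, toy_model, rng=np.random.default_rng(0)
)
```

Because all rollouts are independent, the loop over samples is vectorized; only the loop over the horizon is sequential.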
Cross-Entropy Method (CEM)
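CEM iteratively refines a sampling distribution over action sequences, typically a diagonal Gaussian:
- Initialize the mean and standard deviation of the distribution over \(a_{t:t+H-1}\)
- Sample \(N\) action sequences and roll out each through the model
- Keep the elites, i.e. the sequences with the highest predicted returns
- Refit the mean and standard deviation to the elites
- Repeat for \(I\) iterations, then execute the first action of the final mean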
CEM is widely used in model-based RL (e.g., PlaNet, PETS) due to its simplicity and effectiveness.
Key parameters:
- Population size \(N\) (typically 500-1000)
- Elite fraction (typically top 10%)
- Number of iterations \(I\) (typically 5-10)
- Planning horizon \(H\) (typically 5-30 steps)
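A minimal sketch of CEM planning with the parameters listed above follows. It assumes a diagonal Gaussian over action sequences, actions clipped to \([-1, 1]\), and a hypothetical `toy_model` in place of a learned model; the default values are illustrative picks from the typical ranges, not tuned settings.

```python
import numpy as np

def toy_model(states, actions):
    """Hypothetical stand-in for f_theta and r_theta: a 1-D point, goal at x = 1."""
    next_states = states + 0.1 * actions
    return next_states, -(next_states - 1.0) ** 2

def rollout_returns(state, action_seqs, model):
    """Predicted return of each action sequence, shape (N,)."""
    states = np.full(action_seqs.shape[0], state, dtype=float)
    returns = np.zeros(action_seqs.shape[0])
    for k in range(action_seqs.shape[1]):
        states, rewards = model(states, action_seqs[:, k])
        returns += rewards
    return returns

def cem_plan(state, model, horizon=10, pop_size=500, elite_frac=0.1,
             n_iters=5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mean, std = np.zeros(horizon), np.ones(horizon)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(n_iters):
        # Sample a population from the current Gaussian and clip to the bounds.
        samples = mean + std * rng.standard_normal((pop_size, horizon))
        samples = np.clip(samples, -1.0, 1.0)
        returns = rollout_returns(state, samples, model)
        # Keep the top sequences and refit the Gaussian to them.
        elites = samples[np.argsort(returns)[-n_elite:]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # small floor avoids a collapsed distribution
    return mean

plan = cem_plan(0.0, toy_model, rng=np.random.default_rng(0))
plan_return = rollout_returns(0.0, plan[None, :], toy_model)[0]
```

In an MPC setting only `plan[0]` would be executed; some implementations also reuse the shifted mean to warm-start the next planning step.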
Model Predictive Control (MPC)
MPC is a receding horizon approach:
- At each real timestep, plan \(H\) steps ahead using the model
- Execute only the first action of the planned sequence
- Re-plan at the next timestep with updated state
MPC loop: Current State → Plan H steps → Execute first action → Observe new state → Current State
MPC limits the impact of model errors: every plan starts from the observed state, so prediction errors cannot compound beyond a single planning horizon.
Used by: PETS (Chua et al., 2018), PlaNet (Hafner et al., 2019)
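The receding-horizon loop can be sketched as below. Everything here is an illustrative assumption: `model_step` plays the learned model, `env_step` plays the real environment with a deliberately different gain (to mimic model error), and the planner is plain random shooting to keep the example short.

```python
import numpy as np

def model_step(states, actions):
    """Hypothetical learned model: assumes a gain of 0.1."""
    next_states = states + 0.1 * actions
    return next_states, -(next_states - 1.0) ** 2

def env_step(state, action):
    """Hypothetical real environment: the true gain is 0.08, so the model is wrong."""
    return state + 0.08 * action

def plan(state, horizon, n_samples, rng):
    """Random-shooting planner; returns the best H-step action sequence."""
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    states = np.full(n_samples, state, dtype=float)
    returns = np.zeros(n_samples)
    for k in range(horizon):
        states, rewards = model_step(states, actions[:, k])
        returns += rewards
    return actions[np.argmax(returns)]

def run_mpc(state=0.0, n_steps=40, horizon=5, n_samples=256, seed=0):
    rng = np.random.default_rng(seed)
    trajectory = [state]
    for _ in range(n_steps):
        action_seq = plan(state, horizon, n_samples, rng)  # plan H steps ahead
        state = env_step(state, action_seq[0])             # execute first action only
        trajectory.append(state)                           # observe, then re-plan
    return np.array(trajectory)

trajectory = run_mpc()
```

Although the model overestimates the effect of each action, re-planning from the observed state at every step still drives the state toward the goal in this toy setting.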
PETS (Probabilistic Ensembles with Trajectory Sampling)
PETS combines ensemble models with CEM planning:
- Train an ensemble of \(B\) probabilistic dynamics models, each predicting a distribution over the next state
- For each CEM candidate, propagate particles with trajectory sampling: each particle randomly switches between ensemble members at each timestep
- This propagates both aleatoric (environment) and epistemic (model) uncertainty through the plan
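The trajectory-sampling step can be sketched as follows, using the per-timestep switching described above. The ensemble is a hypothetical toy: five members that disagree on a gain (standing in for epistemic uncertainty) and share a fixed predicted noise level (standing in for aleatoric uncertainty). The number of particles is also an assumption.

```python
import numpy as np

# Hypothetical ensemble of B = 5 probabilistic models of a 1-D system.
GAINS = np.array([0.08, 0.10, 0.12, 0.09, 0.11])  # member disagreement (epistemic)
NOISE_STD = 0.01                                   # predicted noise (aleatoric)

def trajectory_sampling_returns(state, action_seqs, n_particles=20, rng=None):
    """Mean predicted return per action sequence, averaged over particles."""
    rng = np.random.default_rng() if rng is None else rng
    n_seqs, horizon = action_seqs.shape
    states = np.full((n_seqs, n_particles), state, dtype=float)
    returns = np.zeros((n_seqs, n_particles))
    for k in range(horizon):
        # Each particle picks a (possibly different) ensemble member this step.
        member = rng.integers(len(GAINS), size=states.shape)
        mean = states + GAINS[member] * action_seqs[:, k, None]
        # Sample from the chosen member's predicted next-state distribution.
        states = mean + NOISE_STD * rng.standard_normal(states.shape)
        returns += -(states - 1.0) ** 2  # toy reward: stay near x = 1
    return returns.mean(axis=1)

toward_goal = trajectory_sampling_returns(
    0.0, np.ones((1, 10)), rng=np.random.default_rng(0)
)
away_from_goal = trajectory_sampling_returns(
    0.0, -np.ones((1, 10)), rng=np.random.default_rng(0)
)
```

The returned scores can be passed to CEM in place of a single-model rollout, so that candidates are ranked by their average return under model uncertainty.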
Monte Carlo Tree Search (MCTS)
MCTS builds a search tree by balancing exploration and exploitation:
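- Selection: Traverse the tree using UCT (Upper Confidence bounds applied to Trees)
- Expansion: Add a new child node for an untried action
- Simulation: Roll out a default (often random) policy from the new node to estimate its value
- Backpropagation: Update visit counts and value estimates along the path back to the root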
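The four phases can be sketched on a toy problem. The model below is a hypothetical deterministic walk on a line (reward 1 for reaching `GOAL` within `MAX_DEPTH` steps); the exploration constant and simulation count are arbitrary assumptions, and rewards are undiscounted for simplicity.

```python
import math
import random

ACTIONS = (-1, +1)
GOAL, MAX_DEPTH = 3, 4

def step(state, action):
    """Hypothetical deterministic model: returns (next_state, reward, done)."""
    nxt = state + action
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

class Node:
    def __init__(self, state, depth, terminal=False):
        self.state, self.depth, self.terminal = state, depth, terminal
        self.children = {}  # action -> (reward, Node)
        self.visits, self.value_sum = 0, 0.0

def uct_select(node, c=1.4):
    def score(item):
        _, (reward, child) = item
        q = reward + child.value_sum / child.visits
        return q + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.items(), key=score)[0]

def rollout(state, depth, rng):
    total = 0.0
    while depth < MAX_DEPTH:
        state, reward, done = step(state, rng.choice(ACTIONS))
        total, depth = total + reward, depth + 1
        if done:
            break
    return total

def simulate(node, rng):
    """One MCTS iteration; returns the sampled return from `node` onward."""
    if node.terminal or node.depth >= MAX_DEPTH:
        ret = 0.0
    elif len(node.children) < len(ACTIONS):
        # Expansion: add a child for the first untried action.
        action = next(a for a in ACTIONS if a not in node.children)
        nxt, reward, done = step(node.state, action)
        child = Node(nxt, node.depth + 1, done)
        node.children[action] = (reward, child)
        # Simulation: random rollout from the new node.
        child_ret = 0.0 if done else rollout(nxt, child.depth, rng)
        child.visits += 1
        child.value_sum += child_ret
        ret = reward + child_ret
    else:
        # Selection: descend using UCT.
        action = uct_select(node)
        reward, child = node.children[action]
        ret = reward + simulate(child, rng)
    # Backpropagation: update statistics on the way back up.
    node.visits += 1
    node.value_sum += ret
    return ret

def mcts(state, n_simulations=200, seed=0):
    rng = random.Random(seed)
    root = Node(state, depth=0)
    for _ in range(n_simulations):
        simulate(root, rng)
    # Act with the most-visited root action.
    return max(root.children, key=lambda a: root.children[a][1].visits)

best_action = mcts(0)
```

From state 0 the goal is only reachable by moving right three times, so the search concentrates its visits on the `+1` branch.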
MCTS with Learned Models
AlphaZero (Silver et al., 2018): MCTS over the known game rules, with a neural network providing:
- Prior policy \(p(a|s)\) for guiding search
- Value estimate \(v(s)\) for evaluating leaf nodes in place of random rollouts
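To illustrate how the prior and value enter the search, the sketch below scores actions with a PUCT-style rule in the form popularized by AlphaGo Zero, \(Q(s,a) + c_{\text{puct}} \, p(a|s) \, \sqrt{\sum_b N(s,b)} / (1 + N(s,a))\). The constant `c_puct` and the example numbers are assumptions chosen for illustration; published systems use more elaborate schedules for the exploration constant.

```python
import numpy as np

def puct_scores(q_values, priors, visit_counts, c_puct=1.5):
    """Selection score per action: value estimate plus prior-weighted exploration."""
    total_visits = visit_counts.sum()
    exploration = c_puct * priors * np.sqrt(total_visits) / (1.0 + visit_counts)
    return q_values + exploration

# Hypothetical statistics at one tree node with three legal actions.
q_values = np.array([0.1, 0.5, 0.2])      # mean backed-up values Q(s, a)
priors = np.array([0.6, 0.3, 0.1])        # network prior p(a|s)
visit_counts = np.array([10.0, 2.0, 0.0])  # N(s, a)

scores = puct_scores(q_values, priors, visit_counts)
selected = int(np.argmax(scores))
```

The exploration term shrinks as an action accumulates visits, so a high prior steers early simulations while the backed-up values dominate later.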
MuZero (Schrittwieser et al., 2020): MCTS with a fully learned model:
- No access to true environment rules
- Learned dynamics model for tree expansion
- Matched AlphaZero's superhuman performance in Go, chess, and shogi, and set a new state of the art on the Atari benchmark
MCTS vs. Shooting Methods
| Aspect | Shooting / CEM | MCTS |
|---|---|---|
| Action space | Continuous (natural) | Discrete (natural) |
| Planning depth | Moderate (5-30) | Deep (hundreds) |
| Computation | Moderate | High |
| Best for | Continuous control | Games, structured problems |
Imagination-Based Policy Learning
Instead of planning online (at every timestep), learn a policy by training on imagined rollouts from the model. This amortizes the planning cost into the policy network.
Dreamer's Approach
Dreamer (Hafner et al., 2020) trains actor-critic networks purely in imagination:
- World model generates imagined trajectories in latent space: \(z_{t+1} = \text{dyn}_\theta(z_t, a_t)\)
- Critic estimates values along imagined trajectories: \(V_\psi(z_t) \approx \mathbb{E}\left[\sum_{k=0}^{H} \gamma^k r_{t+k}\right]\)
- Actor is updated to maximize imagined returns: \(\max_\phi \mathbb{E}\left[\sum_{k=0}^{H} \gamma^k \left(r_{t+k} + \eta \mathcal{H}(\pi_\phi(\cdot|z_{t+k}))\right)\right]\)
- Backpropagate through the entire imagined trajectory (straight-through gradients for discrete actions in DreamerV2/V3)
Key advantage: After training, acting requires only a forward pass through the policy network, so no expensive search is needed at decision time.
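In practice the imagined returns are usually bootstrapped with the critic rather than summed to a hard cutoff. The sketch below computes \(\lambda\)-returns in the recursive form \(V^\lambda_k = r_k + \gamma\left[(1-\lambda)\,v_{k+1} + \lambda V^\lambda_{k+1}\right]\) with \(V^\lambda_H = v_H\), which is the form used in the DreamerV2 paper. The function name, array layout, and default values are assumptions for illustration.

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Lambda-returns along one imagined trajectory.

    rewards: length H, imagined rewards r_t .. r_{t+H-1}
    values:  length H + 1, critic estimates for z_t .. z_{t+H}
    """
    horizon = len(rewards)
    returns = np.zeros(horizon)
    next_return = values[-1]  # bootstrap from the critic at the last imagined state
    for k in reversed(range(horizon)):
        next_return = rewards[k] + gamma * (
            (1.0 - lam) * values[k + 1] + lam * next_return
        )
        returns[k] = next_return
    return returns

# Hypothetical imagined rollout of length 3.
rewards = np.ones(3)
values = np.array([0.0, 1.0, 2.0, 3.0])
one_step = lambda_returns(rewards, values, gamma=0.5, lam=0.0)
monte_carlo = lambda_returns(rewards, np.zeros(4), gamma=0.5, lam=1.0)
```

Setting `lam=0` recovers one-step TD targets and `lam=1` recovers the discounted sum of imagined rewards plus the final bootstrap; intermediate values trade bias against variance.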
SVG (Stochastic Value Gradient)
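SVG (Heess et al., 2015) differentiates through the learned dynamics model to compute policy gradients. With reparameterized actions \(a_{t+k} = \pi_\phi(\hat{s}_{t+k}, \epsilon_{t+k})\) and model rollouts \(\hat{s}_{t+k+1} = f_\theta(\hat{s}_{t+k}, a_{t+k})\), the gradient takes the pathwise form:
\[\nabla_\phi J(\phi) = \mathbb{E}_{\epsilon}\left[\sum_{k=0}^{H} \gamma^k \, \nabla_\phi \, r_\theta(\hat{s}_{t+k}, a_{t+k})\right]\]
where each \(\nabla_\phi\) is a total derivative, expanded by the chain rule through both the policy and the model. This pathwise gradient typically has lower variance than REINFORCE-style (score-function) estimators, which use only sampled returns.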
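As a worked example of differentiating through a model, the sketch below uses a hypothetical scalar linear model \(s' = A s + B a\), a linear policy \(a = k s + \epsilon\) with the noise held fixed (the reparameterization idea), and a quadratic reward. The derivative is accumulated by hand with the chain rule in forward mode, which yields the same value that backpropagation in an autodiff framework would; all constants are arbitrary assumptions.

```python
import numpy as np

A, B, GAMMA, H = 0.9, 0.5, 0.95, 10  # hypothetical model and horizon

def rollout_return(k_gain, s0, noise):
    """Discounted return of the policy a = k*s + noise under the model."""
    s, total = s0, 0.0
    for t in range(H):
        a = k_gain * s + noise[t]
        total += GAMMA ** t * -(s ** 2 + 0.1 * a ** 2)
        s = A * s + B * a
    return total

def pathwise_gradient(k_gain, s0, noise):
    """d(return)/d(k_gain), via the chain rule through policy and model."""
    s, ds_dk, grad = s0, 0.0, 0.0
    for t in range(H):
        a = k_gain * s + noise[t]
        da_dk = s + k_gain * ds_dk                 # policy depends on k directly and via s
        grad += GAMMA ** t * -(2.0 * s * ds_dk + 0.2 * a * da_dk)
        ds_dk = A * ds_dk + B * da_dk              # sensitivity of the next state
        s = A * s + B * a
    return grad

noise = np.random.default_rng(0).normal(scale=0.1, size=H)
grad = pathwise_gradient(-0.5, 1.0, noise)

# Sanity check against a central finite difference with the same noise.
eps = 1e-5
finite_diff = (
    rollout_return(-0.5 + eps, 1.0, noise) - rollout_return(-0.5 - eps, 1.0, noise)
) / (2.0 * eps)
```

Averaging `pathwise_gradient` over many noise draws gives a Monte Carlo estimate of the expectation in the pathwise gradient; with long horizons the repeated multiplication by model Jacobians is what can make such gradients explode or vanish.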
Video Prediction as Planning
Recent work uses video prediction models directly for planning:
UniPi (Du et al., 2023):
- Use a text-conditioned video diffusion model to generate future video
- Extract actions from the generated video using an inverse dynamics model
- The video model serves as both the world model and the planner
This approach leverages the rich world knowledge in large video models.
Comparison of Planning Approaches
| Method | When to Use | Strengths | Weaknesses |
|---|---|---|---|
| Random Shooting | Prototyping | Simplest | Inefficient |
| CEM | Continuous MPC | Good balance of quality/speed | Limited horizon |
| MCTS | Discrete, deep reasoning | Approaches optimal play as the search budget grows | Expensive; needs discrete (or discretized) actions |
| Imagination (Dreamer) | Real-time control | Fast at test time, continuous | Needs good model |
| SVG | Differentiable models | Low variance gradients | Exploding/vanishing gradients |
| Video-based (UniPi) | Rich visual tasks | Leverages pretrained models | Slow, coarse |
Key References
- Chua, K., et al. (2018). "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models." NeurIPS.
- Hafner, D., et al. (2019). "Learning Latent Dynamics for Planning from Pixels." ICML.
- Silver, D., et al. (2018). "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go Through Self-Play." Science.
- Schrittwieser, J., et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model." Nature.
- Hafner, D., et al. (2020). "Dream to Control: Learning Behaviors by Latent Imagination." ICLR.
- Heess, N., et al. (2015). "Learning Continuous Control Policies by Stochastic Value Gradients." NeurIPS.
- Du, Y., et al. (2023). "Learning Universal Policies via Text-Guided Video Generation." NeurIPS.