Planning with World Models

Given a learned world model, how do we use it for decision-making? This page covers planning algorithms that leverage world models — from simple shooting methods to sophisticated tree search and imagination-based policy learning.

The Planning Problem

Given a world model \(f_\theta\) (transition dynamics) and a reward model \(r_\theta\), find the action sequence over a planning horizon \(H\) that maximizes the predicted discounted return, with discount factor \(\gamma \in (0, 1]\):

\[ a_{t:t+H-1}^* = \arg\max_{a_{t:t+H-1}} \sum_{k=0}^{H-1} \gamma^k \hat{r}_{t+k} \]

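where the rollout starts from the current state, \(\hat{s}_t = s_t\), with \(\hat{s}_{t+k+1} = f_\theta(\hat{s}_{t+k}, a_{t+k})\) and \(\hat{r}_{t+k} = r_\theta(\hat{s}_{t+k}, a_{t+k})\). If the model is stochastic, the sum is replaced by its expectation over sampled rollouts.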
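As a minimal sketch of how this objective is evaluated for one candidate action sequence, the function below rolls a sequence through a model and accumulates the discounted predicted reward. The names `rollout_return`, `f`, and `r` are illustrative assumptions, and the toy dynamics stand in for learned networks.

```python
def rollout_return(f, r, s0, actions, gamma=0.99):
    """Discounted predicted return of one action sequence under the model (f, r)."""
    s, total = s0, 0.0
    for k, a in enumerate(actions):
        total += (gamma ** k) * r(s, a)  # predicted reward at step t+k
        s = f(s, a)                      # predicted next state
    return total

# Toy stand-ins for learned models (hypothetical): a 1-D point that should reach 0.
f = lambda s, a: s + a
r = lambda s, a: -(s ** 2)

ret = rollout_return(f, r, s0=1.0, actions=[-0.5, -0.5, 0.0], gamma=0.9)
```

Every planner on this page can be read as a different strategy for searching over `actions` using a scoring function of this kind.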

Random Shooting

The simplest approach: sample many random action sequences and pick the best one.

Sample \(N\) action sequences: \(\{a_{t:t+H-1}^{(i)}\}_{i=1}^{N}\)
Roll out each through the model to get predicted returns
Execute the first action of the best sequence

Pros: Simple, parallelizable
Cons: Inefficient in high-dimensional action spaces, scales poorly with horizon
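The three steps above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a reference implementation: actions are assumed to lie in \([-1, 1]\), and `f` and `r` are toy stand-ins for the learned dynamics and reward models.

```python
import numpy as np

def random_shooting(f, r, s0, horizon, n_samples, action_dim, gamma=0.99, rng=None):
    """Return the first action of the best of n_samples random action sequences."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: sample N candidate sequences, each of shape (horizon, action_dim).
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, action_dim))
    returns = np.zeros(n_samples)
    # Step 2: roll each candidate through the model.
    for i in range(n_samples):
        s = s0
        for k in range(horizon):
            returns[i] += gamma ** k * r(s, candidates[i, k])
            s = f(s, candidates[i, k])
    # Step 3: keep only the first action of the best sequence.
    best = int(np.argmax(returns))
    return candidates[best, 0], returns[best]

# Toy model (hypothetical): slow 1-D integrator, reward is negative squared distance to 0.
f = lambda s, a: s + 0.1 * a
r = lambda s, a: -float(np.sum(s ** 2))

action, best_return = random_shooting(
    f, r, s0=np.array([1.0]), horizon=5, n_samples=256, action_dim=1,
    rng=np.random.default_rng(0),
)
```

The inner loops are written for clarity; in practice the rollouts are usually batched so that all candidates advance through the model in one call.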

Cross-Entropy Method (CEM)

CEM iteratively refines the action distribution:

Pseudocode: CEM Planning

Initialize action distribution: μ, σ (e.g., μ = 0, σ = 1 for every timestep and action dimension)
for iteration = 1, ..., I do
    Sample N action sequences from N(μ, σ²)
    Roll out each through the model, compute returns
    Select top-K sequences (elites)
    Update μ, σ to fit the elite set
end for
Return μ (or the best elite sequence)

CEM is widely used in model-based RL (e.g., PlaNet, PETS) due to its simplicity and effectiveness.

Key parameters:

Population size \(N\) (typically 500-1000)
Elite fraction (typically top 10%)
Number of iterations \(I\) (typically 5-10)
Planning horizon \(H\) (typically 5-30 steps)
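The pseudocode and parameters above translate fairly directly into NumPy. The sketch below assumes a diagonal Gaussian over action sequences, actions clipped to \([-1, 1]\), and toy stand-ins `f` and `r` for the learned models; the function names and the small variance floor are illustrative choices rather than part of any particular published implementation.

```python
import numpy as np

def rollout(f, r, s0, actions, gamma):
    """Discounted predicted return of one action sequence under the model."""
    s, total = s0, 0.0
    for k, a in enumerate(actions):
        total += gamma ** k * r(s, a)
        s = f(s, a)
    return total

def cem_plan(f, r, s0, horizon, action_dim, n_samples=500, elite_frac=0.1,
             n_iters=5, gamma=0.99, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n_elites = max(1, int(elite_frac * n_samples))
    mu = np.zeros((horizon, action_dim))     # initial mean
    sigma = np.ones((horizon, action_dim))   # initial standard deviation
    for _ in range(n_iters):
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        samples = np.clip(mu + sigma * noise, -1.0, 1.0)
        returns = np.array([rollout(f, r, s0, seq, gamma) for seq in samples])
        elites = samples[np.argsort(returns)[-n_elites:]]  # top-K by return
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6    # floor keeps sampling from collapsing
    return mu

# Toy model (hypothetical): slow 1-D integrator, reward is negative squared distance to 0.
f = lambda s, a: s + 0.1 * a
r = lambda s, a: -float(np.sum(s ** 2))
s0 = np.array([1.0])

plan = cem_plan(f, r, s0, horizon=5, action_dim=1, rng=np.random.default_rng(0))
```

Returning the mean rather than the single best elite tends to give smoother plans; either choice is consistent with the pseudocode.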

Model Predictive Control (MPC)

MPC is a receding horizon approach:

At each real timestep, plan \(H\) steps ahead using the model
Execute only the first action of the planned sequence
Re-plan at the next timestep with updated state

graph LR
    S[Current State] --> P[Plan H steps]
    P --> E[Execute first action]
    E --> O[Observe new state]
    O --> S

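MPC mitigates model errors: because it re-plans from the observed state at every timestep, prediction errors cannot compound across real timesteps, only within a single planning horizon.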

Used by: PETS (Chua et al., 2018), PlaNet (Hafner et al., 2019)
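The receding-horizon loop can be illustrated with a deliberately mismatched model. In the sketch below, the planner's model omits a constant drift that the "true" environment has; `greedy_planner`, `true_env_step`, and the drift value are all hypothetical and chosen only to make the effect of re-planning visible. Any planner (for example CEM) could be substituted for `greedy_planner`.

```python
import numpy as np

MODEL_GAIN = 0.1  # the (imperfect) model assumes s' = s + 0.1 * a, with no drift

def greedy_planner(s, horizon=5):
    """Plan `horizon` actions that drive the model's predicted state toward 0."""
    plan, s_hat = [], s
    for _ in range(horizon):
        a = float(np.clip(-s_hat / MODEL_GAIN, -1.0, 1.0))
        plan.append(a)
        s_hat = s_hat + MODEL_GAIN * a  # model prediction, not the real state
    return np.array(plan)

def true_env_step(s, a):
    """Real dynamics include a drift of +0.02 that the model does not know about."""
    return s + 0.1 * a + 0.02

def mpc_control(env_step, planner, s0, n_steps):
    s, trajectory = s0, [s0]
    for _ in range(n_steps):
        plan = planner(s)       # plan H steps ahead from the current true state
        a = plan[0]             # execute only the first planned action
        s = env_step(s, a)      # observe the real next state
        trajectory.append(s)
    return trajectory

trajectory = mpc_control(true_env_step, greedy_planner, s0=1.0, n_steps=20)
```

In this toy setting the state settles within one step's drift of the target, whereas executing a single long open-loop plan would let the unmodeled drift accumulate at every step.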

PETS (Probabilistic Ensembles with Trajectory Sampling)

PETS combines ensemble models with CEM planning:

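Train an ensemble of \(B\) probabilistic dynamics models, each predicting a distribution over the next state
For each CEM candidate, propagate a set of particles by trajectory sampling: each particle is assigned an ensemble member, which is either resampled at every timestep (TS1) or kept fixed for the whole rollout (TS∞)
This propagates both aleatoric (environment) and epistemic (model) uncertainty through the plan: each member's predicted distribution captures the former, and disagreement between members captures the latter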
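A minimal sketch of the per-timestep variant of trajectory sampling follows. It assumes each ensemble member is a callable returning the mean and standard deviation of a Gaussian over the next state; `make_member` builds hypothetical toy members that disagree about the action gain, standing in for trained networks.

```python
import numpy as np

def make_member(gain, noise_std=0.01):
    """Hypothetical ensemble member: Gaussian next-state prediction (mean, std)."""
    def member(states, action):
        mean = states + gain * action
        return mean, np.full_like(mean, noise_std)
    return member

def ts1_rollout(ensemble, s0, actions, n_particles, rng):
    """Propagate particles, re-drawing each particle's ensemble member every step."""
    states = np.repeat(s0[None, :], n_particles, axis=0)  # (P, state_dim)
    for a in actions:
        idx = rng.integers(len(ensemble), size=n_particles)  # member per particle
        next_states = np.empty_like(states)
        for b, member in enumerate(ensemble):
            mask = idx == b
            if mask.any():
                mean, std = member(states[mask], a)
                # Sampling from the predicted distribution carries aleatoric noise;
                # mixing members across particles carries epistemic uncertainty.
                next_states[mask] = mean + std * rng.standard_normal(mean.shape)
        states = next_states
    return states

rng = np.random.default_rng(0)
ensemble = [make_member(g) for g in (0.08, 0.10, 0.12)]
actions = np.ones((10, 1))
particles = ts1_rollout(ensemble, np.zeros(1), actions, n_particles=20, rng=rng)
```

The spread of `particles` at the end of the rollout gives a rough picture of how uncertain the model is about where the action sequence leads; a planner can average predicted returns over particles when scoring a candidate.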

Monte Carlo Tree Search (MCTS)

MCTS builds a search tree by balancing exploration and exploitation:

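Selection: Traverse the tree from the root using UCT (Upper Confidence bounds applied to Trees) until reaching a node that is not fully expanded
Expansion: Add a new child node
Simulation: Roll out a default policy from the new node to estimate its value
Backpropagation: Update visit counts and value estimates along the path back to the root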
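The selection step can be made concrete with the UCT score, which adds an exploration bonus to each child's mean value. The sketch below assumes children are stored as plain dictionaries with `visits` and `value_sum` fields and approximates the parent's visit count by the sum over children; both are illustrative choices.

```python
import math

def uct_select(children, c=1.414):
    """Return the index of the child with the highest UCT score.

    children: list of dicts with 'visits' and 'value_sum' (hypothetical layout).
    c: exploration constant; larger values favor rarely visited children.
    """
    total_visits = sum(ch["visits"] for ch in children)
    best_index, best_score = None, -math.inf
    for i, ch in enumerate(children):
        if ch["visits"] == 0:
            return i  # always try unvisited children first
        mean_value = ch["value_sum"] / ch["visits"]
        bonus = c * math.sqrt(math.log(total_visits) / ch["visits"])
        if mean_value + bonus > best_score:
            best_index, best_score = i, mean_value + bonus
    return best_index

children = [
    {"visits": 10, "value_sum": 6.0},  # mean 0.6, well explored
    {"visits": 2, "value_sum": 1.0},   # mean 0.5, barely explored
]
explore_choice = uct_select(children)        # bonus outweighs the lower mean
exploit_choice = uct_select(children, c=0.0) # pure exploitation
```

With the default constant the less-visited child is selected despite its lower mean value, which is the exploration-exploitation balance the section describes.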

MCTS with Learned Models

AlphaZero (Silver et al., 2018): MCTS over the known game rules, guided by a learned neural network that provides:

Prior policy \(p(a|s)\) for guiding search
Value estimate \(v(s)\) for evaluating leaf nodes, replacing random rollouts
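To show where the prior and the value estimate enter the search, the selection rule used in the AlphaGo Zero and AlphaZero line of work can be written as follows. Here \(Q(s,a)\) is the mean of the backed-up values \(v\), \(N(s,a)\) is the visit count, and \(c_{\text{puct}}\) is an exploration constant; treat the exact form as a sketch, since published variants differ in how the constant is scheduled.

\[ a^* = \arg\max_a \left[ Q(s,a) + c_{\text{puct}} \, p(a|s) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right] \]

Compared with plain UCT, the exploration bonus is scaled by the prior \(p(a|s)\), so the search spends its early visits on actions the network already considers promising.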

MuZero (Schrittwieser et al., 2020): MCTS with a fully learned model:

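No access to the true environment rules during search
Learned latent dynamics model for tree expansion, trained to predict rewards, values, and policies rather than to reconstruct observations
Matched AlphaZero's superhuman performance in Go, chess, and shogi, and set a new state of the art on the Atari benchmark at publication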

MCTS vs. Shooting Methods

Aspect	Shooting / CEM	MCTS
Action space	Continuous (natural)	Discrete (natural)
Planning depth	Moderate (5-30 steps)	Variable (tree grows deeper along promising branches)
Computation	Moderate	High
Best for	Continuous control	Games, structured problems

Imagination-Based Policy Learning

Instead of planning online (at every timestep), learn a policy by training on imagined rollouts from the model. This amortizes the planning cost into the policy network.

Dreamer's Approach

Dreamer (Hafner et al., 2020) trains actor-critic networks purely in imagination:

World model generates imagined trajectories in latent space: \(z_{t+1} = \text{dyn}_\theta(z_t, a_t)\)
Critic estimates values along imagined trajectories: \(V_\psi(z_t) \approx \mathbb{E}\left[\sum_{k=0}^{H} \gamma^k r_{t+k}\right]\)
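Actor is updated to maximize imagined returns: \(\max_\phi \mathbb{E}\left[\sum_{k=0}^{H} \gamma^k \left(r_{t+k} + \eta \mathcal{H}(\pi_\phi(\cdot|z_{t+k}))\right)\right]\) (the entropy bonus with weight \(\eta\) was added in DreamerV2)
Gradients are backpropagated through the imagined trajectory via the learned dynamics; DreamerV2/V3 use straight-through gradients for their discrete latent variables and rely mainly on REINFORCE-style gradients when actions are discrete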

Key advantage: After training, the policy runs in real-time — no expensive search at decision time.
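One detail worth making concrete is how rewards and critic values along an imagined trajectory are combined into a training target. The Dreamer papers use λ-returns for this; the recursive form below is one common way to write them, offered as a sketch with assumed argument names and shapes rather than a copy of any released implementation.

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Lambda-returns along one imagined trajectory.

    rewards: shape (H,), predicted rewards for steps t .. t+H-1
    values:  shape (H+1,), critic estimates for latent states z_t .. z_{t+H}
    """
    horizon = len(rewards)
    returns = np.zeros(horizon)
    next_return = values[-1]  # bootstrap from the critic at the end of the rollout
    for t in reversed(range(horizon)):
        # Blend the one-step bootstrap (weight 1 - lam) with the longer return (weight lam).
        next_return = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        returns[t] = next_return
    return returns

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.0, 0.0, 0.0, 10.0])
full_returns = lambda_returns(rewards, values, gamma=0.5, lam=1.0)
td_targets = lambda_returns(rewards, values, gamma=0.5, lam=0.0)
```

Setting `lam=1.0` recovers the discounted sum of imagined rewards plus a bootstrapped value at the horizon, while `lam=0.0` reduces to one-step temporal-difference targets; intermediate values trade bias against variance.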

SVG (Stochastic Value Gradient)

SVG (Heess et al., 2015): Differentiates through the learned dynamics model to compute policy gradients:

\[ \nabla_\phi J \approx \nabla_\phi \sum_{t=0}^{H} r(s_t, a_t), \quad \text{where } s_{t+1} = f_\theta(s_t, a_t), \; a_t = \pi_\phi(s_t) \]

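Because the gradient flows through the model along the trajectory (a pathwise gradient, unlike score-function estimators such as REINFORCE), its variance is typically lower, at the cost of bias from model error.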
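To make "differentiating through the model" concrete, the sketch below applies the chain rule by hand to a deterministic scalar system with a linear policy \(a_t = -k s_t\). Everything here (the linear dynamics, the quadratic reward, the names `A`, `B`, `rho`) is an assumption chosen so the derivative can be checked numerically; SVG itself handles stochastic models and policies through reparameterization and uses automatic differentiation.

```python
def return_and_grad(k, s0=1.0, A=0.9, B=0.5, rho=0.1, horizon=10):
    """Return J(k) and dJ/dk for s' = A*s + B*u, u = -k*s, r = -(s^2 + rho*u^2)."""
    s, ds = s0, 0.0          # state and its derivative with respect to k
    J, dJ = 0.0, 0.0
    for _ in range(horizon):
        u = -k * s
        du = -s - k * ds     # chain rule through the policy
        J += -(s ** 2 + rho * u ** 2)
        dJ += -(2 * s * ds + 2 * rho * u * du)
        # Chain rule through the dynamics model.
        s, ds = A * s + B * u, A * ds + B * du
    return J, dJ

k = 0.3
J, grad = return_and_grad(k)

# Finite-difference check of the pathwise gradient.
eps = 1e-6
numeric = (return_and_grad(k + eps)[0] - return_and_grad(k - eps)[0]) / (2 * eps)
```

The product of per-step Jacobians accumulated in `ds` is also the source of the exploding or vanishing gradients that can appear over long horizons.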

Video Prediction as Planning

Recent work uses video prediction models directly for planning:

UniPi (Du et al., 2023):

Use a text-conditioned video diffusion model to generate future video
Extract actions from the generated video using an inverse dynamics model
The video model serves as both the world model and the planner

This approach leverages the rich world knowledge in large video models.
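The two-stage structure (generate a video, then recover actions from it) can be sketched with toy stand-ins. Nothing below reflects UniPi's actual models or interfaces: `toy_video_model` and `toy_inverse_dynamics` are hypothetical placeholders in which a "frame" is just a 2-D position, used only to show how the pieces compose.

```python
import numpy as np

def plan_with_video(video_model, inverse_dynamics, first_frame, instruction, n_frames):
    """Generate an imagined frame sequence, then label each transition with an action."""
    frames = video_model(first_frame, instruction, n_frames)
    return [inverse_dynamics(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]

def toy_video_model(first_frame, instruction, n_frames):
    """Hypothetical generator: interpolate from the first frame toward a text-defined goal."""
    offsets = {"go right": np.array([1.0, 0.0]), "go up": np.array([0.0, 1.0])}
    alphas = np.linspace(0.0, 1.0, n_frames)[:, None]
    return first_frame + alphas * offsets[instruction]

def toy_inverse_dynamics(frame, next_frame):
    """Hypothetical inverse dynamics: the action is the displacement between frames."""
    return next_frame - frame

actions = plan_with_video(
    toy_video_model, toy_inverse_dynamics,
    first_frame=np.zeros(2), instruction="go right", n_frames=5,
)
```

The point of the decomposition is that the video model needs no action labels to train, while the inverse dynamics model can be comparatively small and trained on a modest amount of robot-specific data.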

Comparison of Planning Approaches

Method	When to Use	Strengths	Weaknesses
Random Shooting	Prototyping	Simplest	Inefficient
CEM	Continuous MPC	Good balance of quality/speed	Limited horizon
MCTS	Discrete, deep reasoning	Improves steadily with search budget	Expensive; best suited to discrete actions
Imagination (Dreamer)	Real-time control	Fast at test time, continuous	Needs good model
SVG	Differentiable models	Low variance gradients	Exploding/vanishing gradients
Video-based (UniPi)	Rich visual tasks	Leverages pretrained models	Slow, coarse

Key References

Chua, K., et al. (2018). "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models." NeurIPS.
Hafner, D., et al. (2019). "Learning Latent Dynamics for Planning from Pixels." ICML.
Silver, D., et al. (2018). "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go Through Self-Play." Science.
Schrittwieser, J., et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model." Nature.
Hafner, D., et al. (2020). "Dream to Control: Learning Behaviors by Latent Imagination." ICLR.
Du, Y., et al. (2023). "Learning Universal Policies via Text-Guided Video Generation." NeurIPS.