What Are World Models?
A world model is a learned internal representation of environment dynamics that allows an agent to predict future states, reason about consequences of actions, and plan without direct interaction.
Formal Definition
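A world model can be described as a learned function (or set of functions) that maps the current state and action to a predicted next state and reward:
\[ \hat{s}_{t+1}, \hat{r}_t = f_\theta(s_t, a_t) \]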
More generally, in the latent space formulation:
- Encoder: \(z_t = \text{enc}_\theta(o_t)\) — maps observations to latent states
- Dynamics: \(z_{t+1} = \text{dyn}_\theta(z_t, a_t)\) — predicts next latent state
- Decoder: \(\hat{o}_t = \text{dec}_\theta(z_t)\) — reconstructs observations (optional)
- Reward predictor: \(\hat{r}_t = \text{rew}_\theta(z_t, a_t)\) — predicts reward
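The four components above can be sketched as a small interface. This is a minimal illustration, not a trained model: the class name `LinearWorldModel`, the `imagine` method, and the fixed linear weights are all assumptions made for the example, standing in for learned networks \(\text{enc}_\theta\), \(\text{dyn}_\theta\), \(\text{dec}_\theta\), and \(\text{rew}_\theta\).

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vector = List[float]


def matvec(matrix: Sequence[Sequence[float]], vec: Sequence[float]) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


@dataclass
class LinearWorldModel:
    """Toy latent world model with fixed linear maps (hypothetical, untrained)."""
    enc_w: List[Vector]  # observation -> latent
    dyn_w: List[Vector]  # [latent, action] -> next latent
    dec_w: List[Vector]  # latent -> reconstructed observation
    rew_w: Vector        # [latent, action] -> scalar reward

    def encode(self, obs: Vector) -> Vector:
        return matvec(self.enc_w, obs)

    def dynamics(self, z: Vector, action: Vector) -> Vector:
        return matvec(self.dyn_w, z + action)

    def decode(self, z: Vector) -> Vector:
        return matvec(self.dec_w, z)

    def reward(self, z: Vector, action: Vector) -> float:
        return sum(w * x for w, x in zip(self.rew_w, z + action))

    def imagine(self, obs: Vector, actions: Sequence[Vector]) -> Tuple[List[Vector], List[float]]:
        """Encode once, then roll the dynamics forward without touching the environment."""
        z = self.encode(obs)
        latents, rewards = [z], []
        for action in actions:
            rewards.append(self.reward(z, action))
            z = self.dynamics(z, action)
            latents.append(z)
        return latents, rewards


model = LinearWorldModel(
    enc_w=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],    # 3-dim observation -> 2-dim latent
    dyn_w=[[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]],    # position/velocity style update
    dec_w=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],  # 2-dim latent -> 3-dim observation
    rew_w=[-1.0, 0.0, 0.0],                      # reward = -(first latent coordinate)
)
latents, rewards = model.imagine([1.0, 0.0, 5.0], [[1.0], [1.0], [0.0]])
reconstruction = model.decode(latents[-1])
```

Note that `imagine` calls the encoder only once. Every later latent comes from the dynamics model alone, which is what allows planning without further interaction. The decoder is not needed for the rollout itself, consistent with it being optional.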
The Cognitive Science Perspective
The concept of "mental models" has deep roots in cognitive science:
- Kenneth Craik (1943): Proposed that organisms carry "small-scale models" of the external world, used for prediction and planning
- Predictive processing: A family of theories in which the brain acts as a prediction machine, constantly generating predictions about sensory input and updating them from prediction errors
- Mental simulation: Humans can "imagine" the consequences of actions before taking them
World models in AI formalize this intuition: equip artificial agents with the ability to simulate the consequences of their actions internally.
Types of World Models
By Prediction Space
| Type | Predicts | Examples |
|---|---|---|
| Observation-space | Raw pixels/observations \(\hat{o}_{t+1}\) | SVG, SV2P, FitVid |
| Latent-space | Compact latent state \(z_{t+1}\) | Dreamer, RSSM, JEPA |
| Reward/value only | Reward \(\hat{r}_t\) and/or value \(\hat{v}_t\) | MuZero |
By Architecture
| Architecture | Description | Examples |
|---|---|---|
| RNN-based | Recurrent dynamics in latent space | World Models (Ha & Schmidhuber), RSSM |
| Transformer-based | Sequence model over state-action tokens | IRIS, TransDreamer, Genie |
| Diffusion-based | Denoising diffusion for future prediction | UniSim, DIAMOND |
| State-space models | Structured state-space layers | S4WM |
By Scope
| Scope | Description | Examples |
|---|---|---|
| Task-specific | Trained on one environment | Dreamer on DMControl |
| Domain-specific | Trained on one domain (e.g., driving) | MILE, GAIA-1 |
| Foundation | Trained on diverse data, generalizes broadly | Genie, UniSim |
Core Challenges
1. Compounding Error
Small prediction errors accumulate over long rollouts, because each predicted state is fed back as the input to the next prediction.
This limits the useful prediction horizon and is addressed via:
- Short rollouts (MBPO)
- Latent-space prediction (errors accumulate in a compact latent state rather than in high-dimensional observations)
- Ensemble disagreement (quantify and manage uncertainty)
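A toy experiment makes the effect concrete. The sketch below assumes a scalar system \(s_{t+1} = 1.02\, s_t\) and a hypothetical ensemble of five models whose gains are each slightly wrong. Neither the system nor the names `TRUE_GAIN` and `rollout` come from a particular method. The example only illustrates how open-loop error and ensemble disagreement both grow with the horizon.

```python
import random
import statistics

TRUE_GAIN = 1.02  # assumed toy dynamics: s_{t+1} = TRUE_GAIN * s_t


def rollout(gain: float, s0: float, horizon: int) -> list:
    """Open-loop rollout: each prediction is fed back as the next input."""
    states = [s0]
    for _ in range(horizon):
        states.append(gain * states[-1])
    return states


rng = random.Random(0)
# Hypothetical ensemble: each member's gain is off by a small random amount.
ensemble_gains = [TRUE_GAIN + rng.gauss(0.0, 0.01) for _ in range(5)]

horizon = 50
truth = rollout(TRUE_GAIN, 1.0, horizon)
member_rollouts = [rollout(g, 1.0, horizon) for g in ensemble_gains]

# Absolute error of the first member, and spread across members, at each step.
errors = [abs(pred - true) for pred, true in zip(member_rollouts[0], truth)]
disagreement = [statistics.pstdev(step) for step in zip(*member_rollouts)]

for h in (1, 10, 50):
    print(f"h={h:2d}  error={errors[h]:.4f}  disagreement={disagreement[h]:.4f}")
```

The disagreement can be computed without access to the true system, so it can serve as a signal for when to stop trusting a rollout, for example by truncating imagined trajectories once it exceeds a threshold.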
2. Partial Observability
Real environments are often partially observable: a single observation does not reveal the full state. World models must therefore infer the latent state from the history of observations and actions:
\[ z_t = f_\theta(o_{1:t}, a_{1:t-1}) \]
This is typically handled with recurrent architectures (GRU, LSTM, RSSM).
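The role of the recurrent state can be illustrated with a hand-written update rule. In this sketch, which assumes a point moving at constant velocity where only the position is observed, a fixed-gain filter stands in for a learned GRU or RSSM cell. The function names and the gains `alpha` and `beta` are choices made for the example.

```python
from typing import List, Tuple

State = Tuple[float, float]  # (position estimate, velocity estimate)


def recurrent_update(state: State, obs: float, alpha: float = 0.5, beta: float = 0.3) -> State:
    """One recurrent step: predict forward, then correct with the new observation."""
    pos, vel = state
    predicted_pos = pos + vel       # one-step prediction (time step of 1)
    residual = obs - predicted_pos  # how much the observation surprised the model
    return predicted_pos + alpha * residual, vel + beta * residual


def infer_state(observations: List[float]) -> State:
    """Fold the whole observation history into a single latent state."""
    state: State = (0.0, 0.0)
    for obs in observations:
        state = recurrent_update(state, obs)
    return state


TRUE_VELOCITY = 0.5
observations = [10.0 + TRUE_VELOCITY * t for t in range(40)]  # positions only

one_obs_state = infer_state(observations[:1])
full_state = infer_state(observations)
print(f"velocity estimate after 1 observation:   {one_obs_state[1]:.3f}")
print(f"velocity estimate after 40 observations: {full_state[1]:.3f}")
```

Velocity never appears in any observation, yet the recurrent state recovers it by accumulating evidence across steps. A learned recurrent model plays the same role, with the update rule fitted from data instead of fixed by hand.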
3. Multi-Modal Futures
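The future is often stochastic: multiple outcomes are possible from the same state and action. Deterministic models trained with a squared-error loss regress toward the average of those outcomes, which may not correspond to any plausible future (blurry frames in video prediction are a typical symptom). Stochastic models instead aim to capture the full distribution of possible futures.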
Approaches:
- VAE-based: latent stochastic variables \(z \sim q(z|o)\)
- Discrete tokens: categorical distributions (DreamerV2)
- Diffusion models: generate diverse samples
- Mixture models: explicitly model multiple modes
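The difference between a mean prediction and a distributional one shows up even in a two-outcome toy problem. The sketch below assumes an environment where the next state is -1.0 or +1.0 with equal probability. The setup and the helper `sample_future` are illustrative and not taken from any of the methods listed above.

```python
import random
from collections import Counter

rng = random.Random(0)

# Assumed toy environment: from the same state and action, the next state is
# -1.0 or +1.0 with equal probability (e.g. an object tipping left or right).
outcomes = [rng.choice([-1.0, 1.0]) for _ in range(10_000)]

# The squared-error-optimal deterministic prediction is the sample mean.
mean_prediction = sum(outcomes) / len(outcomes)

# A categorical model instead estimates a probability for each discrete outcome.
counts = Counter(outcomes)
categorical = {value: n / len(outcomes) for value, n in counts.items()}


def sample_future() -> float:
    """Draw one plausible future from the categorical model."""
    values = list(categorical)
    weights = [categorical[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


samples = [sample_future() for _ in range(5)]
print(f"mean prediction:   {mean_prediction:+.3f}")
print(f"categorical model: {categorical}")
print(f"sampled futures:   {samples}")
```

The mean prediction lands near zero, a state the environment never produces, while every sample from the categorical model is one of the two real outcomes.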
4. Long-Horizon Reasoning
Many tasks require reasoning over long time horizons (hundreds to thousands of steps). Key challenges:
- Compounding error over long rollouts
- Memory requirements
- Capturing long-range dependencies
A Brief History
| Year | Milestone |
|---|---|
| 1991 | Dyna (Sutton): model-based RL framework |
| 2011 | PILCO (Deisenroth & Rasmussen): Gaussian processes for model-based RL |
| 2018 | World Models (Ha & Schmidhuber): VAE+RNN, learning in dreams |
| 2019 | PlaNet / RSSM (Hafner et al.): recurrent state-space model |
| 2020 | Dreamer (Hafner et al.): actor-critic in latent imagination |
| 2020 | MuZero (Schrittwieser et al.): learned model + MCTS |
| 2021 | DreamerV2: discrete latent representations |
| 2023 | DreamerV3: one fixed set of hyperparameters across diverse domains |
| 2023-25 | Foundation world models: Genie, UniSim, DIAMOND, Cosmos |
What's Next
- Representation Learning — How to learn good latent spaces
- Video Prediction — Predicting visual futures
- Planning — Using world models for decision-making
- Foundation World Models — The frontier