Representation Learning for World Models¶

The quality of a world model depends critically on its latent representation — the internal space in which it models dynamics. This page covers the key approaches to learning representations suitable for world modeling.

Why Representation Learning?¶

Operating directly in observation space (e.g., raw pixels) is problematic:

High dimensionality: even a 64×64 RGB frame has 12,288 values, and most of them are redundant
Irrelevant information: background details, textures, and lighting often have no bearing on the dynamics
Prediction difficulty: predicting exact pixel values is extremely hard

A good representation should be:

Compact: low-dimensional, capturing only task-relevant information
Predictive: support accurate dynamics prediction
Disentangled: separate independent factors of variation
Controllable: capture aspects that the agent can influence

Reconstruction-Based Methods¶

Variational Autoencoder (VAE)¶

A common approach: learn a latent space by jointly training an encoder and a decoder to maximize the evidence lower bound (ELBO):

\[ \mathcal{L}_{\text{VAE}} = \mathbb{E}_{q(z|o)} [\log p(o|z)] - D_{\text{KL}}(q(z|o) \| p(z)) \]

Encoder \(q_\phi(z|o)\): maps observations to a distribution over latent states
Decoder \(p_\theta(o|z)\): reconstructs observations from latent states
Prior \(p(z)\): typically \(\mathcal{N}(0, I)\)
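
The two terms of the objective above can be sketched in NumPy. This is a minimal illustration, assuming a diagonal Gaussian encoder and a unit-variance Gaussian decoder (so the log-likelihood reduces to a squared error up to a constant); the function names are hypothetical and no actual encoder or decoder network is trained here.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def elbo(obs, recon, mu, log_var):
    """Single-sample ELBO estimate (assumed unit-variance Gaussian decoder).

    log p(o|z) is then -0.5 * ||o - recon||^2 up to an additive constant.
    """
    log_likelihood = -0.5 * np.sum((obs - recon) ** 2, axis=-1)
    return log_likelihood - gaussian_kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu = np.zeros((2, 4))          # stand-in for encoder means
log_var = np.zeros((2, 4))     # stand-in for encoder log-variances
z = reparameterize(mu, log_var, rng)
obs = np.ones((2, 8))
# Perfect reconstruction with posterior equal to the prior: both terms vanish.
print(elbo(obs, obs, mu, log_var))
```

In practice the negative of this quantity is minimized with a gradient-based optimizer, and `recon` would be the decoder output for the sampled `z`.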

Used by: World Models (Ha & Schmidhuber, 2018), PlaNet, Dreamer (v1)

Pros: Well-understood, produces smooth latent spaces, generates observations

Cons: May learn features irrelevant to dynamics, blurry reconstructions

VQ-VAE (Vector-Quantized VAE)¶

Replaces the continuous latent space with a discrete codebook:

\[ z_q = \text{codebook}[\arg\min_k \| z_e(o) - e_k \|] \]

where \(\{e_k\}\) is a learned set of embedding vectors.
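
The nearest-neighbor lookup above can be sketched as follows. This is only the forward quantization step, with a hand-written toy codebook; the straight-through gradient and the codebook and commitment losses used to train a VQ-VAE are not shown.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output to its nearest codebook vector (Euclidean distance)."""
    # Pairwise squared distances, shape (batch, num_codes).
    dists = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)
    return codebook[indices], indices

# Toy codebook with three 2-D embedding vectors e_k (values chosen for illustration).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.9, 1.2], [-0.2, 0.1]])
z_q, idx = quantize(z_e, codebook)
print(idx)  # [1 0]
```

The integer `idx` values are the discrete tokens that a sequence model can consume.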

Used by: IRIS, Genie. DreamerV2 also uses discrete latents, but as categorical variables trained with straight-through gradients rather than a VQ codebook.

Advantages for world models: discrete tokens can be modeled autoregressively by a Transformer dynamics model, and the quantized bottleneck is less prone to posterior collapse than a continuous VAE latent.

Self-Supervised Methods (No Reconstruction)¶

Contrastive Learning¶

Learn representations by pulling together positive pairs (augmented views of the same observation) and pushing apart negative pairs.

SimCLR / MoCo style:

\[ \mathcal{L}_{\text{contrastive}} = -\log \frac{\exp(\text{sim}(z_i, z_j^+) / \tau)}{\sum_{k} \exp(\text{sim}(z_i, z_k) / \tau)} \]

For RL: CURL (Laskin et al., 2020) applies contrastive learning to RL observations, showing significant improvement on pixel-based control tasks.
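
A minimal sketch of the contrastive loss above, assuming cosine similarity and in-batch negatives (row i of `positives` is the positive for row i of `anchors`, every other row is a negative). The function name and the random stand-in embeddings are illustrative; CURL itself uses a bilinear similarity and a momentum key encoder, which are not reproduced here.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss with in-batch negatives and cosine similarity."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                        # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    # The diagonal holds the log-probability assigned to each true positive.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
# Positives that are small perturbations of the anchors give a low loss.
aligned = info_nce(z, z + 0.01 * rng.standard_normal((8, 16)))
# Unrelated "positives" give a loss near log(batch_size).
shuffled = info_nce(z, rng.standard_normal((8, 16)))
print(aligned, shuffled)
```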

Joint Embedding Predictive Architecture (JEPA)¶

JEPA (LeCun, 2022; Assran et al., 2023) learns by predicting in embedding space rather than pixel space:

\[ \mathcal{L}_{\text{JEPA}} = \| \text{predictor}(z_x, \text{mask}) - \text{sg}[\bar{z}_y] \|^2 \]

where \(z_x\) is the context encoding, \(\bar{z}_y\) is the target encoding (from an EMA encoder), and \(\text{sg}\) denotes stop-gradient.
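
The loss and the EMA target update can be sketched as below. This is a simplified illustration, assuming linear encoders and a linear predictor and omitting the mask input; all names are hypothetical. NumPy has no automatic differentiation, so the stop-gradient is only conceptual here: the target branch is treated as a constant and is updated by EMA rather than by gradients.

```python
import numpy as np

def ema_update(target_params, online_params, decay=0.99):
    """Move target-encoder weights toward the online (context) encoder weights."""
    return decay * target_params + (1.0 - decay) * online_params

def jepa_loss(context, target, w_online, w_target, w_pred):
    """Mean squared error between predicted and target embeddings."""
    z_x = context @ w_online     # context embedding
    z_y = target @ w_target      # target embedding from the EMA encoder (constant)
    z_pred = z_x @ w_pred        # predictor output in embedding space
    return np.mean(np.sum((z_pred - z_y) ** 2, axis=-1))

rng = np.random.default_rng(0)
w_online = rng.standard_normal((6, 4))
w_target = w_online.copy()
w_pred = np.eye(4)
x = rng.standard_normal((5, 6))
# Identical inputs, identical encoders, identity predictor: the loss is zero.
print(jepa_loss(x, x, w_online, w_target, w_pred))
# After a (pretend) gradient step on the online encoder, the target follows slowly.
w_target = ema_update(w_target, w_online + 1.0, decay=0.9)
```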

Key advantages for world models:

Predicts in representation space (not pixel space) — focuses on semantic content
Avoids generating pixels — more computationally efficient
Can represent multi-modal futures when the predictor is conditioned on a latent variable, as proposed in LeCun (2022)

BYOL and Self-Predictive Methods¶

SPR (Schwarzer et al., 2021): Self-Predictive Representations for RL. Predicts future latent states:

\[ \mathcal{L}_{\text{SPR}} = \sum_{k=1}^{K} \| \text{proj}(\hat{z}_{t+k}) - \text{sg}[\text{proj}(\bar{z}_{t+k})] \|^2 \]

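This is a temporal version of BYOL: the prediction model rolls the latent state forward under the agent's actions, so it captures dynamics in latent space. SPR computes the error on L2-normalized projections, which makes each squared-error term equal to \(2 - 2\cos\) of the angle between the two vectors.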
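
A minimal sketch of the K-step loss above, assuming L2-normalized vectors and a toy additive transition function standing in for a learned latent dynamics model. The names `rollout`, `spr_loss`, and `transition` are hypothetical, and the projection heads are omitted.

```python
import numpy as np

def normalize(x):
    """Scale each vector to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def rollout(z0, actions, transition):
    """Predict K future latents by repeatedly applying a latent transition."""
    z, preds = z0, []
    for a in actions:
        z = transition(z, a)
        preds.append(z)
    return np.stack(preds)                     # (K, batch, dim)

def spr_loss(predicted, targets):
    """Sum over K steps of the squared error between normalized vectors.

    Targets are assumed to come from an EMA encoder and are treated as constants.
    """
    diff = normalize(predicted) - normalize(targets)
    per_step = np.mean(np.sum(diff ** 2, axis=-1), axis=-1)   # (K,)
    return np.sum(per_step)

z0 = np.ones((2, 3))
actions = [0.5 * np.ones((2, 3)) for _ in range(3)]          # K = 3
preds = rollout(z0, actions, lambda z, a: z + a)              # toy dynamics
print(spr_loss(preds, preds))    # matching targets: zero loss
print(spr_loss(preds, -preds))   # opposite targets: 4 per step, 12 in total
```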

Structured Representations¶

Object-Centric Representations¶

Decompose scenes into discrete objects with individual properties:

Slot Attention (Locatello et al., 2020): Learns to decompose scenes into \(K\) slots via iterative attention
SAVi (Kipf et al., 2022): Extends slot attention to video with temporal consistency
SLATE (Singh et al., 2022): Combines slot attention with discrete autoencoding
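
The competition between slots can be illustrated with a heavily simplified attention update. This sketch keeps only the core idea (a softmax over slots followed by a weighted mean over inputs) and omits the learned projections, GRU update, and MLP of the full Slot Attention module; the toy inputs and initial slots are made up.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified update in which slots compete for input features.

    slots: (K, D), inputs: (N, D).
    """
    d = slots.shape[-1]
    logits = inputs @ slots.T / np.sqrt(d)                      # (N, K)
    attn = softmax(logits, axis=1)                              # normalize over slots
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)    # weighted mean over inputs
    return weights.T @ inputs                                   # (K, D)

# Two clusters of input features, one along each axis.
inputs = np.array([[5.0, 0.0], [5.0, 0.2], [0.0, 5.0], [0.2, 5.0]])
slots = np.array([[1.0, 0.0], [0.0, 1.0]])
for _ in range(3):
    slots = slot_attention_step(slots, inputs)
print(slots)  # each slot settles near one cluster
```

Normalizing the attention over slots rather than over inputs is what makes the slots compete to explain each input feature.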

Why for world models: Physical interactions are fundamentally object-centric — objects collide, stack, fall. Object-centric representations can improve generalization and compositionality.

Graph-Based Representations¶

Represent the world as a graph where nodes are entities and edges are relations:

Graph Neural Network Simulators (Sanchez-Gonzalez et al., 2020): Learn to simulate physics by message-passing between particles/objects
C-SWM (Kipf et al., 2020): Contrastively-learned structured world models
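
One round of message passing between entities can be sketched as follows. This is a generic illustration with single-layer `tanh` updates and random weights, not the architecture of either cited model; the function and weight names are hypothetical.

```python
import numpy as np

def message_passing_step(nodes, edges, w_msg, w_node):
    """One round of message passing on a directed graph.

    nodes: (N, D) node features; edges: list of (sender, receiver) index pairs.
    w_msg and w_node: (2 * D, D) weight matrices.
    """
    senders = np.array([s for s, _ in edges])
    receivers = np.array([r for _, r in edges])
    # Each message depends on the features of both endpoints of the edge.
    edge_inputs = np.concatenate([nodes[senders], nodes[receivers]], axis=-1)
    messages = np.tanh(edge_inputs @ w_msg)                 # (E, D)
    # Sum the incoming messages for each receiver node.
    aggregated = np.zeros_like(nodes)
    np.add.at(aggregated, receivers, messages)
    # Residual update from the node's own state and its aggregated messages.
    node_inputs = np.concatenate([nodes, aggregated], axis=-1)
    return nodes + np.tanh(node_inputs @ w_node)

rng = np.random.default_rng(0)
nodes = rng.standard_normal((3, 4))
edges = [(0, 1), (1, 0), (1, 2)]
w_msg = 0.1 * rng.standard_normal((8, 4))
w_node = 0.1 * rng.standard_normal((8, 4))
updated = message_passing_step(nodes, edges, w_msg, w_node)
print(updated.shape)  # (3, 4)
```

Because the same weights are shared across all edges and nodes, the update applies unchanged to graphs with a different number of entities.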

The RSSM (Recurrent State-Space Model)¶

The RSSM (Hafner et al., 2019) is one of the most widely used latent dynamics models; it was introduced in PlaNet and is used across the Dreamer family. It combines deterministic and stochastic components:

Deterministic path \(h_t = f(h_{t-1}, z_{t-1}, a_{t-1})\): captures long-term dependencies via GRU
Stochastic state \(z_t \sim q(z_t | h_t, o_t)\): captures uncertainty and multi-modal futures
Prior \(z_t \sim p(z_t | h_t)\): predicted without observation (for imagination)

The combined state \((h_t, z_t)\) provides a rich representation for both prediction and planning.
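
The three components listed above can be sketched as a small NumPy cell. This is an illustration with random, untrained weights, a bias-free GRU, and diagonal Gaussian latents; the class name, layer shapes, and single-linear-layer heads are assumptions made for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyRSSM:
    """Illustrative RSSM cell with random, untrained weights (hypothetical names)."""

    def __init__(self, h_dim, z_dim, a_dim, o_dim, seed=0):
        self.rng = np.random.default_rng(seed)

        def init(rows, cols):
            return 0.1 * self.rng.standard_normal((rows, cols))

        in_dim = h_dim + z_dim + a_dim
        self.w_update = init(in_dim, h_dim)             # GRU update gate
        self.w_reset = init(in_dim, h_dim)              # GRU reset gate
        self.w_cand = init(in_dim, h_dim)               # GRU candidate state
        self.w_prior = init(h_dim, 2 * z_dim)           # p(z_t | h_t)
        self.w_post = init(h_dim + o_dim, 2 * z_dim)    # q(z_t | h_t, o_t)

    def deterministic(self, h, z, a):
        """h_t = f(h_{t-1}, z_{t-1}, a_{t-1}) as a bias-free GRU cell."""
        x = np.concatenate([h, z, a], axis=-1)
        update = sigmoid(x @ self.w_update)
        reset = sigmoid(x @ self.w_reset)
        x_reset = np.concatenate([reset * h, z, a], axis=-1)
        cand = np.tanh(x_reset @ self.w_cand)
        return (1.0 - update) * h + update * cand

    def _sample(self, stats):
        mean, log_std = np.split(stats, 2, axis=-1)
        return mean + np.exp(log_std) * self.rng.standard_normal(mean.shape)

    def prior(self, h):
        return self._sample(h @ self.w_prior)

    def posterior(self, h, o):
        return self._sample(np.concatenate([h, o], axis=-1) @ self.w_post)

model = TinyRSSM(h_dim=8, z_dim=4, a_dim=2, o_dim=6)
h, z = np.zeros((1, 8)), np.zeros((1, 4))
# Filtering: observations are available, so z_t is sampled from the posterior.
for _ in range(3):
    a, o = np.ones((1, 2)), np.ones((1, 6))
    h = model.deterministic(h, z, a)
    z = model.posterior(h, o)
# Imagination: no observations, so z_t is sampled from the prior.
for _ in range(5):
    h = model.deterministic(h, z, np.ones((1, 2)))
    z = model.prior(h)
state = np.concatenate([h, z], axis=-1)   # combined state (h_t, z_t)
print(state.shape)  # (1, 12)
```

The only difference between filtering and imagination is which distribution supplies the stochastic state; the deterministic path is identical in both loops.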

Training Objective¶

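The model is trained by maximizing a variational lower bound, shown here for a single time step \(t\) (the full objective sums over the sequence):

\[ \mathcal{L} = \underbrace{\mathbb{E}_q [\log p(o_t | h_t, z_t)]}_{\text{reconstruction}} + \underbrace{\mathbb{E}_q [\log p(r_t | h_t, z_t)]}_{\text{reward}} - \underbrace{\beta \, D_{\text{KL}}[q(z_t|h_t, o_t) \| p(z_t|h_t)]}_{\text{KL regularization}} \]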
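
A per-step version of this objective can be sketched as follows, assuming diagonal Gaussian posterior and prior and unit-variance Gaussian observation and reward heads. These distribution choices and the function names are assumptions for illustration; later Dreamer versions use categorical latents, which would need a different KL.

```python
import numpy as np

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) ) for diagonal Gaussians."""
    var_ratio = (std_q / std_p) ** 2
    mean_term = ((mu_q - mu_p) / std_p) ** 2
    return 0.5 * np.sum(var_ratio + mean_term - 1.0 - np.log(var_ratio), axis=-1)

def gaussian_log_prob(x, mean):
    """log N(x; mean, I), including the normalizing constant."""
    d = x.shape[-1]
    return -0.5 * np.sum((x - mean) ** 2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

def rssm_objective(obs, obs_pred, reward, reward_pred, posterior, prior, beta=1.0):
    """Per-step objective (to be maximized): reconstruction + reward - beta * KL."""
    recon = gaussian_log_prob(obs, obs_pred)
    rew = gaussian_log_prob(reward, reward_pred)
    kl = gaussian_kl(*posterior, *prior)
    return recon + rew - beta * kl

mu_q, std_q = np.array([[0.5, -0.5]]), np.ones((1, 2))   # posterior q(z_t | h_t, o_t)
mu_p, std_p = np.zeros((1, 2)), np.ones((1, 2))          # prior p(z_t | h_t)
obs, reward = np.zeros((1, 3)), np.zeros((1, 1))
value = rssm_objective(obs, obs, reward, reward, (mu_q, std_q), (mu_p, std_p))
print(value)
```

Raising `beta` penalizes the posterior more strongly for deviating from the prior, which pushes the prior (the part used for imagination) to predict the posterior without seeing the observation.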

Comparison of Approaches¶

Approach	Reconstruction	Dynamics-aware	Scalable	Used by
VAE	Yes	No (add separately)	Moderate	Dreamer v1, PlaNet
VQ-VAE	Yes	No	Good	IRIS, Genie
Contrastive	No	Optional	Good	CURL, ATC
JEPA	No	Yes	Excellent	V-JEPA
RSSM	Yes	Built-in	Moderate	Dreamer v1-v3
Object-centric	Yes	Optional	Limited	SAVi, SLATE

Key References¶

Ha, D. & Schmidhuber, J. (2018). "Recurrent World Models Facilitate Policy Evolution." NeurIPS.
Hafner, D., et al. (2019). "Learning Latent Dynamics for Planning from Pixels." ICML.
Laskin, M., Srinivas, A., Abbeel, P. (2020). "CURL: Contrastive Unsupervised Representations for Reinforcement Learning." ICML.
Schwarzer, M., et al. (2021). "Data-Efficient Reinforcement Learning with Self-Predictive Representations." ICLR.
LeCun, Y. (2022). "A Path Towards Autonomous Machine Intelligence." Technical report.
Locatello, F., et al. (2020). "Object-Centric Learning with Slot Attention." NeurIPS.