Foundation World Models

Foundation world models are large-scale models trained on diverse data that learn general-purpose representations of physical dynamics and world structure. They represent the convergence of world models, video generation, and foundation model scaling.

From Task-Specific to Foundation Models

Traditional world models are trained on data from a single environment. Foundation world models aim for broad generalization:

Aspect	Task-Specific	Foundation
Training data	One environment	Diverse videos, games, simulations
Generalization	Single task	Novel environments, tasks, embodiments
Scale	Small (millions of parameters)	Large (billions of parameters)
Interaction	Action-conditioned	Action, text, layout conditioned

Key Models

Genie (Google DeepMind, 2024)

Genie (Generative Interactive Environment) learns a controllable world model from unlabeled internet videos.

Architecture:

Video Tokenizer: Spatiotemporal VQ-VAE encodes video into discrete tokens
Latent Action Model: Infers a discrete latent action between each pair of consecutive frames, without any action labels
Dynamics Model: Transformer predicts next frame tokens given current tokens and latent action

Key insight: By learning latent actions from video alone (no action labels), Genie can create playable environments from a single image prompt.
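
The latent-action idea can be illustrated with a deliberately tiny sketch. This is not Genie's architecture: a "frame" here is a single number (an agent's position), the codebook is hand-written rather than learned, and the names `CODEBOOK`, `infer_latent_action`, and `dynamics` are hypothetical. It only shows the two roles involved: inferring a discrete action from consecutive frames, then reusing those actions as controls.

```python
# Toy illustration of latent-action inference (hypothetical, not Genie's code).
# A "frame" is just the agent's x-position; real models operate on video tokens.

CODEBOOK = [-1.0, 0.0, 1.0]  # assumed tiny latent-action codebook (hand-written)


def infer_latent_action(frame_t, frame_next):
    """Return the index of the codebook entry closest to the observed change."""
    delta = frame_next - frame_t
    return min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - delta))


def dynamics(frame_t, action_index):
    """Predict the next frame from the current frame and a latent action."""
    return frame_t + CODEBOOK[action_index]


# Unlabeled "video": positions only, no action labels.
video = [0.0, 1.1, 2.0, 1.9, 0.8]
latent_actions = [infer_latent_action(a, b) for a, b in zip(video, video[1:])]

# At play time a user picks latent actions directly, starting from one frame.
frame = 5.0
played = [frame]
for action_index in [2, 2, 0]:
    frame = dynamics(frame, action_index)
    played.append(frame)
```

Here `latent_actions` comes out as `[2, 2, 1, 0]` (right, right, stay, left), recovered from the positions alone. In a real system both the codebook and the dynamics are learned jointly from video.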

Capabilities:

Generate interactive 2D environments from a single image
Consistent world dynamics (gravity, collisions, etc.)
Trained on 2D platformer gameplay video: roughly 30K hours retained after filtering an initial collection of 200K+ hours

UniSim (UC Berkeley and Google DeepMind, 2023)

UniSim is a universal simulator trained on diverse real-world data sources.

Training data: Combines internet video, robotics data, synthetic 3D data, and more.

Conditioning: Supports multiple input types:

Text descriptions ("a ball rolls down the hill")
Actions (robot joint commands)
Camera poses (3D navigation)

Architecture: Diffusion-based video generation model.

Applications:

Simulate outcomes of robot actions
Generate training data for RL policies
Answer "what if" questions about the physical world

DIAMOND (2024)

DIAMOND (Diffusion for World Modeling) uses a diffusion model as the core dynamics model for RL:

World model: a diffusion model that generates the next observation given current observation and action
RL agent trained entirely inside the diffusion world model
Achieves strong performance on the Atari 100k benchmark, competitive with DreamerV3

Significance: Shows that diffusion models can serve as effective world models for RL, and that the visual detail they preserve can matter for agent performance, not just for image quality.
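
Training "entirely inside the world model" means the agent only ever sees predicted observations and rewards. The sketch below shows that loop under heavy simplification: `world_model_step` is a hand-written stand-in for a learned sampler (such as a diffusion model), the two policies are fixed rules, and "training" is reduced to picking the policy with the higher imagined return. All names and the toy dynamics are assumptions for illustration.

```python
# Sketch of "training in imagination" (hypothetical stand-ins throughout).
# `world_model_step` stands in for a learned sampler such as a diffusion model;
# here it is a hand-written rule so the example runs without dependencies.


def world_model_step(obs, action):
    """Return (next_obs, reward) as a learned world model would predict them."""
    next_obs = obs + action          # assumed toy dynamics on a number line
    reward = -abs(next_obs - 3)      # assumed reward: stay close to position 3
    return next_obs, reward


def imagine_return(policy, start_obs, horizon):
    """Roll the policy forward inside the world model and sum the rewards."""
    obs, total = start_obs, 0
    for _ in range(horizon):
        obs, reward = world_model_step(obs, policy(obs))
        total += reward
    return total


def always_right(obs):
    return 1


def seek_target(obs):
    # Move toward 3, then stay put (booleans subtract to -1, 0, or 1).
    return (obs < 3) - (obs > 3)


# Policy selection uses imagined rollouts only; the real environment is never queried.
candidates = {"always_right": always_right, "seek_target": seek_target}
returns = {
    name: imagine_return(policy, start_obs=0, horizon=6)
    for name, policy in candidates.items()
}
best = max(returns, key=returns.get)
```

With these toy rules the imagined returns are -9 for `always_right` and -3 for `seek_target`, so `best` is `"seek_target"`. A real agent would replace the selection step with gradient-based policy updates on imagined trajectories.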

NVIDIA Cosmos (2025)

Cosmos is NVIDIA's foundation world model platform:

Trained on massive amounts of video data
Designed for physical AI applications (robotics, autonomous driving, industrial simulation)
Provides both pre-trained models and fine-tuning tools
Multiple model sizes for different use cases

Other Notable Models

GAIA-1 (Wayve, 2023): World model for autonomous driving, generates driving scenarios conditioned on text, action, and map inputs
Pandora (2024): Hybrid autoregressive-diffusion world model that generates video and accepts free-text actions for interactive control across multiple domains
GameNGen (Google, 2024): Generates real-time playable Doom using a diffusion model
Oasis (Decart, 2024): Real-time playable Minecraft generation

Technical Foundations

Scaling Laws for World Models

Like language models, world models tend to improve with scale:

More training data → better generalization
Larger models → more accurate predictions
Longer context → better temporal coherence

For a fixed compute budget, the best strategy balances model size against the amount of training data rather than maximizing either one alone.
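
Scaling behavior of this kind is commonly summarized as a power law, for example loss ≈ a · N^(-α) for parameter count N, which becomes a straight line in log-log space. The sketch below fits such a line with ordinary least squares. The data points are synthetic, generated from a known power law purely to demonstrate the procedure; they are not measurements from any world model, and whether a given model family follows a clean power law is an empirical question.

```python
import math

# Synthetic (made-up) points generated from loss = 10 * params ** -0.25.
# They illustrate the fitting procedure only; they are not measured results.
params = [1e6, 1e7, 1e8, 1e9]
losses = [10 * n ** -0.25 for n in params]


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) - alpha * log(x); returns (a, alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly)) / sum(
        (x - mean_x) ** 2 for x in lx
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


a, alpha = fit_power_law(params, losses)

# Extrapolate one order of magnitude beyond the fitted range.
predicted_loss_at_1e10 = a * 1e10 ** -alpha
```

Because the points lie exactly on a power law, the fit recovers a ≈ 10 and α ≈ 0.25. With real measurements the residuals of this fit indicate how far extrapolation can be trusted.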

Tokenization for World Models

Converting raw observations into sequences of tokens, either discrete codes or continuous patch embeddings, lets Transformer-based architectures model them:

Method	Approach	Used by
VQ-VAE	Codebook quantization	Genie, IRIS
FSQ	Finite Scalar Quantization	Cosmos
Patch embedding	ViT-style patches	DiT, Sora
Spatial-temporal	3D tokenization	VideoGPT
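
The difference between the first two rows can be shown on a single latent vector. The sketch below is a simplified illustration with toy sizes: `vq_quantize` looks up the nearest entry in an explicit codebook, while `fsq_quantize` rounds each dimension to a small fixed number of levels, so the codebook is implicit. The FSQ function assumes its input is already bounded to [-1, 1] and omits details of the published method (such as the bounding function and the handling of even level counts).

```python
# Contrast of two quantizers on a single latent vector (toy sizes, pure Python).


def vq_quantize(z, codebook):
    """VQ-VAE style: return the index of the nearest codebook vector."""

    def sq_dist(c):
        return sum((zi - ci) ** 2 for zi, ci in zip(z, c))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def fsq_quantize(z, levels):
    """FSQ style: round each bounded dimension to one of levels[d] values.

    Assumes every z[d] already lies in [-1, 1]. Returns the per-dimension
    integer codes and one token id obtained by mixed-radix packing.
    """
    codes = [round((zi + 1) / 2 * (n - 1)) for zi, n in zip(z, levels)]
    token = 0
    for code, n in zip(codes, levels):
        token = token * n + code
    return codes, token


z = [0.9, -0.2]

# Explicit codebook with 4 entries (normally learned during training).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vq_index = vq_quantize(z, codebook)

# Implicit codebook with 3 * 5 = 15 entries; nothing to learn or look up.
fsq_codes, fsq_token = fsq_quantize(z, levels=[3, 5])
```

For this input `vq_index` is 1, while FSQ yields codes `[2, 2]` packed into token 12. One design consequence: VQ needs extra machinery to keep codebook entries in use during training, whereas FSQ's grid is fixed by construction.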

Conditioning Mechanisms

Foundation world models support rich conditioning:

Action conditioning: Robot commands, game controls, latent actions
Text conditioning: Natural language descriptions of desired outcomes
Image conditioning: Generate dynamics from a single starting frame
Layout conditioning: Spatial arrangement of objects
Camera conditioning: Viewpoint and camera trajectory
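
One way to picture these options is as a single container in which every signal is optional, so the same model interface serves text-only, action-only, or mixed prompts. The `Conditioning` class below is a hypothetical sketch; its field names and types are assumptions and do not mirror the API of any model discussed here.

```python
from dataclasses import dataclass, fields
from typing import Optional, Sequence, Tuple


@dataclass
class Conditioning:
    """Hypothetical bundle of conditioning signals; every field is optional."""

    action: Optional[Sequence[float]] = None                   # e.g. joint commands
    text: Optional[str] = None                                 # natural-language prompt
    image: Optional[Sequence[Sequence[float]]] = None          # starting frame
    layout: Optional[Sequence[Tuple[str, float, float]]] = None  # (label, x, y)
    camera_pose: Optional[Sequence[float]] = None              # e.g. position + rotation

    def active(self):
        """Names of the signals that were actually provided, in field order."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


cond = Conditioning(
    text="a ball rolls down the hill",
    camera_pose=[0.0, 1.5, -2.0, 0.0, 0.0, 0.0],
)
provided = cond.active()
```

Here `provided` is `["text", "camera_pose"]`. A model consuming such a bundle would typically embed each provided signal separately and substitute a learned "missing" embedding for the rest, though that detail varies by architecture.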

Challenges and Open Problems

1. Physical Consistency

Current models often violate physical laws:

Objects disappear or teleport
Gravity is inconsistent
Collisions fail to conserve momentum or energy
Long-horizon dynamics diverge
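
Some of these failures can be flagged automatically once object positions have been extracted from generated frames. The sketch below shows two cheap checks: a per-frame displacement limit to catch teleporting, and a second finite difference of height to see whether falling objects accelerate consistently. The track format (one `(x, y)` position per frame), the threshold, and the example data are all assumptions; obtaining reliable tracks from video is a separate problem not covered here.

```python
# Two cheap plausibility checks on an object track from generated video.
# The track format (one (x, y) per frame) and the threshold are assumptions.


def find_teleports(track, max_step):
    """Return frame indices where the object moves more than max_step in one frame."""
    jumps = []
    for i in range(1, len(track)):
        dx = track[i][0] - track[i - 1][0]
        dy = track[i][1] - track[i - 1][1]
        if (dx * dx + dy * dy) ** 0.5 > max_step:
            jumps.append(i)
    return jumps


def vertical_accelerations(track):
    """Second finite difference of y; roughly constant for an object in free fall."""
    ys = [p[1] for p in track]
    return [ys[i + 1] - 2 * ys[i] + ys[i - 1] for i in range(1, len(ys) - 1)]


# Made-up track: a steady fall (y drops by 1, 3, 5 per frame), then a sudden
# sideways jump while the fall stops.
track = [(0, 20), (0, 19), (0, 16), (0, 11), (9, 11)]
teleports = find_teleports(track, max_step=6)
accels = vertical_accelerations(track)
```

For this track `teleports` is `[4]` and `accels` is `[-2, -2, 5]`: the first two values match constant downward acceleration, and the last one exposes the frame where gravity stops applying.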

2. Controllability

Precise control over generated dynamics remains difficult:

How to specify actions in a general-purpose model?
Latent action discovery is promising but not yet reliable
Text conditioning is imprecise for physical reasoning

3. Evaluation

No consensus on how to evaluate foundation world models:

Visual quality metrics (FVD, FID) don't capture physical accuracy
Task performance (RL return) is environment-specific
Human evaluation is expensive and subjective

4. Computational Cost

Foundation world models are expensive:

Training: thousands of GPU-hours at minimum, and far more for the largest models
Inference: real-time generation is challenging
Memory: long video contexts require significant memory
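
The memory point follows from simple arithmetic: token count grows with clip length, frame rate, and tokens per frame, and an autoregressive Transformer caches keys and values for every token in every layer. The sketch below works through one example. Every number in it (clip length, tokens per frame, layer count, model width, 2-byte values) is an illustrative assumption, not the configuration of any model named in this section.

```python
# Back-of-the-envelope context and KV-cache sizing. Every number below is an
# illustrative assumption, not the configuration of any model named here.


def video_tokens(seconds, fps, tokens_per_frame):
    """Total tokens needed to represent a clip."""
    return seconds * fps * tokens_per_frame


def kv_cache_bytes(num_tokens, num_layers, d_model, bytes_per_value=2):
    """Keys and values (factor 2) are cached for every token in every layer."""
    return 2 * num_layers * num_tokens * d_model * bytes_per_value


tokens = video_tokens(seconds=16, fps=10, tokens_per_frame=256)
cache_gib = kv_cache_bytes(tokens, num_layers=32, d_model=4096) / 2**30
```

Under these assumptions a 16-second clip is 40,960 tokens and its KV cache alone takes 20 GiB per sequence, before counting model weights or activations. Doubling the clip length or the tokens per frame doubles that figure, which is why aggressive tokenizer compression matters.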

The Big Picture

Foundation world models represent a potential path toward general physical intelligence:

graph TD
    V[Internet Video Data] --> FWM[Foundation World Model]
    R[Robot Data] --> FWM
    S[Simulation Data] --> FWM
    FWM --> RL[RL Policy Training]
    FWM --> SIM[Virtual Simulation]
    FWM --> PLAN[Planning & Prediction]
    FWM --> DATA[Synthetic Data Generation]

The vision: train a single model that understands how the physical world works, then use it for any downstream task, such as robotics, autonomous driving, game design, or scientific simulation.

Key References

Bruce, J., et al. (2024). "Genie: Generative Interactive Environments." ICML.
Yang, M., et al. (2023). "Learning Interactive Real-World Simulators." arXiv:2310.06114.
Alonso, E., et al. (2024). "Diffusion for World Modeling: Visual Details Matter in Atari." NeurIPS.
Hu, A., et al. (2023). "GAIA-1: A Generative World Model for Autonomous Driving." arXiv:2309.17080.
Agarwal, N., et al. (2025). "Cosmos World Foundation Model Platform for Physical AI." arXiv:2501.03575.

Work in Progress

Foundation world models are a rapidly evolving area. This section will be updated as new models and results emerge.