Video Prediction
Video prediction — forecasting future visual frames from past observations — is a core capability for world models. Accurate video prediction enables planning, anomaly detection, and understanding of physical dynamics.
Problem Formulation
Given a sequence of past frames (and optionally actions), the task is to predict the frames that follow. Two settings are common:
- Action-conditioned: predicting the outcomes of specific actions (for control)
- Action-free: predicting how the scene evolves on its own (for understanding)
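One way to write the task down, using notation that is assumed here rather than taken from any particular paper: let \(o_{1:t}\) be the observed frames, \(a_{1:t+H-1}\) the actions (past and planned), and \(H\) the prediction horizon. Under a causal, autoregressive factorization, an action-conditioned model defines

\[
p_\theta(o_{t+1:t+H} \mid o_{1:t}, a_{1:t+H-1}) = \prod_{k=1}^{H} p_\theta(o_{t+k} \mid o_{1:t+k-1}, a_{1:t+k-1})
\]

The action-free setting drops the action terms, giving \(p_\theta(o_{t+1:t+H} \mid o_{1:t})\). A deterministic model replaces the distribution with a single point estimate \(\hat{o}_{t+1:t+H} = f_\theta(o_{1:t})\).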
Deterministic Models
Convolutional Sequence Models
Early approaches used convolutional encoder-decoder networks with recurrent dynamics:
- ConvLSTM (Shi et al., 2015): Convolutional LSTM for spatiotemporal prediction
- PredNet (Lotter et al., 2017): Predictive coding architecture inspired by neuroscience
- SVG (Denton & Fergus, 2018): Stochastic video generation with a learned prior; it shares the convolutional recurrent backbone of the models above but is itself stochastic
Limitations
- Blurry predictions: A deterministic model trained with MSE loss is pushed toward the average of the possible futures
- Short horizon: Quality degrades rapidly beyond a few frames
- Single-mode output: One prediction per input, so multiple possible futures cannot be represented
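The blurriness point can be checked on a toy case. The sketch below assumes a three-pixel "frame" in which a bright pixel moves left or right with equal probability; the frames and helper names are illustrative and not taken from any cited model. It shows that the pixelwise mean of the two futures has a lower expected MSE than either sharp future.

```python
# Toy 1-D frames: one bright pixel that ends up on the left or on the right.
# Everything here is a hypothetical illustration, not a real video model.

def mse(a, b):
    """Mean squared error between two equal-length frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

future_left = [1.0, 0.0, 0.0]   # the object moved left
future_right = [0.0, 0.0, 1.0]  # the object moved right
futures = [future_left, future_right]  # assumed equally likely

def expected_mse(prediction):
    """Average MSE of one fixed prediction over the equally likely futures."""
    return sum(mse(prediction, f) for f in futures) / len(futures)

# Pixelwise mean of the possible futures: a faint "ghost" at both positions.
blurry = [sum(pixels) / len(futures) for pixels in zip(*futures)]

print(blurry)                      # [0.5, 0.0, 0.5]
print(expected_mse(blurry))        # lower expected error ...
print(expected_mse(future_left))   # ... than committing to one sharp future
```

A single deterministic output trained on this objective therefore has an incentive to produce the ghosted frame, even though that frame never occurs.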
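Stochastic Models
VAE-Based
SV2P (Babaeizadeh et al., 2018): Stochastic Variational Video Prediction conditions the frame predictor on latent random variables (sampled once per video or at every timestep), so that different latent samples yield different plausible futures.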
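As a rough guide, models in this family are typically trained by maximizing a conditional evidence lower bound. The form below is a generic conditional-VAE objective written in assumed notation (\(o_{1:t}\) for context frames, \(o_{t+1:T}\) for target frames, \(z\) for the latent, \(\beta\) for a KL weight); it is not claimed to be the exact loss of any one paper:

\[
\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid o_{1:T})}\left[\log p_\theta(o_{t+1:T} \mid o_{1:t}, z)\right] - \beta\, D_{\mathrm{KL}}\left(q_\phi(z \mid o_{1:T}) \,\|\, p(z)\right)
\]

The first term rewards accurate reconstruction of the future frames; the second keeps the approximate posterior close to the prior, so that latents sampled from \(p(z)\) at test time still produce sensible predictions.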
FitVid (Babaeizadeh et al., 2021): Scales up stochastic video prediction with architectural improvements for longer, higher-quality predictions.
GAN-Based
DVD-GAN (Clark et al., 2019): Uses adversarial training for sharper video generation. It pairs a spatial discriminator that judges individual frames with a temporal discriminator that judges motion across frames.
- Pros: Sharper predictions than VAE-based models
- Cons: Training instability and mode dropping
Transformer-Based Models
Transformers are now widely used for video prediction, largely because attention can model long-range spatial and temporal dependencies.
VideoGPT and Tokenization
VideoGPT (Yan et al., 2021): Tokenizes video frames using VQ-VAE, then models token sequences autoregressively with a GPT-like Transformer.
Pipeline:
- Encode frames to discrete tokens via VQ-VAE
- Model the token sequence with a causal Transformer
- Autoregressively sample future tokens
- Decode tokens back to frames
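The four pipeline steps can be sketched end to end on toy data. This is a minimal illustration under strong assumptions: the codebook is a hand-picked list of scalars instead of a learned VQ-VAE, and a bigram count table stands in for the causal Transformer. All names (`CODEBOOK`, `fit_bigram`, `sample_future`) are hypothetical.

```python
import random

# Hypothetical 4-entry codebook of scalar values (a real VQ-VAE learns vectors).
CODEBOOK = [0.0, 0.33, 0.66, 1.0]

def encode(frame):
    """Encode: map each value to the index of its nearest codebook entry."""
    return [min(range(len(CODEBOOK)), key=lambda k: abs(CODEBOOK[k] - v))
            for v in frame]

def decode(tokens):
    """Decode: map token indices back to codebook values."""
    return [CODEBOOK[t] for t in tokens]

def fit_bigram(tokens):
    """Model stand-in: count next-token frequencies instead of training a Transformer."""
    counts = {}
    for prev, nxt in zip(tokens, tokens[1:]):
        counts.setdefault(prev, {}).setdefault(nxt, 0)
        counts[prev][nxt] += 1
    return counts

def sample_future(counts, context, n_tokens, rng):
    """Sample: draw tokens one at a time, each conditioned on the previous token."""
    out = list(context)
    for _ in range(n_tokens):
        options = counts.get(out[-1])
        if not options:               # unseen context: repeat the last token
            out.append(out[-1])
            continue
        toks, weights = zip(*options.items())
        out.append(rng.choices(toks, weights=weights)[0])
    return out[len(context):]

# Two past "frames" of three values each, flattened into one token sequence.
past_frames = [[0.0, 0.3, 0.7], [0.0, 0.3, 0.7]]
tokens = [t for frame in past_frames for t in encode(frame)]
model = fit_bigram(tokens)
future_tokens = sample_future(model, tokens, n_tokens=3, rng=random.Random(0))
print(tokens)                 # [0, 1, 2, 0, 1, 2]
print(future_tokens)          # [0, 1, 2]
print(decode(future_tokens))  # [0.0, 0.33, 0.66]
```

Because every context in this toy sequence has a single observed successor, sampling is deterministic here; with real data the sampling step is what produces diverse futures.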
IRIS
IRIS (Micheli et al., 2023): An RL agent that uses a Transformer-based world model with discrete tokens.
- Observations → discrete tokens via VQ-VAE
- Dynamics modeled as autoregressive next-token prediction over interleaved observation and action tokens, with rewards and episode termination predicted alongside the next observation tokens
- Imagination-based policy training
This bridges video prediction and RL directly.
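To make the sequence layout concrete, the sketch below interleaves per-frame observation tokens with one action token per step and lists the resulting next-token training pairs. It is an illustration only: the shared-vocabulary offset, the token values, and the function names are assumptions, and reward and termination prediction are omitted for brevity.

```python
def build_sequence(frame_tokens, actions, action_offset):
    """Interleave observation tokens with one action token per step.

    Action ids are shifted by `action_offset` so they do not collide with
    observation token ids in a shared vocabulary (an assumption of this sketch).
    """
    seq = []
    for obs, act in zip(frame_tokens, actions):
        seq.extend(obs)
        seq.append(action_offset + act)
    return seq

def next_token_pairs(seq):
    """Training pairs for next-token prediction: (prefix, target token)."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

OBS_VOCAB_SIZE = 8            # hypothetical number of observation codes
frames = [[3, 5], [4, 1]]     # two frames, two tokens each
actions = [0, 2]              # one discrete action per frame

seq = build_sequence(frames, actions, action_offset=OBS_VOCAB_SIZE)
print(seq)                    # [3, 5, 8, 4, 1, 10]
for prefix, target in next_token_pairs(seq):
    print(prefix, "->", target)
```

In a setup like this, imagination amounts to feeding the policy's chosen action token back into the sequence and sampling the next frame's tokens from the model.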
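Diffusion-Based Models
Diffusion models have recently achieved state-of-the-art results in video prediction, producing high-quality, diverse samples.
How Diffusion Applies to Video
Standard diffusion: learn to denoise \(x_t = \sqrt{\alpha_t} x_0 + \sqrt{1-\alpha_t} \epsilon\) back to \(x_0\), where \(\epsilon \sim \mathcal{N}(0, I)\) and \(\alpha_t\) is the cumulative signal coefficient of the noise schedule (often written \(\bar{\alpha}_t\)).
For video prediction, \(x_0\) is a future frame (or a sequence of frames), and the denoiser is conditioned on the past frames. Note that the subscript \(t\) here indexes the diffusion step, not the video timestep.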
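A common way to train such a model is to have a network \(\epsilon_\theta(x_t, t, c)\) predict the added noise, where \(c\) stands for the past frames (notation assumed here). The sketch below applies the forward equation to a toy four-pixel future frame and shows that an accurate noise estimate recovers \(x_0\). The `untrained_denoiser` is a hypothetical placeholder that only illustrates the conditioning signature; it is not a real model.

```python
import math
import random

def q_sample(x0, alpha_t, eps):
    """Forward noising: x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps."""
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def estimate_x0(x_t, alpha_t, eps_hat):
    """Invert the forward equation given a noise estimate eps_hat."""
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]

def noise_mse(eps, eps_hat):
    """Noise-prediction training loss for one sample."""
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)

def untrained_denoiser(x_t, alpha_t, past_frames):
    """Hypothetical placeholder: takes the conditioning frames, predicts zero noise."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
past_frames = [[1.0, 0.0, 0.0, 0.0]]   # conditioning input c
future_frame = [0.0, 1.0, 0.0, 0.0]    # clean target x_0
alpha_t = 0.5                          # one point on an assumed noise schedule

eps = [rng.gauss(0.0, 1.0) for _ in future_frame]
x_t = q_sample(future_frame, alpha_t, eps)

# With the true noise, the clean frame is recovered (up to rounding).
recovered = estimate_x0(x_t, alpha_t, eps)
# An untrained predictor incurs a positive noise-prediction loss.
loss = noise_mse(eps, untrained_denoiser(x_t, alpha_t, past_frames))
print(recovered, loss)
```

At sampling time the true noise is unknown, so the network's estimate is used in its place and the denoising is repeated over many steps.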
Key Models
- MCVD (Voleti et al., 2022): Masked Conditional Video Diffusion
- RVD (Yang et al., 2023): Residual Video Diffusion, an autoregressive model that corrects a deterministic next-frame prediction with a residual generated by diffusion
- Sora: Video generation model from OpenAI based on diffusion transformers (DiT)
Advantages of Diffusion for Video
- High quality: Sharp, detailed predictions
- Diversity: Naturally generates diverse futures
- Controllability: Can condition on text, actions, layout, etc.
- Scalability: DiT architecture scales well with compute
Action-Conditioned Video Prediction
Control and robotics require action-conditioned video prediction: the model predicts future frames given both the past frames and a candidate action sequence.
Key works:
- Action-Conditioned Video Prediction (Oh et al., 2015): Early work on predicting Atari frames conditioned on actions
- UniPi (Du et al., 2023): Uses video diffusion as a universal policy interface; it plans by generating a future video, then extracts actions with an inverse dynamics model
- SuSIE (Black et al., 2023): Uses a pretrained image-editing diffusion model to generate subgoal images for robot manipulation
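The interface of an action-conditioned predictor can be illustrated with a hand-written stand-in. In the sketch below, `predict_next` is a hypothetical rule (shift a one-dimensional frame by the action, wrapping at the edges) that takes the place of a learned model; `rollout` then feeds each predicted frame back in to imagine a multi-step future.

```python
def predict_next(frame, action):
    """Stand-in for a learned model: shift the frame by `action` (-1, 0 or +1).

    Positive actions move content to the right; positions wrap around.
    """
    n = len(frame)
    return [frame[(i - action) % n] for i in range(n)]

def rollout(frame, actions):
    """Autoregressive rollout: each prediction becomes the next input."""
    frames = []
    for action in actions:
        frame = predict_next(frame, action)
        frames.append(frame)
    return frames

start = [0, 1, 0, 0]              # a single bright pixel at index 1
imagined = rollout(start, [1, 1, -1])
for f in imagined:
    print(f)
# [0, 0, 1, 0]
# [0, 0, 0, 1]
# [0, 0, 1, 0]
```

With a learned model, errors made at one step are fed into the next, which is one reason long rollouts tend to drift.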
Evaluation Metrics
| Metric | What It Measures | Notes |
|---|---|---|
| MSE / PSNR | Pixel-level accuracy | Tends to favor blurry predictions; correlates weakly with perceived quality |
| SSIM | Structural similarity | Better than MSE but still pixel-level |
| LPIPS | Perceptual similarity | Uses deep features, correlates with human judgment |
| FVD | Fréchet Video Distance | Distribution-level metric, the video equivalent of FID |
| FID per frame | Per-frame distribution quality | Useful for long-horizon evaluation |
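To give a feel for what a Fréchet-style metric measures, the sketch below computes the Fréchet distance between two one-dimensional Gaussians fitted to scalar "features". This is a deliberate simplification: FVD and FID use high-dimensional features from a pretrained network and full covariance matrices, and the feature values here are made up.

```python
import math
from statistics import fmean, pvariance

def frechet_1d(real_feats, gen_feats):
    """Frechet distance between 1-D Gaussians fitted to two feature sets.

    (mu_r - mu_g)^2 + var_r + var_g - 2 * sqrt(var_r * var_g)
    """
    mu_r, mu_g = fmean(real_feats), fmean(gen_feats)
    var_r, var_g = pvariance(real_feats), pvariance(gen_feats)
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)

real = [0.0, 1.0, 2.0, 3.0, 4.0]        # hypothetical features of real videos
shifted = [1.0, 2.0, 3.0, 4.0, 5.0]     # same spread, wrong mean
collapsed = [2.0, 2.0, 2.0, 2.0, 2.0]   # right mean, no diversity

print(frechet_1d(real, real))       # 0.0
print(frechet_1d(real, shifted))    # 1.0
print(frechet_1d(real, collapsed))  # 2.0
```

The collapsed set matches the real mean exactly yet still scores poorly, which is the sense in which a distribution-level metric penalizes a lack of diversity.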
Metric Limitations
Pixel-level metrics (MSE, PSNR, SSIM) can be misleading in stochastic environments: a blurry average over possible futures may score well on MSE yet be useless for planning. FVD and LPIPS are generally more informative.
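A small evaluation example, with made-up five-pixel frames, shows the effect on PSNR: a sharp prediction that places the object one pixel off scores worse than a blur spread across both positions. The helper names are illustrative, and pixel values are assumed to lie in [0, 1].

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(a, b)
    return float("inf") if err == 0 else 10.0 * math.log10(max_val ** 2 / err)

truth = [0.0, 0.0, 0.0, 1.0, 0.0]          # object at index 3
sharp_shifted = [0.0, 0.0, 1.0, 0.0, 0.0]  # sharp object, one pixel off
blurry = [0.0, 0.0, 0.5, 0.5, 0.0]         # smeared across both positions

print(psnr(truth, sharp_shifted))  # about 3.98 dB
print(psnr(truth, blurry))         # about 10 dB
```

For a planner that needs to know where the object is, the sharp prediction is arguably the more useful of the two, yet it receives the lower score.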
Connection to World Models
Video prediction models become world models when combined with:
- Action conditioning: predict consequences of actions
- Reward prediction: estimate rewards from predicted futures
- Planning algorithms: use predictions for decision-making
The field is moving from pure video prediction toward integrated world models that jointly handle perception, prediction, and control.
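The three ingredients can be combined in a few lines. The sketch below is a toy under stated assumptions: `predict_next` is a hand-written stand-in for an action-conditioned predictor, `predict_reward` is a hypothetical reward model that reads the brightness at a goal pixel, and `plan` is a simple exhaustive search over short action sequences (practical systems use sampling-based or learned planners instead).

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # move left, stay, move right

def predict_next(frame, action):
    """Action conditioning (stand-in): shift the frame by `action`, wrapping."""
    n = len(frame)
    return [frame[(i - action) % n] for i in range(n)]

def predict_reward(frame, goal):
    """Reward prediction (stand-in): brightness at the goal position."""
    return frame[goal]

def plan(frame, goal, horizon):
    """Planning: score every action sequence by its imagined return."""
    best_seq, best_return = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        imagined, total = frame, 0.0
        for action in seq:
            imagined = predict_next(imagined, action)
            total += predict_reward(imagined, goal)
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq, best_return

start = [1.0, 0.0, 0.0, 0.0, 0.0]   # object at index 0
best_seq, best_return = plan(start, goal=2, horizon=3)
print(best_seq, best_return)        # (1, 1, 0) 2.0
```

The planner never touches a real environment: every candidate is evaluated purely inside the model's predictions, which is what makes the predictor a world model in this sense.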
Key References
- Shi, X., et al. (2015). "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting." NeurIPS.
- Babaeizadeh, M., et al. (2018). "Stochastic Variational Video Prediction." ICLR.
- Yan, W., et al. (2021). "VideoGPT: Video Generation using VQ-VAE and Transformers." arXiv.
- Micheli, V., Alonso, E., Fleuret, F. (2023). "Transformers are Sample-Efficient World Models." ICLR.
- Du, Y., et al. (2023). "Learning Universal Policies via Text-Guided Video Generation." NeurIPS.