Part IV: Distributed Reinforcement Learning
Training RL agents at scale requires distributing computation across many machines. This section covers the system architectures, frameworks, and practical considerations for distributed RL — from the foundational designs to modern systems that power large-scale training.
What You'll Learn
- Architecture Paradigms — A3C, Ape-X, IMPALA, SEED RL, and their design trade-offs
- Frameworks & Systems — RLlib, Acme, EnvPool, Sample Factory, and other systems
- Large-Model RL Infrastructure — Modern systems for RLHF, agentic RL, and embodied training
- Practical Scaling Guide — How to scale RL training effectively
The first two pages cover classical distributed RL patterns. The Large-Model RL Infrastructure page connects those ideas to the trainer, rollout, reward, and environment components that make up large-model RL systems.
Why Distributed RL?
RL training has unique computational characteristics that demand distributed systems:
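| Aspect | How RL differs from supervised learning |
|---|---|
| Data generation | No fixed dataset: data comes from interacting with an environment, and is either used immediately (on-policy) or stored in a replay buffer and reused (off-policy) |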
| Data freshness | On-policy methods need current-policy data (stale data hurts) |
| Environment cost | Simulation can be the bottleneck (complex physics, rendering) |
| Exploration | Parallel environments provide diverse experience |
| Training loop | Tightly coupled: collect → train → update → collect again |
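The "Training loop" row can be made concrete with a toy example. The following is a minimal sketch, not a production algorithm: it assumes a hypothetical two-armed bandit (arm 1 always pays 1, arm 0 pays 0), a one-parameter sigmoid policy, and a plain REINFORCE update. The names `collect`, `train`, and `run` are illustrative only. The point is that each `collect` call depends on the `theta` produced by the previous update, which is the coupling that makes naive parallelization hard.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def collect(theta, rng, num_samples):
    """Collect (action, reward) pairs using the current policy."""
    p_arm1 = sigmoid(theta)
    batch = []
    for _ in range(num_samples):
        action = 1 if rng.random() < p_arm1 else 0
        reward = 1.0 if action == 1 else 0.0  # assumed toy bandit payoff
        batch.append((action, reward))
    return batch


def train(theta, batch):
    """REINFORCE estimate: mean of reward * d/dtheta log pi(action)."""
    p_arm1 = sigmoid(theta)
    # For a sigmoid policy, d/dtheta log pi(action) = action - p_arm1.
    return sum(reward * (action - p_arm1) for action, reward in batch) / len(batch)


def run(iterations=20, batch_size=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0  # start with a 50/50 policy
    for _ in range(iterations):
        batch = collect(theta, rng, batch_size)  # collect with the current policy
        grad = train(theta, batch)               # train on that fresh batch
        theta += lr * grad                       # update, then collect again
    return sigmoid(theta)


if __name__ == "__main__":
    print(f"P(arm 1) after training: {run():.3f}")
```

Because the batch is discarded after every update, the learner sits idle while `collect` runs, and vice versa. Distributed designs differ mainly in how they relax this strict alternation.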
Single-machine RL hits walls:
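- Sample collection is slow: A single environment yields one transition per step, so throughput is capped by how fast that one environment can step
- GPU underutilized: Policy networks are often small, so the GPU idles while it waits for environment data
- Training time: Complex tasks can require billions of environment steps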
Distributed RL addresses these by parallelizing environment interaction, data collection, and policy training.
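A back-of-the-envelope calculation shows why step count alone forces parallelism. The sketch below assumes ideal linear scaling with the number of environments and ignores communication and learner overhead. The figures (one billion steps, 1,000 steps per second per environment, 256 environments) are hypothetical, chosen only for illustration.

```python
def wall_clock_seconds(total_steps, steps_per_second_per_env, num_envs=1):
    """Idealized collection time, assuming throughput scales linearly with num_envs."""
    if total_steps < 0 or steps_per_second_per_env <= 0 or num_envs < 1:
        raise ValueError("steps must be >= 0, rate > 0, and num_envs >= 1")
    return total_steps / (steps_per_second_per_env * num_envs)


# Hypothetical workload: 1e9 environment steps at 1,000 steps/s per environment.
single = wall_clock_seconds(1_000_000_000, 1_000)
parallel = wall_clock_seconds(1_000_000_000, 1_000, num_envs=256)

print(f"1 env:    {single / 86_400:.1f} days")    # 11.6 days
print(f"256 envs: {parallel / 3_600:.2f} hours")  # 1.09 hours
```

Real systems fall short of this bound because of synchronization, serialization, and network costs, which is what the architecture choices in this part trade off.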
The Design Space
graph TD
D[Distributed RL Design] --> DP[Data Parallelism]
D --> EP[Environment Parallelism]
D --> PP[Pipeline Parallelism]
DP --> SA[Synchronous<br/>A2C, PPO]
DP --> AA[Asynchronous<br/>A3C, IMPALA]
EP --> VP[Vectorized Envs<br/>GPU-accelerated]
EP --> RP[Remote Envs<br/>Multi-machine]
PP --> SF[Separate Actor/Learner<br/>Ape-X, SEED]
The key design decisions:
- Synchronous vs. asynchronous updates
- Where computation happens (CPU actors, GPU learners)
- How data flows between actors and learners
- How fresh the data needs to be (on-policy vs. off-policy tolerance)
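These decisions interact, and a small simulation can show how. The following is a rough sketch of an asynchronous actor/learner split using only the Python standard library. Everything in it is an assumption made for illustration: `ParameterStore` is a version counter standing in for real weights, rollouts are fake numbers, and the learner simply drops batches that are more than `max_lag` updates old. Dropping is a simplification; IMPALA, for example, corrects for lag with V-trace instead of discarding data.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Batch:
    actor_id: int
    policy_version: int  # weights version the actor used while collecting
    rewards: list


class ParameterStore:
    """Thread-safe version counter standing in for real policy weights."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0

    def get(self):
        with self._lock:
            return self._version

    def bump(self):
        with self._lock:
            self._version += 1


def actor(actor_id, store, out_queue, num_batches):
    """Collect fake rollouts and send them on without waiting for updates."""
    for step in range(num_batches):
        version = store.get()  # read the newest weights before collecting
        rewards = [float((actor_id + step + i) % 3) for i in range(4)]
        out_queue.put(Batch(actor_id, version, rewards))  # blocks if the queue is full


def learner(store, in_queue, total_batches, max_lag):
    """Consume batches, skipping any that are more than max_lag updates stale."""
    used, dropped, lags = 0, 0, []
    for _ in range(total_batches):
        batch = in_queue.get()
        lag = store.get() - batch.policy_version
        lags.append(lag)
        if lag > max_lag:
            dropped += 1  # too stale for this learner's tolerance
            continue
        store.bump()  # stands in for one gradient step
        used += 1
    return used, dropped, lags


def run(num_actors=4, batches_per_actor=25, max_lag=8):
    store = ParameterStore()
    batches = queue.Queue(maxsize=16)  # bounded queue limits how far actors run ahead
    threads = [
        threading.Thread(target=actor, args=(i, store, batches, batches_per_actor))
        for i in range(num_actors)
    ]
    for t in threads:
        t.start()
    used, dropped, lags = learner(store, batches, num_actors * batches_per_actor, max_lag)
    for t in threads:
        t.join()
    return used, dropped, lags, store.get()


if __name__ == "__main__":
    used, dropped, lags, version = run()
    print(f"used={used} dropped={dropped} max_lag_seen={max(lags)} final_version={version}")
```

Thread scheduling makes the exact counts vary between runs. The knobs map onto the list above: the queue is the data path between actors and learner, its `maxsize` bounds how far actors can run ahead, and `max_lag` encodes how much staleness the algorithm tolerates. A synchronous design would instead make every actor wait for the next policy version before collecting again.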
Connection to Other Parts
- Part I (RL): The algorithms being scaled — PPO, SAC, etc.
- Part III (Embodied AI): Distributed systems enable massive-scale locomotion training (thousands of parallel environments)