Part IV: Distributed Reinforcement Learning

Training RL agents at scale requires distributing computation across many machines. This section covers the system architectures, frameworks, and practical considerations for distributed RL — from the foundational designs to modern systems that power large-scale training.

What You'll Learn

Architecture Paradigms — A3C, Ape-X, IMPALA, SEED RL, and their design trade-offs
Frameworks & Systems — RLlib, Acme, EnvPool, Sample Factory, and other systems
Large-Model RL Infrastructure — Modern systems for RLHF, agentic RL, and embodied training
Practical Scaling Guide — How to scale RL training effectively

The first two pages focus on classical distributed RL patterns. The Large-Model RL Infrastructure page bridges those ideas to the trainer-rollout-reward-environment systems used in large-model RL.

Why Distributed RL?

RL training has unique computational characteristics that demand distributed systems:

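Aspect	How RL differs from supervised learning
Data generation	The agent generates its own training data by interacting with an environment; off-policy methods also reuse past experience from a replay buffer
Data freshness	On-policy methods need data collected by the current policy, so stale data degrades learning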
Environment cost	Simulation can be the bottleneck (complex physics, rendering)
Exploration	Parallel environments provide diverse experience
Training loop	Tightly coupled: collect → train → update → collect again

Single-machine RL hits walls:

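Sample collection is slow: A single environment produces transitions one step at a time, so throughput is capped by how fast that one environment runs
GPU underutilized: Policy networks are often small, so the GPU sits idle while it waits for environment data
Training time: Complex tasks can require billions of environment steps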

Distributed RL addresses these by parallelizing environment interaction, data collection, and policy training.
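
To make the scale concrete, the back-of-the-envelope estimate below converts a step budget into wall-clock time. All numbers are illustrative assumptions, not measurements: one billion steps, 1,000 steps per second per environment, 256 parallel environments, and an assumed 80% scaling efficiency. The helper `wall_clock_hours` is a hypothetical name introduced for this sketch.

```python
def wall_clock_hours(
    total_steps: int,
    steps_per_sec_per_env: float,
    num_envs: int = 1,
    efficiency: float = 1.0,
) -> float:
    """Estimate hours needed to collect `total_steps` environment steps.

    Assumes throughput scales linearly with the number of environments,
    discounted by a flat `efficiency` factor (an assumption; real systems
    lose throughput to communication and synchronization in less uniform ways).
    """
    throughput = steps_per_sec_per_env * num_envs * efficiency
    return total_steps / throughput / 3600.0


# Illustrative numbers only.
single = wall_clock_hours(1_000_000_000, 1_000.0)
parallel = wall_clock_hours(1_000_000_000, 1_000.0, num_envs=256, efficiency=0.8)

print(f"1 env: {single:.1f} h ({single / 24:.1f} days)")   # 1 env: 277.8 h (11.6 days)
print(f"256 envs at 80% efficiency: {parallel:.2f} h")     # 256 envs at 80% efficiency: 1.36 h
```

Under these assumptions, collection alone drops from more than eleven days to under an hour and a half, which is why environment parallelism is usually the first lever to pull.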

The Design Space

graph TD
    D[Distributed RL Design] --> DP[Data Parallelism]
    D --> EP[Environment Parallelism]
    D --> PP[Pipeline Parallelism]

    DP --> SA[Synchronous<br/>A2C, PPO]
    DP --> AA[Asynchronous<br/>A3C, IMPALA]

    EP --> VP[Vectorized Envs<br/>GPU-accelerated]
    EP --> RP[Remote Envs<br/>Multi-machine]

    PP --> SF[Separate Actor/Learner<br/>Ape-X, SEED]

The key design decisions:

Synchronous vs. asynchronous updates
Where computation happens (CPU actors, GPU learners)
How data flows between actors and learners
How fresh the data needs to be (on-policy vs. off-policy tolerance)
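
The decisions above can be seen together in a small sketch of an asynchronous actor/learner split. It is a toy under stated assumptions, not the design of any system named in this part: threads stand in for machines, an integer version stands in for network weights, and `ParameterStore`, `Batch`, `actor`, `learner`, `run`, and `max_staleness` are hypothetical names introduced here. The learner measures how many policy versions old each batch is and drops batches beyond a tolerance, which is one simple way to express on-policy vs. off-policy tolerance.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Batch:
    actor_id: int
    policy_version: int  # version of the policy that generated this data
    transitions: list


class ParameterStore:
    """Holds the latest policy version (a stand-in for real network weights)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version = 0

    def get(self) -> int:
        with self._lock:
            return self._version

    def publish(self) -> int:
        with self._lock:
            self._version += 1
            return self._version


def actor(actor_id, store, out_queue, num_batches, batch_size=4):
    """Collect placeholder transitions with whatever policy version is current."""
    for _ in range(num_batches):
        version = store.get()
        transitions = [(actor_id, version, step) for step in range(batch_size)]
        out_queue.put(Batch(actor_id, version, transitions))  # blocks if the queue is full


def learner(store, in_queue, total_batches, max_staleness):
    """Consume batches, drop those that are too stale, and publish new versions."""
    used, dropped, staleness_log = 0, 0, []
    for _ in range(total_batches):
        batch = in_queue.get()
        staleness = store.get() - batch.policy_version
        staleness_log.append(staleness)
        if staleness > max_staleness:
            dropped += 1  # too far off-policy for this tolerance
            continue
        used += 1
        store.publish()  # stand-in for a gradient step plus weight broadcast
    return used, dropped, staleness_log


def run(num_actors=4, batches_per_actor=5, max_staleness=2):
    store = ParameterStore()
    data_queue = queue.Queue(maxsize=8)  # bounded queue applies backpressure to actors
    threads = [
        threading.Thread(target=actor, args=(i, store, data_queue, batches_per_actor))
        for i in range(num_actors)
    ]
    for t in threads:
        t.start()
    result = learner(store, data_queue, num_actors * batches_per_actor, max_staleness)
    for t in threads:
        t.join()
    return result


used, dropped, staleness_log = run()
print(f"used={used} dropped={dropped} max_staleness_seen={max(staleness_log)}")
```

The exact counts vary from run to run because thread scheduling is nondeterministic. Setting `max_staleness` to 0 approximates a strict on-policy requirement, while a large value accepts any data, as a replay-based off-policy learner would.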

Connection to Other Parts

Part I (RL): The algorithms being scaled — PPO, SAC, etc.
Part III (Embodied AI): Distributed systems enable massive-scale locomotion training (thousands of parallel environments)