Architecture Paradigms for Distributed RL¶
This page covers the major distributed RL architectures, from early asynchronous designs to modern high-throughput systems. Understanding these paradigms helps you choose the right system for your workload and design custom systems when needed.
A3C: Asynchronous Advantage Actor-Critic¶
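A3C (Mnih et al., 2016) was one of the first widely adopted parallel RL architectures. The original implementation ran its asynchronous workers as threads on a single multi-core machine.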
Design¶
graph TD
G[Global Parameters θ] --> W1[Worker 1<br/>Env + Policy]
G --> W2[Worker 2<br/>Env + Policy]
G --> W3[Worker 3<br/>Env + Policy]
G --> WN[Worker N<br/>Env + Policy]
W1 -->|async gradient| G
W2 -->|async gradient| G
W3 -->|async gradient| G
WN -->|async gradient| G
Each worker:
- Copies global parameters \(\theta\)
- Runs environment for \(n\) steps
- Computes local gradients
- Asynchronously updates global parameters (no locking)
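The worker loop above can be sketched with plain Python threads. This is a minimal illustration, not the original implementation: a toy quadratic loss stands in for the actor-critic loss, the environment rollout is omitted, and `TARGET`, `LR`, `local_gradient`, `worker`, and `run` are hypothetical names chosen for the example.

```python
import threading

# Shared "global parameters": a plain list mutated in place by every worker.
global_theta = [0.0, 0.0]
TARGET = [1.0, -2.0]  # hypothetical optimum of the toy loss
LR = 0.05             # hypothetical learning rate


def local_gradient(theta):
    """Gradient of the toy loss 0.5 * ||theta - TARGET||^2 (stand-in for the A3C loss)."""
    return [t - g for t, g in zip(theta, TARGET)]


def worker(num_updates):
    for _ in range(num_updates):
        theta = list(global_theta)    # 1. copy global parameters
        # 2. a real worker would run its environment for n steps here
        grad = local_gradient(theta)  # 3. compute local gradients
        for i, g in enumerate(grad):  # 4. apply them to the shared parameters, no lock
            global_theta[i] -= LR * g


def run(num_workers=4, num_updates=200):
    threads = [threading.Thread(target=worker, args=(num_updates,))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(global_theta)
```

Because no lock is taken, a worker may compute its gradient from parameters that another worker has already changed. That is the stale-gradient effect listed under Limitations; on this toy loss the updates still converge towards `TARGET`.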
Properties¶
| Aspect | Detail |
|---|---|
| Parallelism | Multiple CPU workers |
| Update | Asynchronous (no synchronization barrier) |
| Data freshness | Workers may use stale parameters |
| Throughput | Moderate (CPU-bound) |
| Algorithm | On-policy (advantage actor-critic with n-step returns) |
Limitations¶
- Stale gradients: Workers compute gradients with outdated parameters
- CPU-bound: Policy inference on CPU is slow
- No replay: On-policy, so data is discarded immediately
Ape-X: Distributed Prioritized Experience Replay¶
Ape-X (Horgan et al., 2018) introduced the actor-learner separation with a shared replay buffer.
Design¶
graph LR
A1[Actor 1] -->|experience| RB[Replay Buffer<br/>Prioritized]
A2[Actor 2] -->|experience| RB
AN[Actor N] -->|experience| RB
RB -->|sampled batches| L[Learner<br/>GPU]
L -->|updated params| A1
L -->|updated params| A2
L -->|updated params| AN
- Actors (many, CPU): Run environments, generate experience with slightly stale policies
- Learner (one, GPU): Samples from replay buffer, performs gradient updates
- Replay buffer: Centralized, prioritized by TD error
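As a rough sketch of the replay component, the class below stores transitions in a ring buffer and samples them in proportion to `|TD error| ** alpha`. It is a simplified illustration under stated assumptions: the class name, `alpha`, and `eps` are hypothetical, and a production buffer would typically use a sum-tree for efficient sampling and apply importance-sampling weights in the learner, both of which are omitted here.

```python
import random


class PrioritizedReplay:
    """Toy proportional prioritized replay buffer (illustrative only)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha      # how strongly priorities skew sampling
        self.eps = eps          # keeps zero-error transitions sampleable
        self.items = []
        self.priorities = []
        self.next_slot = 0      # ring-buffer write position

    def _priority(self, td_error):
        return (abs(td_error) + self.eps) ** self.alpha

    def add(self, transition, td_error):
        """Called by actors: insert a transition with its initial priority."""
        if len(self.items) < self.capacity:
            self.items.append(transition)
            self.priorities.append(self._priority(td_error))
        else:  # buffer full: overwrite the oldest entry
            self.items[self.next_slot] = transition
            self.priorities[self.next_slot] = self._priority(td_error)
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        """Called by the learner: draw indices with probability proportional to priority."""
        indices = rng.choices(range(len(self.items)),
                              weights=self.priorities, k=batch_size)
        return indices, [self.items[i] for i in indices]

    def update(self, indices, td_errors):
        """Called by the learner after a gradient step to refresh priorities."""
        for i, err in zip(indices, td_errors):
            self.priorities[i] = self._priority(err)
```

Actors would call `add`, while the learner alternates between `sample` and `update`, so the transitions with the largest current TD error are revisited most often.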
Properties¶
- Massive data throughput (hundreds of actors)
- Off-policy (DQN family), so it tolerates stale data naturally
- GPU learner is fully utilized (always training)
- Easily scales actor count
IMPALA: Importance Weighted Actor-Learner Architecture¶
IMPALA (Espeholt et al., 2018) applies the actor-learner separation to on-policy methods using importance sampling correction.
Design¶
Similar to Ape-X but for policy gradient methods:
- Actors: Collect trajectories with behavior policy \(\mu\)
- Learner: Updates policy \(\pi_\theta\) using trajectories from actors
- V-trace: Importance sampling correction for off-policy data
V-Trace¶
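The V-trace target corrects for the policy lag between actors and learner. For a trajectory of \(n\) steps starting at step \(s\), with value estimate \(V\) and discount \(\gamma\):

\[
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left( \prod_{i=s}^{t-1} c_i \right) \delta_t V
\]

where:

\[
\delta_t V = \rho_t \left( r_t + \gamma V(x_{t+1}) - V(x_t) \right), \quad
\rho_t = \min\left( \bar{\rho}, \frac{\pi_\theta(a_t \mid x_t)}{\mu(a_t \mid x_t)} \right), \quad
c_i = \min\left( \bar{c}, \frac{\pi_\theta(a_i \mid x_i)}{\mu(a_i \mid x_i)} \right)
\]

The truncated importance weights \(\rho_t\) and \(c_i\) bound the variance while correcting for policy lag.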
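In practice the targets are usually computed with a backward recursion over the trajectory rather than by expanding the sum: the correction at step \(t\) is the truncated TD error at \(t\) plus \(\gamma c_t\) times the correction at step \(t+1\). The function below is a minimal pure-Python sketch of that recursion for a single trajectory. The function name, argument names, and default thresholds are assumptions made for this example; `log_rhos[t]` is assumed to hold \(\log \pi_\theta(a_t \mid x_t) - \log \mu(a_t \mid x_t)\).

```python
import math


def vtrace_targets(log_rhos, rewards, values, bootstrap_value,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Return V-trace targets for one trajectory.

    log_rhos, rewards and values are Python lists of equal length;
    bootstrap_value is the value estimate for the state after the last step.
    """
    n = len(rewards)
    next_values = values[1:] + [bootstrap_value]
    targets = [0.0] * n
    acc = 0.0  # running correction: target minus value estimate at step t + 1
    for t in reversed(range(n)):
        ratio = math.exp(log_rhos[t])
        rho = min(rho_bar, ratio)  # truncated weight on the TD error
        c = min(c_bar, ratio)      # truncated "trace" weight
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets
```

When actor and learner policies coincide (all log-ratios are zero) and both thresholds are at least 1, the targets reduce to ordinary n-step bootstrapped returns, which is a convenient sanity check.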
Properties¶
- Works with policy gradient algorithms (not just Q-learning)
- Handles moderate policy lag (actors 1-2 updates behind)
- GPU-efficient: single learner fully utilizes accelerator
- Scales to hundreds of actors
SEED RL: Scalable, Efficient Deep-RL¶
SEED RL (Espeholt et al., 2020) addresses a key bottleneck in IMPALA: policy inference on CPU actors is slow.
Key Insight¶
Move policy inference to the learner (GPU/TPU):
graph LR
E1[Env 1<br/>CPU] -->|obs| INF[Inference Server<br/>GPU/TPU]
E2[Env 2<br/>CPU] -->|obs| INF
EN[Env N<br/>CPU] -->|obs| INF
INF -->|actions| E1
INF -->|actions| E2
INF -->|actions| EN
INF -->|trajectories| L[Learner<br/>GPU/TPU]
- Environment workers (CPU): Only step the environment, send observations to the inference server
- Inference server (GPU/TPU): Batches observations from all workers, runs policy inference
- Learner (GPU/TPU): Co-located with inference server, trains the model
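The request/response pattern between environment workers and the inference server can be sketched with threads and queues. This is a toy illustration under stated assumptions: `batched_policy` is a hypothetical stand-in for a batched forward pass on the accelerator, the "environment" is a counter, and a real system would use a network transport rather than in-process queues.

```python
import queue
import threading


def batched_policy(observations):
    """Hypothetical stand-in for one batched forward pass on the accelerator."""
    return [obs % 2 for obs in observations]


class InferenceServer:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.requests = queue.Queue()
        self.batches_run = 0

    def act(self, obs):
        """Called by an environment worker; blocks until its action is ready."""
        reply = queue.Queue(maxsize=1)
        self.requests.put((obs, reply))
        return reply.get()

    def serve(self, num_batches):
        for _ in range(num_batches):
            # Wait until a full batch of observations has arrived.
            pending = [self.requests.get() for _ in range(self.batch_size)]
            actions = batched_policy([obs for obs, _ in pending])
            self.batches_run += 1
            for (_, reply), action in zip(pending, actions):
                reply.put(action)


def env_worker(server, worker_id, steps, log):
    obs = worker_id
    for _ in range(steps):
        action = server.act(obs)  # the worker holds no copy of the model
        log.append((worker_id, obs, action))
        obs += 1 + action         # toy environment transition


def run(num_envs=4, steps=5):
    server = InferenceServer(batch_size=num_envs)
    log = []
    threads = [threading.Thread(target=server.serve, args=(steps,))]
    threads += [threading.Thread(target=env_worker, args=(server, i, steps, log))
                for i in range(num_envs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server, log
```

Here the batch size equals the number of workers, so every batch holds exactly one observation per environment. A real server would more likely batch whatever has arrived within a short time window so that slow environments do not stall the rest.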
Advantages Over IMPALA¶
| Aspect | IMPALA | SEED RL |
|---|---|---|
| Policy inference | On CPU (slow) | On GPU/TPU (fast, batched) |
| Network traffic | Parameters → actors, trajectories → learner | Observations ↔ actions |
| Model copies | N copies (one per actor) | 1 copy (on learner) |
| Parameter staleness | 1-2 updates | Near-zero (same device) |
| Large models | Limited by CPU memory | Only limited by GPU memory |
Performance¶
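Espeholt et al. (2020) report that SEED RL can train on millions of frames per second, reaches state-of-the-art Atari-57 results about three times faster in wall-clock time, and cuts experiment cost by 40% to 80% in the scenarios they consider.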
GPU-Accelerated Environments¶
A more recent paradigm: run environments entirely on GPU alongside the policy network.
Isaac Gym / Isaac Lab¶
NVIDIA's GPU-accelerated physics simulation:
- Thousands of parallel environments on a single GPU
- No CPU-GPU data transfer (everything stays on GPU)
- Sub-millisecond per environment step
This removes the traditional bottleneck of moving data between CPU-based environments and GPU-based policy training:
graph LR
subgraph GPU
E[Parallel Envs<br/>4096-65536] --> P[Policy Network]
P --> E
P --> O[Optimizer]
O --> P
end
Sample Factory¶
Sample Factory (Petrenko et al., 2020) achieves very high throughput on a single machine by:
- Asynchronous vectorized environments
- Careful memory management and batching
- Overlapping environment stepping and policy inference
It achieves 100K+ FPS on Atari on a single GPU machine.
Summary Comparison¶
| Architecture | Year | Actors | Learner | Sync | Best For |
|---|---|---|---|---|---|
| A3C | 2016 | CPU (inference) | CPU (shared) | Async | Simple, small scale |
| Ape-X | 2018 | CPU (inference) | GPU | Async | Off-policy (DQN) |
| IMPALA | 2018 | CPU (inference) | GPU | Async | On-policy at scale |
| SEED RL | 2020 | CPU (env only) | GPU/TPU | Async | Large models, high throughput |
| GPU envs | 2021+ | GPU (Isaac Gym) | GPU | Sync | Robotics, continuous control |
| Sample Factory | 2020 | CPU (vectorized) | GPU | Async | Single-machine, high FPS |
Choosing an Architecture¶
Decision framework:
- On-policy (PPO)? → GPU environments (Isaac Gym) or IMPALA/SEED for complex envs
- Off-policy (SAC, DQN)? → Ape-X style with replay buffer
- Simple environments? → Vectorized on single machine (Sample Factory)
- Complex simulation (rendering, physics)? → Actor-learner split (SEED, IMPALA)
- Large policy network? → SEED RL (inference on GPU)
- Robotics/continuous control? → Isaac Gym + PPO (dominant paradigm)
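The checklist can also be read as a small decision function. The sketch below is one possible encoding, not a definitive rule: the function name, the flag names, and in particular the order in which the rules are checked are assumptions, since the criteria above overlap.

```python
def choose_architecture(on_policy, large_policy=False,
                        gpu_simulable=False, complex_simulation=False):
    """Map coarse workload traits to one of the architectures discussed on this page.

    The precedence of the checks is an assumption of this sketch.
    """
    if not on_policy:
        # Off-policy (SAC, DQN): replay tolerates stale data.
        return "Ape-X style actor-learner with replay buffer"
    if large_policy:
        # Keep the single model copy on the accelerator.
        return "SEED RL (central inference on GPU/TPU)"
    if gpu_simulable:
        # Robotics / continuous control that runs in a GPU simulator.
        return "GPU environments (Isaac Gym) + PPO"
    if complex_simulation:
        # Expensive rendering or physics on CPU: split actors from the learner.
        return "Actor-learner split (SEED RL or IMPALA)"
    return "Vectorized single-machine sampler (Sample Factory)"
```

For example, `choose_architecture(on_policy=True, gpu_simulable=True)` selects the Isaac Gym + PPO setup, while `choose_architecture(on_policy=False)` selects the Ape-X style design regardless of the other flags.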
Key References¶
- Mnih, V., et al. (2016). "Asynchronous Methods for Deep Reinforcement Learning." ICML.
- Horgan, D., et al. (2018). "Distributed Prioritized Experience Replay." ICLR.
- Espeholt, L., et al. (2018). "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures." ICML.
- Espeholt, L., et al. (2020). "SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference." ICLR.
- Makoviychuk, V., et al. (2021). "Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning." NeurIPS.
- Petrenko, A., et al. (2020). "Sample Factory: Egocentric 3D Control from Pixels at 100000 FPS with Asynchronous Reinforcement Learning." ICML.