Architecture Paradigms for Distributed RL¶

This page covers the major distributed RL architectures, from early asynchronous designs to modern high-throughput systems. Understanding these paradigms helps you choose the right system for your workload and design custom systems when needed.

A3C: Asynchronous Advantage Actor-Critic¶

A3C (Mnih et al., 2016) was one of the first widely adopted parallel RL architectures. The original implementation ran its workers as threads on a single multi-core CPU machine.

Design¶

graph TD
    G[Global Parameters θ] --> W1[Worker 1<br/>Env + Policy]
    G --> W2[Worker 2<br/>Env + Policy]
    G --> W3[Worker 3<br/>Env + Policy]
    G --> WN[Worker N<br/>Env + Policy]
    W1 -->|async gradient| G
    W2 -->|async gradient| G
    W3 -->|async gradient| G
    WN -->|async gradient| G

Each worker:

Copies global parameters \(\theta\)
Runs environment for \(n\) steps
Computes local gradients
Asynchronously updates global parameters (no locking)
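
The worker loop above can be sketched with Python threads. This is a toy illustration under stated assumptions, not the original implementation: a quadratic loss stands in for the actor-critic loss, the rollout step is only a comment, and `TARGET`, `local_gradient`, `worker`, and the learning rate are hypothetical names and values chosen for the example.

```python
import threading

# Hypothetical setup: the loss 0.5 * ||theta - TARGET||^2 stands in for the
# actor-critic loss so the example stays self-contained.
TARGET = [1.0, -2.0]
global_theta = [0.0, 0.0]  # shared parameters, updated without a lock


def local_gradient(theta):
    """Gradient of the stand-in loss at a worker's local parameter copy."""
    return [t - target for t, target in zip(theta, TARGET)]


def worker(num_updates, learning_rate=0.05):
    for _ in range(num_updates):
        # 1. copy the global parameters
        local_theta = list(global_theta)
        # 2. a real worker would roll out n environment steps here
        # 3. compute gradients from the (possibly stale) local copy
        grad = local_gradient(local_theta)
        # 4. lock-free update: other workers may write in between
        for i, g in enumerate(grad):
            global_theta[i] -= learning_rate * g


threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(global_theta)  # ends up close to TARGET despite unsynchronized updates
```

Because no lock is taken, a worker can apply a gradient computed from parameters that another worker has already changed. With a small learning rate the toy problem still converges, which is the property A3C relies on.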

Properties¶

Aspect	Detail
Parallelism	Multiple CPU workers
Update	Asynchronous (no synchronization barrier)
Data freshness	Workers may use stale parameters
Throughput	Moderate (CPU-bound)
Algorithm	On-policy advantage actor-critic (A2C is the later synchronous variant)

Limitations¶

Stale gradients: Workers compute gradients with outdated parameters
CPU-bound: Policy inference and gradient computation both run on CPU, which is slow for large networks
No replay: On-policy, so each rollout is discarded after a single gradient computation

Ape-X: Distributed Prioritized Experience Replay¶

Ape-X (Horgan et al., 2018) scales the actor-learner separation to hundreds of actors that feed a shared prioritized replay buffer.

Design¶

graph LR
    A1[Actor 1] -->|experience| RB[Replay Buffer<br/>Prioritized]
    A2[Actor 2] -->|experience| RB
    AN[Actor N] -->|experience| RB
    RB -->|sampled batches| L[Learner<br/>GPU]
    L -->|updated params| A1
    L -->|updated params| A2
    L -->|updated params| AN

Actors (many, CPU): Run environments, generate experience with slightly stale policies
Learner (one, GPU): Samples from replay buffer, performs gradient updates
Replay buffer: Centralized, prioritized by TD error
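
The replay component can be sketched as a small prioritized buffer. This is a minimal single-process sketch, assuming proportional prioritization with an exponent `alpha`; the class name `PrioritizedReplay`, its methods, and the default values are hypothetical, and it omits the importance-sampling weights and the networking that a distributed buffer would need.

```python
import random


class PrioritizedReplay:
    """Toy ring buffer that samples transitions in proportion to |TD error|^alpha."""

    def __init__(self, capacity, alpha=0.6, seed=0):
        self.capacity = capacity
        self.alpha = alpha
        self.items = []
        self.priorities = []
        self.next_slot = 0
        self.rng = random.Random(seed)

    def _priority(self, td_error):
        # The small constant keeps zero-error transitions sampleable.
        return (abs(td_error) + 1e-6) ** self.alpha

    def add(self, transition, td_error):
        if len(self.items) < self.capacity:
            self.items.append(transition)
            self.priorities.append(self._priority(td_error))
        else:
            # Buffer is full: overwrite the oldest entry.
            self.items[self.next_slot] = transition
            self.priorities[self.next_slot] = self._priority(td_error)
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self, batch_size):
        indices = self.rng.choices(
            range(len(self.items)), weights=self.priorities, k=batch_size
        )
        return indices, [self.items[i] for i in indices]

    def update_priorities(self, indices, td_errors):
        # Called by the learner after it recomputes TD errors for a batch.
        for i, td_error in zip(indices, td_errors):
            self.priorities[i] = self._priority(td_error)


buffer = PrioritizedReplay(capacity=4)
for name, td_error in [("a", 0.0), ("b", 0.0), ("c", 0.0), ("d", 10.0)]:
    buffer.add(name, td_error)

indices, batch = buffer.sample(batch_size=1000)
share_of_d = batch.count("d") / len(batch)  # "d" dominates: it has the largest TD error
```

In this sketch the actors would call `add`, while the learner alternates between `sample` and `update_priorities`.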

Properties¶

Massive data throughput (hundreds of actors)
Off-policy (DQN family), so it tolerates stale data naturally
GPU learner stays busy: it trains continuously from replay instead of waiting for actors
Actor count scales independently of the learner

IMPALA: Importance Weighted Actor-Learner Architecture¶

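IMPALA (Espeholt et al., 2018) applies the actor-learner separation to actor-critic policy gradient methods. Because actors act with slightly older parameters than the learner holds, the collected data is mildly off-policy, and IMPALA corrects for this with importance sampling.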

Design¶

Similar to Ape-X but for policy gradient methods:

Actors: Collect trajectories with behavior policy \(\mu\)
Learner: Updates policy \(\pi_\theta\) using trajectories from actors
V-trace: Importance sampling correction for off-policy data

V-Trace¶

For a trajectory segment that starts at time step \(k\) and spans \(n\) steps, the V-trace target corrects for the policy lag between actors and learner:

\[ v_k = V(s_k) + \sum_{t=k}^{k+n-1} \gamma^{t-k} \left(\prod_{i=k}^{t-1} c_i\right) \delta_t V \]

where:

\[ \delta_t V = \rho_t (r_t + \gamma V(s_{t+1}) - V(s_t)) \]

\[ \rho_t = \min\left(\bar{\rho}, \frac{\pi(a_t|s_t)}{\mu(a_t|s_t)}\right), \quad c_i = \min\left(\bar{c}, \frac{\pi(a_i|s_i)}{\mu(a_i|s_i)}\right) \]

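The truncated importance weights \(\rho_t\) and \(c_i\) bound the variance while correcting for policy lag. The cap \(\bar{\rho}\) determines which value function the targets converge to, while \(\bar{c}\) affects how quickly they converge. When \(\pi = \mu\) and both caps are at least 1, the target reduces to the ordinary \(n\)-step return.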
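
The target can be computed with a single backward pass over a trajectory, because the sum in the definition unrolls into a recursion on the difference between each target and its value estimate. The sketch below assumes plain Python lists, one trajectory, and no episode termination inside it; the function name `vtrace_targets` and its argument names are hypothetical.

```python
def vtrace_targets(rewards, values, bootstrap_value, pi_probs, mu_probs,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Return the V-trace target for every step of one trajectory.

    values[t] is V(s_t); bootstrap_value is V of the state after the last step.
    pi_probs[t] and mu_probs[t] are pi(a_t|s_t) and mu(a_t|s_t).
    """
    n = len(rewards)
    next_values = values[1:] + [bootstrap_value]
    targets = [0.0] * n
    correction = 0.0  # running value of (target - V) at step t + 1
    for t in reversed(range(n)):
        ratio = pi_probs[t] / mu_probs[t]
        rho = min(rho_bar, ratio)  # truncated weight on the TD error
        c = min(c_bar, ratio)      # truncated weight on the trace
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        correction = delta + gamma * c * correction
        targets[t] = values[t] + correction
    return targets


# On-policy data (pi == mu) with gamma = 1 gives plain n-step returns.
on_policy = vtrace_targets(
    rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0], bootstrap_value=0.0,
    pi_probs=[0.5, 0.5, 0.5], mu_probs=[0.5, 0.5, 0.5], gamma=1.0,
)
print(on_policy)  # [3.0, 2.0, 1.0]

# Ratios above the caps are truncated to 1, so the targets stay the same.
clipped = vtrace_targets(
    rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0], bootstrap_value=0.0,
    pi_probs=[0.9, 0.9, 0.9], mu_probs=[0.1, 0.1, 0.1], gamma=1.0,
)
```

Here every step's sum runs to the end of the trajectory, which is the usual choice when a whole rollout is processed at once.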

Properties¶

Works with policy gradient algorithms (not just Q-learning)
Tolerates moderate policy lag (actors typically a few learner updates behind)
GPU-efficient: the learner trains on large batches of trajectories, which keeps the accelerator busy
Scales to hundreds of actors

SEED RL: Scalable, Efficient Deep-RL¶

SEED RL (Espeholt et al., 2020) addresses a key bottleneck in IMPALA: policy inference on CPU actors is slow.

Key Insight¶

Move policy inference to the learner (GPU/TPU):

graph LR
    E1[Env 1<br/>CPU] -->|obs| INF[Inference Server<br/>GPU/TPU]
    E2[Env 2<br/>CPU] -->|obs| INF
    EN[Env N<br/>CPU] -->|obs| INF
    INF -->|actions| E1
    INF -->|actions| E2
    INF -->|actions| EN
    INF -->|trajectories| L[Learner<br/>GPU/TPU]

Environment workers (CPU): Only step the environment, send observations to the inference server
Inference server (GPU/TPU): Batches observations from all workers, runs policy inference
Learner (GPU/TPU): Co-located with inference server, trains the model
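
The division of labor above can be sketched in a single process. This is an illustration under stated assumptions, not SEED RL's implementation: `ToyEnv` and `InferenceServer` are hypothetical classes, the policy is a hand-written rule instead of a neural network, and a real system would exchange observations and actions over the network.

```python
class ToyEnv:
    """Hypothetical CPU-side environment whose state is a single integer."""

    def __init__(self, start):
        self.state = start

    def step(self, action):
        self.state += action
        return self.state  # the new observation


class InferenceServer:
    """Stand-in for the accelerator-side policy; counts batched forward passes."""

    def __init__(self):
        self.forward_passes = 0
        self.trajectory = []  # (observations, actions) per step, kept for the learner

    def act(self, observations):
        self.forward_passes += 1
        # Toy policy: push every state toward zero. A real server would run
        # one batched neural-network forward pass here.
        actions = [(-1 if obs > 0 else 1 if obs < 0 else 0) for obs in observations]
        self.trajectory.append((list(observations), actions))
        return actions


envs = [ToyEnv(start) for start in (-3, 0, 2, 5)]
server = InferenceServer()
observations = [env.state for env in envs]

for _ in range(6):
    actions = server.act(observations)  # one batched call serves all envs
    observations = [env.step(a) for env, a in zip(envs, actions)]

print(server.forward_passes)  # 6 batched passes instead of 24 per-env passes
```

The environment workers hold no model copy at all; they only step and report, which is why the trajectories can accumulate next to the learner.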

Advantages Over IMPALA¶

Aspect	IMPALA	SEED RL
Policy inference	On CPU (slow)	On GPU/TPU (fast, batched)
Network traffic	Parameters → actors	Observations ↔ actions
Model copies	N copies (one per actor)	1 copy (on learner)
Parameter staleness	A few updates	Near-zero (inference uses the learner's current weights)
Large models	Limited by CPU inference speed	Limited by accelerator memory and compute

Performance¶

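The SEED RL paper reports reaching state-of-the-art Atari-57 results about three times faster in wall-clock time, and a 40% to 80% reduction in experiment cost for the scenarios it evaluates. Speedups over IMPALA vary with the task and the number of accelerator cores used, so check the paper for the exact configuration behind any single figure.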

GPU-Accelerated Environments¶

A more recent paradigm: run environments entirely on GPU alongside the policy network.

Isaac Gym / Isaac Lab¶

NVIDIA's GPU-accelerated physics simulation:

Thousands of parallel environments on a single GPU
No CPU-GPU data transfer (everything stays on GPU)
Sub-millisecond amortized time per environment step

This eliminates the traditional bottleneck of moving data between CPU-based environments and GPU-based policy training:

graph LR
    subgraph GPU
        E[Parallel Envs<br/>4096-65536] --> P[Policy Network]
        P --> E
        P --> O[Optimizer]
        O --> P
    end

Sample Factory¶

Sample Factory (Petrenko et al., 2020) achieves very high throughput on a single machine by:

Asynchronous vectorized environments
Careful memory management and batching
Overlapping environment stepping and policy inference

The paper reports throughput above 100,000 environment frames per second on a single machine, measured on 3D environments such as VizDoom.

Summary Comparison¶

Architecture	Year	Actors	Learner	Sync	Best For
A3C	2016	CPU (inference)	CPU (shared)	Async	Simple, small scale
Ape-X	2018	CPU (inference)	GPU	Async	Off-policy (DQN)
IMPALA	2018	CPU (inference)	GPU	Async	On-policy at scale
SEED RL	2020	CPU (env only)	GPU/TPU	Async	Large models, high throughput
GPU envs	2021+	GPU (Isaac Gym)	GPU	Sync	Robotics, continuous control
Sample Factory	2020	CPU (vectorized)	GPU	Async	Single-machine, high FPS

Choosing an Architecture¶

Decision framework:

On-policy (PPO)? → GPU environments (Isaac Gym) or IMPALA/SEED for complex envs
Off-policy (SAC, DQN)? → Ape-X style with replay buffer
Simple environments? → Vectorized on single machine (Sample Factory)
Complex simulation (rendering, physics)? → Actor-learner split (SEED, IMPALA)
Large policy network? → SEED RL (inference on GPU)
Robotics/continuous control? → Isaac Gym + PPO (dominant paradigm)
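
The questions above overlap, so applying them requires an order of precedence. The sketch below encodes one possible ordering as a function; the name `choose_architecture`, its flags, and the rule order are assumptions made for illustration, not a prescribed procedure.

```python
def choose_architecture(algorithm, gpu_simulator=False, large_policy=False,
                        simple_env=False):
    """Suggest an architecture from the decision list; rule order is an assumption.

    algorithm is "on_policy" (e.g. PPO) or "off_policy" (e.g. SAC, DQN).
    """
    if algorithm == "off_policy":
        return "Ape-X"
    if algorithm != "on_policy":
        raise ValueError("algorithm must be 'on_policy' or 'off_policy'")
    if gpu_simulator:
        # Robotics and continuous control with a GPU physics simulator.
        return "GPU environments (Isaac Gym) + PPO"
    if large_policy:
        return "SEED RL"
    if simple_env:
        return "Sample Factory"
    # Complex CPU-side simulation with a moderately sized policy.
    return "IMPALA or SEED RL"


print(choose_architecture("on_policy", gpu_simulator=True))
print(choose_architecture("off_policy"))
```

A different precedence, for example checking `simple_env` first, is equally defensible; the function only makes the chosen order explicit.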

Key References¶

Mnih, V., et al. (2016). "Asynchronous Methods for Deep Reinforcement Learning." ICML.
Horgan, D., et al. (2018). "Distributed Prioritized Experience Replay." ICLR.
Espeholt, L., et al. (2018). "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures." ICML.
Espeholt, L., et al. (2020). "SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference." ICLR.
Makoviychuk, V., et al. (2021). "Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning." NeurIPS.
Petrenko, A., et al. (2020). "Sample Factory: Egocentric 3D Control from Pixels at 100000 FPS with Asynchronous Reinforcement Learning." ICML.