Practical Scaling Guide
This page provides practical advice for scaling RL training — from single-GPU experiments to multi-machine distributed training. We cover common bottlenecks, optimization strategies, and debugging tips.
Identifying Your Bottleneck
Before scaling, identify what's limiting throughput:
The Three Bottlenecks
Training cycles through three stages, any of which can limit throughput: environment stepping → data transfer → policy training → back to environment stepping.
| Bottleneck | Symptom | Solution |
|---|---|---|
| Environment | GPU utilization low, environment steps/sec low | More parallel envs, GPU envs, faster sim |
| Data transfer | High CPU-GPU transfer time, network latency | Vectorized envs, shared memory, GPU envs |
| Training | GPU at 100%, want faster convergence | Larger batches, mixed precision, model optimization |
Profiling Checklist
- GPU utilization: Check with `nvidia-smi`; if the GPU is idle most of the time, you're environment-bottlenecked
- Environment throughput: Measure steps per second per environment
- Data pipeline: Check for unnecessary copies, serialization overhead
- Training throughput: Measure gradient steps per second
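The checklist above can be approximated with plain wall-clock timing. The following is a minimal sketch, not a full profiler: `time_phase` and `find_bottleneck` are hypothetical helper names, and the phase timings in the example are illustrative numbers rather than measurements. When timing GPU work, the device should be synchronized before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch), otherwise asynchronous kernels make the numbers misleading.

```python
import time


def time_phase(fn, repeats=5):
    """Return the mean wall-clock seconds of calling fn() `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def find_bottleneck(phase_seconds):
    """Return (slowest_phase, fraction_of_total_time) for a dict of timings."""
    total = sum(phase_seconds.values())
    name = max(phase_seconds, key=phase_seconds.get)
    return name, phase_seconds[name] / total


# Illustrative per-iteration timings in seconds (assumed, not measured).
timings = {"env_step": 0.030, "data_transfer": 0.004, "train_step": 0.006}
slowest, share = find_bottleneck(timings)
print(f"{slowest} takes {share:.0%} of each iteration")
```

If one phase dominates the iteration time, that is the row of the bottleneck table to start with.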
Scaling On-Policy Methods (PPO)
PPO is the most commonly scaled RL algorithm, especially for robotics.
Single-GPU Scaling
Environments: 4096 - 65536 (GPU-accelerated)
Batch size: environments × horizon_length
Mini-batch size: batch_size / num_mini_batches
PPO epochs: 3-5
Key insight: With GPU-accelerated environments (e.g., Isaac Gym), a single GPU can run thousands of environments, so the bottleneck shifts to the policy training step.
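Effective Batch Size
The effective batch size for PPO is the number of transitions that contribute to each policy update: \(B_{\text{eff}} = N_{\text{envs}} \times T\), where \(N_{\text{envs}}\) is the total number of parallel environments (summed over all GPUs) and \(T\) is the horizon length.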
Larger effective batch size → more stable gradients → can use larger learning rates.
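As a worked instance of the batch-size definitions in the single-GPU configuration, the sketch below computes the per-GPU batch, the mini-batch size, and the effective batch size under data parallelism. The function name `ppo_batch_sizes` and the example values (`horizon_length=16`, `num_mini_batches=4`) are assumptions chosen for illustration.

```python
def ppo_batch_sizes(num_envs, horizon_length, num_mini_batches, num_gpus=1):
    """Compute PPO batch sizes; num_envs is the environment count per GPU."""
    per_gpu_batch = num_envs * horizon_length
    if per_gpu_batch % num_mini_batches != 0:
        raise ValueError("batch size must be divisible by num_mini_batches")
    return {
        "batch_size": per_gpu_batch,
        "mini_batch_size": per_gpu_batch // num_mini_batches,
        # With synchronized gradients, every GPU's batch feeds the same update.
        "effective_batch_size": per_gpu_batch * num_gpus,
    }


sizes = ppo_batch_sizes(num_envs=4096, horizon_length=16, num_mini_batches=4)
print(sizes)
```

With 4096 environments and a horizon of 16, each update sees 65,536 transitions split into mini-batches of 16,384.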
Multi-GPU Scaling
For PPO across multiple GPUs:
# Data parallel PPO
# Each GPU runs a fraction of the environments
# Gradients are synchronized via AllReduce
# GPU 0: envs 0-1023, GPU 1: envs 1024-2047, ...
# After backward pass: AllReduce gradients
# Synchronized update
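To make the data-parallel scheme concrete, here is a toy simulation in pure Python, assuming a one-parameter least-squares model in place of a policy. `all_reduce_mean` is a stand-in for a real collective (in PyTorch this role is typically played by `torch.distributed.all_reduce` or `DistributedDataParallel`); all names here are hypothetical. The point it illustrates is that averaging per-shard gradients over equally sized shards reproduces the full-batch gradient, so every replica applies the same update.

```python
def shard_gradient(w, xs, ys):
    """Gradient of the mean of 0.5 * (w * x - y) ** 2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for AllReduce: every worker receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(w, shards, lr):
    """One synchronized update; returns the new weight held by each worker."""
    grads = [shard_gradient(w, xs, ys) for xs, ys in shards]  # local backward
    synced = all_reduce_mean(grads)                           # gradient sync
    return [w - lr * g for g in synced]                       # identical update


# Two "GPUs", each holding half of the data (targets follow y = 2x).
shards = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]
print(data_parallel_step(w=0.0, shards=shards, lr=0.1))
```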
Scaling Rule of Thumb
When doubling the number of environments (and thus batch size), increase the learning rate by ~\(\sqrt{2}\) to maintain similar convergence.
Scaling Off-Policy Methods (SAC, TD3)
Off-policy methods have different scaling characteristics due to the replay buffer.
Key Considerations
- Replay buffer memory: Can be the limiting factor (millions of transitions × observation size)
- Update-to-data ratio: How many gradient steps per environment step
- Data freshness: Off-policy methods tolerate stale data, but there are limits
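The replay buffer memory consideration is easy to estimate up front. The sketch below assumes each transition stores an observation, the next observation, an action, a reward, and a done flag, all as 4-byte floats; the function name and the example dimensions are illustrative assumptions.

```python
def replay_buffer_bytes(capacity, obs_dim, act_dim, bytes_per_value=4):
    """Estimate buffer size: obs + next_obs + action + reward + done flag."""
    values_per_transition = 2 * obs_dim + act_dim + 2
    return capacity * values_per_transition * bytes_per_value


# One million transitions with 64-dim observations and 8-dim actions.
size = replay_buffer_bytes(capacity=1_000_000, obs_dim=64, act_dim=8)
print(f"{size / 1e9:.3f} GB")
```

Low-dimensional state observations fit comfortably in RAM at this scale; image observations are orders of magnitude larger per transition, which is where sharding or compression becomes relevant.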
Architecture
Actors: N parallel workers collecting experience
Buffer: Centralized or distributed replay buffer
Learner: GPU training from buffer samples
Ratio: update_steps / env_steps = 1-20
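The update-to-data ratio in the layout above can be read as a simple interleaving rule. This is a minimal single-process sketch, assuming an integer ratio and an optional warm-up period before learning starts; `training_schedule` and `warmup_steps` are hypothetical names, and a real system would run actors and the learner concurrently.

```python
def training_schedule(num_env_steps, utd_ratio, warmup_steps=0):
    """Return the ordered list of 'collect' and 'update' events."""
    events = []
    for step in range(1, num_env_steps + 1):
        events.append("collect")  # one environment step
        if step > warmup_steps:
            # utd_ratio gradient steps for every environment step
            events.extend(["update"] * utd_ratio)
    return events


events = training_schedule(num_env_steps=4, utd_ratio=2, warmup_steps=1)
print(events.count("collect"), events.count("update"))
```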
Distributed Replay Buffer
For large-scale off-policy training:
- Prioritized: Rank transitions by TD error for efficient sampling
- Distributed: Shard the buffer across machines for memory
- Ring buffer: Fixed-size, overwrites oldest data
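Of the three ideas above, the ring buffer is the simplest to show. The class below is a minimal sketch with uniform sampling (with replacement); it does not implement prioritization or sharding, and the class name is an assumption for illustration.

```python
import random


class RingReplayBuffer:
    """Fixed-size replay buffer that overwrites the oldest transition."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0  # slot that the next add() will write to

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition  # overwrite oldest
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        """Uniformly sample batch_size transitions with replacement."""
        return [rng.choice(self.storage) for _ in range(batch_size)]

    def __len__(self):
        return len(self.storage)


buffer = RingReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t)
print(sorted(buffer.storage))  # only the three newest items remain
```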
Common Pitfalls
1. Learning Rate Scaling
Problem: Increasing batch size without adjusting learning rate leads to poor convergence.
Solution: Apply the linear scaling rule: multiply the learning rate by the factor by which the batch size increased. For PPO, square-root scaling (multiply by the square root of that factor) often works better.
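Both scaling rules reduce to one line of arithmetic. The helper below is a sketch with a hypothetical name and an illustrative base learning rate of 3e-4; treat the result as a starting point for tuning rather than a guaranteed setting.

```python
import math


def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="sqrt"):
    """Scale a learning rate for a new batch size using the named rule."""
    factor = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * factor
    if rule == "sqrt":
        return base_lr * math.sqrt(factor)
    raise ValueError(f"unknown rule: {rule!r}")


# Doubling the batch size from 4096 to 8192 transitions.
print(scale_learning_rate(3e-4, 4096, 8192, rule="linear"))
print(scale_learning_rate(3e-4, 4096, 8192, rule="sqrt"))
```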
2. Observation Normalization
Problem: Observation statistics change during training, causing instability.
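Solution: Apply running mean/variance normalization, \(\hat{o} = (o - \mu) / \sqrt{\sigma^2 + \epsilon}\), where \(\mu\) and \(\sigma^2\) are running estimates of the observation mean and variance and \(\epsilon\) is a small constant for numerical stability.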
In distributed settings, synchronize normalization statistics across workers.
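One way to keep running statistics and synchronize them across workers is sketched below. It assumes scalar observations for brevity (a real implementation would track one mean and variance per observation dimension), uses Welford's online update, and merges worker statistics with the standard parallel-variance formula. The class and method names are hypothetical.

```python
import math


class RunningMeanStd:
    """Running mean and (population) variance of a scalar stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Welford's online update for a single observation."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        return self.m2 / self.count if self.count > 0 else 1.0

    def merge(self, other):
        """Fold another worker's statistics into this one."""
        if other.count == 0:
            return
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total

    def normalize(self, x, eps=1e-8):
        return (x - self.mean) / math.sqrt(self.var + eps)


# Two workers see different data, then synchronize.
worker_a, worker_b = RunningMeanStd(), RunningMeanStd()
for x in [1.0, 2.0, 3.0]:
    worker_a.update(x)
for x in [4.0, 5.0, 6.0, 7.0]:
    worker_b.update(x)
worker_a.merge(worker_b)
print(worker_a.mean, worker_a.var)
```

After the merge, `worker_a` holds the same statistics it would have computed from all seven observations directly.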
3. Reward Scaling
Problem: Reward magnitudes vary across tasks or change during training.
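Solution: Normalize returns or use symmetric reward functions. DreamerV3's symlog transformation compresses large magnitudes while preserving sign: \(\operatorname{symlog}(x) = \operatorname{sign}(x)\,\ln(|x| + 1)\).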
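For reference, symlog and its inverse symexp can be written in a few lines. This is a scalar sketch using only the standard library; `log1p` and `expm1` are used for accuracy near zero, and an array-based version would apply the same formulas elementwise.

```python
import math


def symlog(x):
    """sign(x) * ln(|x| + 1): compresses large magnitudes, keeps the sign."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


for reward in [-1000.0, -1.0, 0.0, 1.0, 1000.0]:
    print(reward, symlog(reward))
```

A reward of 1000 maps to roughly 6.9, while rewards near zero are left almost unchanged.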
4. Seed Variance
Problem: RL results vary significantly across random seeds.
Solution: Always report results over 3-5 seeds. Use statistical tests for comparisons.
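A minimal way to summarize per-seed results and compare two configurations is sketched below. The return values are made-up illustrative numbers, not results from any experiment, and the helper names are hypothetical. Welch's t statistic is shown as one possible test; with only a handful of seeds, its power is limited.

```python
import math
import statistics


def summarize(returns):
    """Return (mean, sample std, standard error) over per-seed returns."""
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)
    return mean, std, std / math.sqrt(len(returns))


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    diff = statistics.mean(a) - statistics.mean(b)
    return diff / math.sqrt(var_a / len(a) + var_b / len(b))


# Hypothetical final returns for two configurations, five seeds each.
config_a = [100.0, 110.0, 90.0, 105.0, 95.0]
config_b = [80.0, 85.0, 75.0, 82.0, 78.0]
mean, std, sem = summarize(config_a)
print(f"A: {mean:.1f} +/- {sem:.1f} (std {std:.1f})")
print(f"Welch t = {welch_t(config_a, config_b):.2f}")
```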
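5. Gradient Staleness
Problem: In asynchronous systems, actors collect experience with stale policy parameters, so the learner trains on data that no longer matches the current policy.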
Solution: V-trace correction (IMPALA), importance sampling, or limit staleness to 1-2 updates.
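Two of these remedies can be sketched briefly: truncated importance weights and a staleness filter. This is not a full V-trace implementation; it shows only the clipped ratio between the learner's policy and the actor's behavior policy that V-trace builds on. The function names, the `policy_version` field, and the default thresholds are assumptions for illustration.

```python
import math


def truncated_importance_weights(target_logp, behavior_logp, rho_bar=1.0):
    """Clip pi(a|s) / mu(a|s) at rho_bar, computed from log-probabilities."""
    return [
        min(rho_bar, math.exp(t - b))
        for t, b in zip(target_logp, behavior_logp)
    ]


def drop_stale(batches, learner_version, max_staleness=2):
    """Keep batches collected at most max_staleness updates ago."""
    return [
        batch for batch in batches
        if learner_version - batch["policy_version"] <= max_staleness
    ]


weights = truncated_importance_weights(
    target_logp=[math.log(0.5), math.log(0.2)],
    behavior_logp=[math.log(0.25), math.log(0.4)],
)
print(weights)  # first ratio (2.0) is clipped, second (0.5) is kept
```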
Debugging Distributed RL
Key Metrics to Monitor
| Metric | What to Watch For |
|---|---|
| Episode return | Upward trend, not oscillating wildly |
| Policy entropy | Gradual decrease (not collapse) |
| Value function loss | Decreasing, not exploding |
| KL divergence | Within target range for PPO |
| Gradient norm | Stable, no spikes |
| Environment FPS | Consistent, no degradation |
| GPU utilization | High (>80%) |
| Data queue length | Not backing up |
Common Failure Modes
- Reward hacking: Agent exploits simulation bugs. Fix: review reward function, add constraints.
- Policy collapse: Entropy drops to zero. Fix: increase entropy bonus, check learning rate.
- NaN/Inf values: Numerical instability. Fix: gradient clipping, observation normalization, check reward magnitudes.
- Slow convergence: May be normal for hard tasks. Check: hyperparameters, reward shaping, curriculum.
- Memory leak: Common in long training runs. Profile memory usage over time.
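For the NaN/Inf failure mode, a common guard is to check the global gradient norm before clipping and fail fast when it is not finite. The sketch below operates on a flat list of floats to stay framework-agnostic; the function name and the small epsilon in the denominator are assumptions. Deep learning frameworks ship their own clipping utilities that should be preferred in practice.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm; reject NaN/Inf."""
    norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(norm):
        raise FloatingPointError("non-finite gradient norm")
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads], norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm, clipped)
```

Logging the pre-clip norm returned here also provides the "gradient norm" metric from the monitoring table.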
Hardware Recommendations
Single-Machine Setup
| Use Case | Hardware | Expected FPS |
|---|---|---|
| Prototyping | 1x RTX 4090 | 10K-50K (Atari) |
| Locomotion (Isaac Gym) | 1x RTX 4090 | 100K-500K |
| Large-scale single-machine | 1x A100 80GB | 200K-1M |
Multi-Machine Setup
| Use Case | Hardware | Expected FPS |
|---|---|---|
| Large-scale on-policy | 8x A100 + 32 CPU workers | 1M+ |
| Massive off-policy | 256 CPU actors + 8 GPU learners | 5M+ |
| Robotics research | 1-4 GPUs (Isaac Gym) | Sufficient for most tasks |
Key References
- Espeholt, L., et al. (2020). "SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference." ICLR.
- Petrenko, A., et al. (2020). "Sample Factory: Egocentric 3D Control from Pixels at 100000 FPS with Asynchronous Reinforcement Learning." ICML.
- Makoviychuk, V., et al. (2021). "Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning." NeurIPS.
- Weng, J., et al. (2022). "EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine." NeurIPS.
Work in Progress
This section will be expanded with more detailed benchmarks, configuration examples, and case studies.