Trust Region Methods: TRPO and PPO¶
Vanilla policy gradient methods suffer from a critical problem: a bad gradient step can catastrophically degrade the policy, and there's no easy way to recover. Trust region methods address this by constraining how much the policy can change in a single update, leading to dramatically more stable training.
The Problem: Policy Collapse¶
In standard policy gradient, the update \(\theta \leftarrow \theta + \alpha \hat{g}\) has no guarantee about how much the policy actually changes. A seemingly small parameter update can cause a large change in the action distribution, leading to:
- Collecting bad data with the degraded policy
- Further degrading the policy from that bad data
- Irreversible collapse
Monotonic Improvement Theory¶
Kakade & Langford (2002) showed that policy improvement can be guaranteed if we control the KL divergence between consecutive policies. Specifically:
This motivates constraining or penalizing the KL divergence during optimization.
TRPO (Trust Region Policy Optimization)¶
TRPO (Schulman et al., 2015) formalizes the trust region idea as a constrained optimization problem.
Objective¶
Maximize the surrogate objective subject to a KL constraint:
where \(\delta\) is the trust region radius (typically \(\delta = 0.01\)).
Solution Method¶
TRPO uses a second-order approximation:
- Compute the policy gradient \(g = \nabla_\theta L(\theta)\big|_{\theta_\text{old}}\)
- Compute the Fisher information matrix \(F = \nabla_\theta^2 D_{\text{KL}}\big|_{\theta_\text{old}}\)
- Compute the natural gradient direction: \(\hat{s} = F^{-1}g\)
- Line search with step size \(\beta = \sqrt{2\delta / (g^T F^{-1} g)}\)
- Update: \(\theta \leftarrow \theta_{\text{old}} + \beta \hat{s}\)
In practice, \(F^{-1}g\) is computed using the conjugate gradient method (avoiding explicit matrix inversion).
Properties¶
- Guaranteed monotonic improvement (under certain assumptions)
- Stable training — no catastrophic policy collapse
- Expensive: requires conjugate gradient solver and line search
- Typical trust region: \(\delta \in [0.001, 0.05]\)
PPO (Proximal Policy Optimization)¶
PPO (Schulman et al., 2017) achieves similar stability to TRPO but with a much simpler implementation using only first-order optimization.
PPO-Clip¶
The key idea: instead of a hard KL constraint, clip the importance sampling ratio to prevent large policy changes:
where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\) is the probability ratio, and \(\epsilon\) is the clip range (typically 0.1--0.2).
Intuition: The clipping prevents the ratio from moving too far from 1 in either direction, effectively creating a trust region without second-order optimization.
How the clipping works:
| Advantage \(\hat{A}_t\) | Ratio \(r_t\) | Effect |
|---|---|---|
| \(\hat{A}_t > 0\) (good action) | \(r_t > 1+\epsilon\) | Clipped — don't over-exploit |
| \(\hat{A}_t > 0\) (good action) | \(r_t < 1+\epsilon\) | Not clipped — encourage this action |
| \(\hat{A}_t < 0\) (bad action) | \(r_t < 1-\epsilon\) | Clipped — already moved away enough |
| \(\hat{A}_t < 0\) (bad action) | \(r_t > 1-\epsilon\) | Not clipped — discourage this action |
Full PPO Objective¶
In practice, PPO optimizes a combined objective:
- \(L^{\text{VF}}\): Value function loss (MSE between \(V_\phi(s)\) and returns)
- \(\mathcal{H}\): Entropy bonus for exploration
- \(c_1 \approx 0.5\), \(c_2 \approx 0.01\) (typical)
Pseudocode: PPO-Clip
Initialize policy θ, value function φ
for iteration = 1, 2, ... do
Collect T timesteps with current policy π_θ
Compute advantages Â_t using GAE(γ, λ)
for epoch = 1, ..., K do // K = 3-10 epochs per iteration
for minibatch in random_split(data) do
Compute ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
L_clip = min(r_t Â_t, clip(r_t, 1-ε, 1+ε) Â_t)
L_vf = (V_φ(s_t) - G_t)²
L = -L_clip + c1·L_vf - c2·H(π_θ)
Update θ, φ with gradient descent on L
end for
end for
θ_old ← θ
end for
Why PPO is So Popular¶
- Simple implementation — no conjugate gradient, no line search
- Stable — clipping provides soft trust region guarantees
- General — works for discrete and continuous action spaces
- Scalable — easily parallelized (see Distributed RL)
- Robust — relatively insensitive to hyperparameters
PPO is the default choice for many applications: robotics (locomotion, manipulation), game AI (Dota 2, StarCraft), RLHF for language models, and more.
Key Hyperparameters¶
| Parameter | Typical Range | Notes |
|---|---|---|
| Clip range \(\epsilon\) | 0.1 -- 0.3 | 0.2 is a common default |
| GAE \(\lambda\) | 0.9 -- 0.99 | 0.95 is typical |
| Discount \(\gamma\) | 0.99 -- 0.999 | Task-dependent |
| Epochs \(K\) | 3 -- 10 | More epochs → risk of overfitting to batch |
| Minibatch size | 32 -- 4096 | Larger for more complex tasks |
| Learning rate | 1e-4 -- 3e-4 | Often linearly annealed |
| Entropy coeff \(c_2\) | 0.0 -- 0.01 | Higher for exploration-heavy tasks |
TRPO vs. PPO¶
| Aspect | TRPO | PPO |
|---|---|---|
| Constraint | Hard KL constraint | Soft (clipping) |
| Optimization | 2nd order (conjugate gradient) | 1st order (Adam) |
| Implementation | Complex | Simple |
| Performance | Slightly more principled | Comparable or better in practice |
| Scalability | Harder to distribute | Easy to distribute |
| Usage | Research, theory | Industry standard |
Key References¶
- Kakade, S. & Langford, J. (2002). "Approximately Optimal Approximate Reinforcement Learning." ICML.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P. (2015). "Trust Region Policy Optimization." ICML.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.