Introduction to Policy Optimization¶
Policy optimization is one of the most important families of RL algorithms. Instead of learning a value function and deriving a policy from it, policy optimization methods directly optimize the policy to maximize expected return. This page derives the core result — the Policy Gradient Theorem — and builds intuition for how and why it works.
The Policy Optimization Problem¶
We parameterize a policy \(\pi_\theta\) (typically a neural network) and seek to maximize the expected return:
where \(\tau = (s_0, a_0, r_0, \ldots, s_T)\) is a trajectory. We want to compute \(\nabla_\theta J(\theta)\) so we can perform gradient ascent:
The Policy Gradient Theorem¶
The central result: we can compute the gradient of \(J(\theta)\) without differentiating through the environment dynamics.
where \(\Phi_t\) can be any of:
| Choice of \(\Phi_t\) | Name | Variance |
|---|---|---|
| \(R(\tau)\) | Total trajectory return | Highest |
| \(\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\) | Reward-to-go | Lower |
| \(A^{\pi_\theta}(s_t, a_t)\) | Advantage function | Lowest (with baseline) |
Derivation Sketch¶
The key identity (the "log-derivative trick"):
Since \(\pi_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t,a_t)\), taking the log:
Only \(\log \pi_\theta(a_t|s_t)\) depends on \(\theta\), so:
The full gradient becomes:
Key Insight
The environment dynamics \(P(s'|s,a)\) cancel out! This means we can estimate policy gradients without knowing or differentiating through the environment model — we only need to be able to sample trajectories and evaluate \(\log \pi_\theta(a|s)\).
Variance Reduction¶
The basic policy gradient estimator has high variance, making learning slow and unstable. Several techniques reduce variance:
Reward-to-Go¶
Actions at time \(t\) can only affect future rewards. Replacing \(R(\tau)\) with the reward-to-go:
Baselines¶
Subtracting a state-dependent baseline \(b(s_t)\) from \(\Phi_t\) reduces variance without introducing bias:
The optimal baseline is close to \(V^{\pi}(s_t)\), which is why actor-critic methods learn a value function as the baseline.
Generalized Advantage Estimation (GAE)¶
GAE (Schulman et al., 2016) provides a smooth interpolation between low-bias/high-variance (Monte Carlo) and high-bias/low-variance (TD) advantage estimates:
where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error.
- \(\lambda = 0\): pure TD (high bias, low variance)
- \(\lambda = 1\): pure Monte Carlo (low bias, high variance)
- Typical: \(\lambda = 0.95\)
From Theory to Algorithms¶
The policy gradient theorem leads to a family of practical algorithms:
graph TD
PGT[Policy Gradient Theorem] --> REINFORCE
PGT --> VPG[Vanilla Policy Gradient]
VPG --> NPG[Natural Policy Gradient]
NPG --> TRPO
TRPO --> PPO
VPG --> A2C[Advantage Actor-Critic]
A2C --> A3C[Asynchronous A3C]
Each algorithm adds constraints or techniques to make the basic policy gradient more stable and efficient:
- REINFORCE: Direct application with Monte Carlo returns
- VPG: Adds a learned baseline (value function)
- TRPO: Constrains the KL divergence between old and new policies
- PPO: Approximates the trust region with a simpler clipped objective
Continue to the individual algorithm pages for details:
- Policy Gradient Methods (REINFORCE, VPG)
- Trust Region Methods (TRPO, PPO)