Offline Reinforcement Learning

Offline RL (also called Batch RL) learns a policy entirely from a fixed dataset without any further environment interaction. This paradigm is critical for settings where online exploration is expensive, dangerous, or impractical — such as healthcare, autonomous driving, and robotics.

The Offline RL Problem

Given a static dataset \(\mathcal{D} = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{N}\) collected by one or more behavior policies \(\pi_\beta\), learn a policy \(\pi\) that maximizes expected return.
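As a minimal sketch of the setup above, the dataset can be held as parallel arrays of transitions, and training only ever samples minibatches from it. The field names (`s`, `a`, `r`, `s_next`), the helper names, and the random placeholder values are assumptions for illustration, not part of any particular library.

```python
import numpy as np

def make_dataset(n, state_dim, action_dim, seed=0):
    """Build a toy static dataset of (s, a, r, s') transitions with random values."""
    rng = np.random.default_rng(seed)
    return {
        "s": rng.normal(size=(n, state_dim)),
        "a": rng.uniform(-1.0, 1.0, size=(n, action_dim)),
        "r": rng.normal(size=n),
        "s_next": rng.normal(size=(n, state_dim)),
    }

def sample_batch(dataset, batch_size, rng):
    """Sample a minibatch of transitions uniformly, with replacement."""
    n = dataset["r"].shape[0]
    idx = rng.integers(0, n, size=batch_size)
    return {key: values[idx] for key, values in dataset.items()}

dataset = make_dataset(n=1000, state_dim=3, action_dim=2)
batch = sample_batch(dataset, batch_size=32, rng=np.random.default_rng(1))
```

The point of the sketch is what is missing: there is no environment object and no `step` call anywhere in the training path.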

Why Is Offline RL Hard?

The fundamental challenge is distribution shift between the learned policy \(\pi\) and the behavior policy \(\pi_\beta\):

The Q-function is trained on state-action pairs from \(\pi_\beta\)
During the Bellman backup (policy evaluation), the target queries \(Q(s', a')\) where \(a' \sim \pi(\cdot|s')\)
If \(\pi\) selects actions rarely seen in \(\mathcal{D}\), the Q-values are unreliable (extrapolation error)
The \(\max\) operator in the Bellman backup amplifies overestimation for out-of-distribution actions

This is known as the distributional shift problem or action extrapolation error.
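The amplification by the \(\max\) operator can be illustrated with a small simulation. It assumes a single state in which every action has true value 0 and the estimated Q-values carry independent zero-mean Gaussian noise; the function name and noise model are illustrative assumptions.

```python
import numpy as np

def max_bias(num_actions, noise_std=1.0, trials=10_000, seed=0):
    """Average of max_a Q_hat(s, a) when the true Q is 0 for every action."""
    rng = np.random.default_rng(seed)
    # Each row is one noisy estimate of Q(s, .) with zero-mean error.
    q_hat = rng.normal(0.0, noise_std, size=(trials, num_actions))
    return q_hat.max(axis=1).mean()

bias_1 = max_bias(num_actions=1)      # no max over alternatives: roughly unbiased
bias_10 = max_bias(num_actions=10)    # the max picks out positive errors
bias_100 = max_bias(num_actions=100)  # more candidate actions, larger bias
```

Although every individual estimate is unbiased, the maximum is biased upward, and the bias grows with the number of actions being compared. In offline RL the errors on OOD actions are never corrected by new data, so this bias can compound through repeated backups.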

graph LR
    A[Behavior Policy π_β] -->|collects| D[Dataset D]
    D -->|trains| Q[Q-function]
    Q -->|overestimates OOD actions| P[Learned Policy π]
    P -->|selects OOD actions| Q
    style Q fill:#ff6b6b,color:white

Key Approaches

1. Policy Constraint Methods

Constrain \(\pi\) to stay close to \(\pi_\beta\):

BCQ (Fujimoto et al., 2019): Uses a generative model of \(\pi_\beta\) and only considers actions within its support:

\[ \pi(s) = \arg\max_{a_i + \xi_\phi(s, a_i)} Q_\theta(s, a_i + \xi_\phi(s, a_i)) \]

where \(\{a_i\}\) are sampled from a learned VAE of \(\pi_\beta\), and \(\xi_\phi\) is a small perturbation.
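The action-selection rule above can be sketched as follows. The three callables stand in for the learned generative model, the perturbation network \(\xi_\phi\), and the critic \(Q_\theta\); the toy versions below, the candidate count `n`, and the clipping to `max_action` are assumptions made so the example runs on its own.

```python
import numpy as np

def bcq_select_action(state, sample_actions, perturb, q_value, n=10, max_action=1.0):
    """Sample candidates near the data, nudge them, and take the best under Q."""
    candidates = sample_actions(state, n)                  # (n, action_dim)
    perturbed = candidates + perturb(state, candidates)    # a_i + xi(s, a_i)
    perturbed = np.clip(perturbed, -max_action, max_action)
    scores = q_value(state, perturbed)                     # (n,)
    return perturbed[np.argmax(scores)]

rng = np.random.default_rng(0)

def toy_sample_actions(state, n):
    # Stand-in for the generative model: behavior actions cluster near 0.2.
    return np.clip(0.2 + 0.1 * rng.normal(size=(n, 1)), -1.0, 1.0)

def toy_perturb(state, actions, phi=0.05):
    # Stand-in for xi_phi: a bounded nudge of at most phi toward a = 0.5.
    return phi * np.tanh(0.5 - actions)

def toy_q(state, actions):
    # Stand-in critic that peaks at a = 0.5, away from the bulk of the data.
    return -np.sum((actions - 0.5) ** 2, axis=1)

action = bcq_select_action(np.zeros(3), toy_sample_actions, toy_perturb, toy_q)
```

Because the maximization runs only over the perturbed candidates, the selected action stays close to what the generative model proposes even when the critic prefers actions far from the data.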

BEAR (Kumar et al., 2019): Constrains the learned policy to have support within the data distribution using MMD:

\[ \max_\pi \mathbb{E}_{s \sim \mathcal{D}} \left[ \mathbb{E}_{a \sim \pi} [Q(s,a)] \right] \quad \text{s.t.} \quad \text{MMD}(\pi(\cdot|s), \pi_\beta(\cdot|s)) \leq \epsilon \]
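The MMD term in the constraint can be estimated from samples alone. The sketch below computes a simple (biased) estimate of the squared MMD with a Gaussian kernel; the kernel choice, the bandwidth `sigma`, and the sample sizes are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise Gaussian kernel between rows of x (n, d) and y (m, d)."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    """Biased sample estimate of MMD^2 between the distributions of x and y."""
    return (gaussian_kernel(x, x, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

rng = np.random.default_rng(0)
behavior = rng.normal(0.0, 0.3, size=(200, 1))      # actions from the dataset
close_policy = rng.normal(0.0, 0.3, size=(200, 1))  # policy matching the data
far_policy = rng.normal(1.0, 0.3, size=(200, 1))    # policy far from the data

mmd_close = mmd_squared(close_policy, behavior)
mmd_far = mmd_squared(far_policy, behavior)
```

A policy whose actions drift away from the dataset produces a larger estimate, which is what the constraint threshold \(\epsilon\) is compared against (after taking a square root if the threshold is stated on MMD rather than its square).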

2. Conservative Value Estimation

Learn pessimistic Q-values that underestimate OOD actions:

CQL (Kumar et al., 2020): Adds a regularizer that pushes down Q-values for actions not in the dataset:

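\[ \min_Q \; \alpha \left( \mathbb{E}_{s \sim \mathcal{D}, a \sim \mu} [Q(s,a)] - \mathbb{E}_{s,a \sim \mathcal{D}} [Q(s,a)] \right) + \frac{1}{2} \mathbb{E}_{s,a,s' \sim \mathcal{D}} \left[ (Q(s,a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^k(s,a))^2 \right] \]

The first term penalizes high Q-values for actions sampled from \(\mu\) (e.g., the current policy), which may be OOD, and raises Q-values for actions that appear in the dataset. The second term is the usual Bellman error, where \(\hat{\mathcal{B}}^{\pi_k}\) is the empirical Bellman operator and \(\hat{Q}^k\) is the previous Q-iterate used as a fixed target.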

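Key property: for a sufficiently large \(\alpha\) (and with \(\mu\) set to the current policy), the expected Q-value under the policy is a lower bound on the policy's true value, which makes policy selection conservative.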
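For a discrete action space, the objective above can be written directly on a batch of Q-values. This is a sketch under assumptions: the Bellman targets are taken as precomputed inputs, \(\mu\) is passed in as an explicit per-state distribution over actions, and the function and argument names are illustrative.

```python
import numpy as np

def cql_loss(q_values, actions, td_targets, mu, alpha=1.0):
    """CQL-style loss on a batch with discrete actions.

    q_values:   (B, A) array of Q(s, .) for each state in the batch
    actions:    (B,) indices of the actions stored in the dataset
    td_targets: (B,) precomputed Bellman targets, treated as constants
    mu:         (B, A) action distribution whose Q-values get pushed down
    """
    batch = np.arange(q_values.shape[0])
    q_data = q_values[batch, actions]            # Q(s, a) for dataset actions
    q_mu = np.sum(mu * q_values, axis=1)         # E_{a ~ mu} Q(s, a)
    regularizer = np.mean(q_mu - q_data)         # push down mu, push up data
    bellman_error = 0.5 * np.mean((q_data - td_targets) ** 2)
    return alpha * regularizer + bellman_error, regularizer

q_values = np.array([[1.0, 5.0], [2.0, 0.0]])
actions = np.array([0, 0])                       # dataset always took action 0
mu = np.array([[0.0, 1.0], [1.0, 0.0]])          # greedy w.r.t. q_values
td_targets = np.array([1.0, 2.0])                # chosen so the Bellman error is 0

loss, regularizer = cql_loss(q_values, actions, td_targets, mu)
```

In the first state the greedy action (value 5.0) was never taken in the data, so it contributes a positive penalty; in the second state the greedy action matches the dataset action and contributes nothing.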

3. Implicit Methods

IQL (Kostrikov et al., 2022): Avoids querying OOD actions entirely by using expectile regression on the value function:

\[ \mathcal{L}_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ L_2^\tau (Q_\theta(s,a) - V_\psi(s)) \right] \]

where \(L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)| \cdot u^2\) is the expectile loss.

As \(\tau \to 1\), \(V_\psi(s)\) approaches \(\max_a Q_\theta(s,a)\) taken over actions in the support of the dataset, so it approximates the optimal in-sample value without explicitly maximizing over actions.
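The effect of \(\tau\) can be checked numerically on a single state. The sketch below fits a scalar \(v\) to a handful of Q-values by gradient descent on the expectile loss; the sample values, step size, and iteration count are assumptions chosen for illustration.

```python
import numpy as np

def expectile_loss(u, tau):
    """L_2^tau(u) = |tau - 1(u < 0)| * u^2, applied elementwise."""
    weight = np.abs(tau - (u < 0).astype(float))
    return weight * u ** 2

def fit_expectile(q_samples, tau, steps=2000, lr=0.05):
    """Minimize the mean expectile loss of (q - v) over a scalar v."""
    v = float(np.mean(q_samples))
    for _ in range(steps):
        u = q_samples - v
        weight = np.abs(tau - (u < 0).astype(float))
        grad = np.mean(-2.0 * weight * u)   # derivative of weight * (q - v)^2 in v
        v -= lr * grad
    return v

# Q-values of the dataset actions observed in one state.
q_samples = np.array([0.0, 1.0, 2.0, 10.0])

v_mean = fit_expectile(q_samples, tau=0.5)    # symmetric loss recovers the mean
v_high = fit_expectile(q_samples, tau=0.99)   # asymmetric loss leans toward the max
```

With \(\tau = 0.5\) the loss is an ordinary squared error and the fit is the mean (3.25). With \(\tau = 0.99\) the fit moves to about 9.74, close to the largest observed value of 10, without ever evaluating an action outside the sample.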

Advantages of IQL:

Never queries Q-values for OOD actions
Simple implementation (just regression)
Works with both continuous and discrete actions
Can be combined with advantage-weighted extraction for the policy
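Advantage-weighted extraction can be sketched as a weighted behavior-cloning loss, where each dataset action is weighted by the exponential of its advantage \(Q(s,a) - V(s)\). The inverse temperature `beta`, the weight clip `max_weight`, and the function names are assumptions for illustration; the log-probabilities would come from whatever policy network is being trained.

```python
import numpy as np

def awr_weights(q, v, beta=3.0, max_weight=100.0):
    """Exponentiated-advantage weights, clipped for numerical stability."""
    return np.minimum(np.exp(beta * (q - v)), max_weight)

def awr_policy_loss(log_probs, q, v, beta=3.0, max_weight=100.0):
    """Weighted negative log-likelihood of the dataset actions."""
    return -np.mean(awr_weights(q, v, beta, max_weight) * log_probs)

q = np.array([1.0, 0.0, 50.0])        # Q(s, a) for three dataset actions
v = np.array([0.0, 0.0, 0.0])         # V(s) for the corresponding states
log_probs = np.log(np.array([0.5, 0.5, 0.5]))  # policy log-likelihoods

weights = awr_weights(q, v)
loss = awr_policy_loss(log_probs, q, v)
```

Actions with positive advantage receive more weight than those with zero advantage, and the policy is only ever trained on actions that appear in the dataset.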

4. RL as Sequence Modeling

Decision Transformer (Chen et al., 2021): Reframes offline RL as a sequence prediction problem. A Transformer model predicts actions conditioned on desired returns:

\[ a_t = \text{Transformer}(\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_t, s_t) \]

where \(\hat{R}_t\) is the return-to-go (desired future return).

At test time, conditioning on high return-to-go values elicits high-return behavior.
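The return-to-go inputs can be computed from a logged trajectory with a reverse cumulative sum, and at test time the target is reduced by each reward received. This is a small sketch with assumed function names; it uses undiscounted returns.

```python
def returns_to_go(rewards):
    """Return-to-go at each step: the sum of rewards from that step onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

def update_target_return(target_return, reward):
    """At test time, subtract the reward just received from the remaining target."""
    return target_return - reward

training_rtg = returns_to_go([1.0, 0.0, 2.0])          # [3.0, 2.0, 2.0]
remaining = update_target_return(target_return=3.0, reward=1.0)  # 2.0
```

During training the values come from the data; at test time the first value is a user-chosen target, which is how a desired level of performance is requested from the model.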

Key insight: No Bellman equations, no TD learning, no value functions — just supervised learning on sequences. The Transformer implicitly learns which actions lead to high returns.

Comparison of Approaches

Method	Approach	Strengths	Weaknesses
BCQ	Policy constraint	Conservative, stable	Needs generative model
CQL	Conservative Q	Strong theory, versatile	Hyperparameter sensitive (\(\alpha\))
IQL	Implicit Q	Simple, no OOD queries	Approximate maximization
DT	Sequence modeling	Simple (just supervised learning)	Struggles with stitching

Trajectory Stitching

Trajectory stitching is a key capability that distinguishes offline RL from imitation learning: combining segments of different trajectories in the dataset into a policy that outperforms any single trajectory. CQL and IQL can do this because they propagate value through dynamic programming; Decision Transformer struggles with it because it mainly reproduces trajectory-level patterns.
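A toy example makes stitching concrete. The dataset below holds two short trajectories that share the state `mid`: one starts at `start` and ends with zero return, the other starts elsewhere and reaches the goal. The state and action names, and the tabular Q-iteration that only maximizes over actions seen in the data, are assumptions for illustration rather than any of the algorithms above.

```python
def in_sample_q_iteration(transitions, gamma=1.0, sweeps=10):
    """Tabular Q-iteration whose max ranges only over actions seen in the data."""
    seen = {}
    for s, a, _, _, _ in transitions:
        seen.setdefault(s, set()).add(a)
    q = {(s, a): 0.0 for s, a, _, _, _ in transitions}
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            bootstrap = 0.0
            if not done and s_next in seen:
                bootstrap = max(q[(s_next, b)] for b in seen[s_next])
            q[(s, a)] = r + gamma * bootstrap
    return q, seen

def greedy_action(q, seen, state):
    """Best dataset action in a state according to q."""
    return max(sorted(seen[state]), key=lambda a: q[(state, a)])

dataset = [
    # Trajectory 1: start -> mid -> dead_end, total return 0.
    ("start", "right", 0.0, "mid", False),
    ("mid", "down", 0.0, "dead_end", True),
    # Trajectory 2: other -> mid -> goal, total return 1.
    ("other", "left", 0.0, "mid", False),
    ("mid", "up", 1.0, "goal", True),
]

q, seen = in_sample_q_iteration(dataset)
best_at_mid = greedy_action(q, seen, "mid")
value_from_start = q[("start", "right")]
```

No logged trajectory goes from `start` to `goal`, yet the learned values assign return 1 to `start` by joining the first half of trajectory 1 with the second half of trajectory 2. A method that imitates whole trajectories from `start` has only the zero-return example to copy.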

Practical Considerations

Dataset Quality Matters

Offline RL performance depends heavily on dataset composition:

Expert data: High quality, but offline RL adds little over imitation learning
Mixed data (expert + suboptimal): Best setting for offline RL — can stitch good parts together
Random data: Challenging — limited coverage of good behaviors

Evaluation

Standard evaluation protocol:

Train on fixed dataset (e.g., D4RL benchmarks)
Evaluate the learned policy online in the environment
Report normalized scores relative to expert and random baselines
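The normalization in the last step maps the random baseline to 0 and the expert baseline to 100. The sketch below shows the arithmetic; the function name is an assumption, and the reference scores in the example are hypothetical placeholders rather than values from any benchmark.

```python
def normalized_score(score, random_score, expert_score):
    """Scale a raw return so that random maps to 0 and expert maps to 100."""
    return 100.0 * (score - random_score) / (expert_score - random_score)

# Hypothetical reference returns for some environment.
RANDOM_RETURN = 0.0
EXPERT_RETURN = 200.0

result = normalized_score(150.0, RANDOM_RETURN, EXPERT_RETURN)  # 75.0
```

Scores above 100 are possible when the learned policy outperforms the expert reference, and negative scores when it does worse than random.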

D4RL Benchmark

D4RL (Fu et al., 2020) is the standard benchmark for offline RL, providing datasets of varying quality across environments:

MuJoCo locomotion: HalfCheetah, Hopper, Walker2d with random/medium/medium-replay/medium-expert/expert datasets
AntMaze: Navigation with sparse rewards
Kitchen: Multi-task manipulation

Connection to Other Topics

Embodied AI: Offline RL enables learning from demonstration datasets gathered via teleoperation or other logged robot operation, without requiring online robot interaction.
World Models: Offline model-based methods (e.g., COMBO, MOPO) learn world models from offline data and use them for policy optimization.

Key References

Fujimoto, S., Meger, D., Precup, D. (2019). "Off-Policy Deep Reinforcement Learning without Exploration." ICML.
Kumar, A., et al. (2019). "Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction." NeurIPS.
Kumar, A., Zhou, A., Tucker, G., Levine, S. (2020). "Conservative Q-Learning for Offline Reinforcement Learning." NeurIPS.
Kostrikov, I., Nair, A., Levine, S. (2022). "Offline Reinforcement Learning with Implicit Q-Learning." ICLR.
Chen, L., et al. (2021). "Decision Transformer: Reinforcement Learning via Sequence Modeling." NeurIPS.
Fu, J., Kumar, A., Nachum, O., Tucker, G., Levine, S. (2020). "D4RL: Datasets for Deep Data-Driven Reinforcement Learning." arXiv:2004.06729.
Levine, S., et al. (2020). "Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems." arXiv:2005.01643.