Key Papers in Deep RL
A curated reading list of foundational and influential papers in deep reinforcement learning, organized by topic. For each paper, we provide a brief description of its contribution.
How to Use This List
Start with papers marked with (essential) — these are foundational works that every RL researcher should read. Then explore specific areas based on your research interests. Papers are listed roughly in chronological order within each section.
Foundations
- Playing Atari with Deep Reinforcement Learning. Mnih et al., 2013. (essential) The original DQN paper. Demonstrated that a single architecture can learn to play Atari games from raw pixels.
- Human-level Control through Deep Reinforcement Learning. Mnih et al., Nature 2015. (essential) The Nature DQN paper with experience replay and target networks.
- Policy Gradient Methods for Reinforcement Learning with Function Approximation. Sutton et al., NeurIPS 1999. (essential) The policy gradient theorem.
Policy Optimization
- Trust Region Policy Optimization. Schulman et al., ICML 2015. (essential) Monotonic improvement via KL-constrained policy updates.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation. Schulman et al., ICLR 2016. (essential) GAE: the standard advantage estimation technique used by PPO and many others.
- Proximal Policy Optimization Algorithms. Schulman et al., 2017. (essential) PPO: the de facto standard on-policy algorithm. Clipped surrogate objective.
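To make the GAE and PPO entries above concrete, here is a minimal pure-Python sketch of the two formulas for a single trajectory. The function names, argument layout, and default values (`gamma`, `lam`, `clip_eps`) are illustrative assumptions rather than anything taken from the papers' code releases; a real implementation would use batched tensors and automatic differentiation.

```python
import math


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    `values` must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state reached after the final step
    (use 0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With `gamma=1.0` and `lam=1.0` the estimator reduces to the Monte Carlo return minus the value baseline, while `lam=0.0` gives the one-step TD error; the clip keeps the objective from rewarding ratios that move far from 1 in the direction the advantage favors.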
Value-Based
- Deep Reinforcement Learning with Double Q-learning. van Hasselt et al., AAAI 2016. Fixes DQN overestimation with double Q-learning.
- Prioritized Experience Replay. Schaul et al., ICLR 2016. Sample replay transitions proportional to TD error.
- Rainbow: Combining Improvements in Deep Reinforcement Learning. Hessel et al., AAAI 2018. Combines 6 DQN improvements into one agent.
- A Distributional Perspective on Reinforcement Learning. Bellemare et al., ICML 2017. Learn the full return distribution (C51), not just the mean.
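As a rough illustration of the Double Q-learning and Prioritized Experience Replay entries above, the sketch below computes a double-DQN target and proportional sampling probabilities for a handful of transitions. The function names and the plain-list representation of Q-values are assumptions made for this example; the importance-sampling correction that usually accompanies prioritized replay is left out for brevity.

```python
def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """TD target where the online network selects the next action
    and the target network evaluates it."""
    if done:
        return reward
    # Action selection uses the online network's Q-values...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...but the value of that action comes from the target network.
    return reward + gamma * next_q_target[best_action]


def priority_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: P(i) is proportional to (|delta_i| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]
```

Decoupling selection from evaluation is what reduces the overestimation: a plain DQN target would take the maximum of `next_q_target` directly. Setting `alpha=0.0` in `priority_probabilities` recovers uniform sampling.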
Actor-Critic and Off-Policy
- Asynchronous Methods for Deep Reinforcement Learning. Mnih et al., ICML 2016. A3C: parallel actor-critic training.
- Continuous Control with Deep Reinforcement Learning. Lillicrap et al., ICLR 2016. (essential) DDPG: extends DQN to continuous actions with deterministic policy gradient.
- Addressing Function Approximation Error in Actor-Critic Methods. Fujimoto et al., ICML 2018. (essential) TD3: twin critics, delayed updates, target smoothing.
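- Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Haarnoja et al., ICML 2018. (essential) SAC: maximum entropy RL with a stochastic actor. Automatic temperature tuning was introduced in the follow-up, Soft Actor-Critic Algorithms and Applications (Haarnoja et al., 2018).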
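The TD3 entry above mentions twin critics and target smoothing; a minimal sketch of how those two pieces combine into a single TD target might look like the following. It assumes a scalar action and treats the two target critics as plain callables of the action (the next state is held fixed), which is a simplification for illustration. All names and default values are hypothetical.

```python
import random


def td3_target(reward, done, next_action, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               max_action=1.0, rng=random):
    """TD target with target policy smoothing and clipped double-Q."""
    if done:
        return reward
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = max(-noise_clip, min(rng.gauss(0.0, noise_std), noise_clip))
    smoothed = max(-max_action, min(next_action + noise, max_action))
    # Clipped double-Q: use the smaller of the two target critic estimates.
    return reward + gamma * min(q1_target(smoothed), q2_target(smoothed))
```

The third ingredient, delayed updates, concerns the training loop rather than the target: the actor and the target networks are updated less often than the critics.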
Model-Based RL
- World Models. Ha & Schmidhuber, 2018 (the NeurIPS 2018 version is titled Recurrent World Models Facilitate Policy Evolution). VAE + RNN world model for learning in imagination.
- When to Trust Your Model: Model-Based Policy Optimization. Janner et al., NeurIPS 2019. MBPO: principled short-horizon model rollouts for policy optimization.
- Dream to Control: Learning Behaviors by Latent Imagination. Hafner et al., ICLR 2020. Dreamer: RSSM world model with imagination-based actor-critic.
- Mastering Diverse Domains through World Models. Hafner et al., 2023. (essential) DreamerV3: single hyperparameter set works across many domains.
- Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Schrittwieser et al., Nature 2020. (essential) MuZero: learned model + MCTS, no observation prediction needed.
Offline RL
- Off-Policy Deep Reinforcement Learning without Exploration. Fujimoto et al., ICML 2019. BCQ: first to clearly identify and address extrapolation error in offline RL.
- Conservative Q-Learning for Offline Reinforcement Learning. Kumar et al., NeurIPS 2020. (essential) CQL: pessimistic value function; learns a conservative Q-function that is shown to lower-bound the policy's true value.
- Offline Reinforcement Learning with Implicit Q-Learning. Kostrikov et al., ICLR 2022. IQL: avoids OOD action queries entirely via expectile regression.
- Decision Transformer: Reinforcement Learning via Sequence Modeling. Chen et al., NeurIPS 2021. RL as sequence prediction with Transformers, conditioned on desired returns.
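The expectile regression mentioned in the IQL entry above comes down to an asymmetric squared loss. The sketch below shows that loss and a value-function loss built from it; the function names and the default `tau=0.7` are assumptions for illustration, and the inputs are plain lists of Q-values and V-values evaluated on dataset transitions.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff ** 2."""
    # Positive differences are weighted by tau, negative ones by 1 - tau.
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def iql_value_loss(q_values, v_values, tau=0.7):
    """Mean expectile loss of Q(s, a) - V(s) over dataset transitions."""
    losses = [expectile_loss(q - v, tau) for q, v in zip(q_values, v_values)]
    return sum(losses) / len(losses)
```

With `tau` above 0.5, cases where Q exceeds V are penalized more heavily, so V is pushed toward an upper expectile of the Q-values of actions that actually appear in the data. This is how the value estimate can approximate a maximum over in-distribution actions without evaluating Q on unseen ones. Setting `tau=0.5` gives an ordinary (scaled) squared error.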
Exploration
- Curiosity-driven Exploration by Self-Supervised Prediction. Pathak et al., ICML 2017. ICM: intrinsic reward from prediction error in feature space.
- Exploration by Random Network Distillation. Burda et al., ICLR 2019. RND: simple and effective exploration via random network prediction error.
- Go-Explore: A New Approach for Hard-Exploration Problems. Ecoffet et al., 2019. Archive-based exploration for extremely sparse reward environments. A later version appeared in Nature (2021) under the title First return, then explore.
Multi-Agent RL
- Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Lowe et al., NeurIPS 2017. MADDPG: centralized training, decentralized execution.
- QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. Rashid et al., ICML 2018. Value decomposition for cooperative multi-agent RL.
- The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. Yu et al., NeurIPS 2022. MAPPO: simple PPO adaptation works remarkably well for multi-agent tasks.
RL for Real-World Applications
- Sim-to-Real: Learning Agile Locomotion For Quadruped Robots. Tan et al., RSS 2018. Early demonstration of sim-to-real transfer for robot locomotion.
- Learning Agile and Dynamic Motor Skills for Legged Robots. Hwangbo et al., Science Robotics 2019. Actuator network + RL for ANYmal locomotion.
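- Learning Dexterous In-Hand Manipulation. OpenAI et al., IJRR 2020. Dactyl: reorientation of a block with a Shadow robot hand, trained in simulation with extensive domain randomization. The Rubik's cube result comes from the follow-up, Solving Rubik's Cube with a Robot Hand (OpenAI et al., 2019).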
Surveys and Tutorials
- An Introduction to Deep Reinforcement Learning. François-Lavet et al., 2018. Comprehensive introduction to deep RL methods.
- Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. Levine et al., 2020. Tutorial and survey of offline RL methods and their open problems.
- Model-based Reinforcement Learning: A Survey. Moerland et al., 2023. Comprehensive survey of model-based approaches.