Key Papers in Deep RL

A curated reading list of foundational and influential papers in deep reinforcement learning, organized by topic. For each paper, we provide a brief description of its contribution.

How to Use This List

Start with papers marked with (essential) — these are foundational works that every RL researcher should read. Then explore specific areas based on your research interests. Papers are listed roughly in chronological order within each section.

Foundations

  • Playing Atari with Deep Reinforcement Learning — Mnih et al., 2013. (essential) The original DQN paper. Demonstrated that a single architecture can learn to play Atari games from raw pixels.

  • Human-level Control through Deep Reinforcement Learning — Mnih et al., Nature 2015. (essential) The Nature DQN paper. Adds a target network to the experience-replay DQN of the 2013 paper and evaluates it on 49 Atari games.

  • Policy Gradient Methods for Reinforcement Learning with Function Approximation — Sutton et al., NeurIPS 1999. (essential) The policy gradient theorem.

Policy Optimization

  • Trust Region Policy Optimization — Schulman et al., ICML 2015. (essential) KL-constrained policy updates, motivated by a theoretical monotonic-improvement guarantee that the practical algorithm approximates.

  • High-Dimensional Continuous Control Using Generalized Advantage Estimation — Schulman et al., ICLR 2016. (essential) GAE — the standard advantage estimation technique used by PPO and many others.

  • Proximal Policy Optimization Algorithms — Schulman et al., 2017. (essential) PPO — the de facto standard on-policy algorithm. Clipped surrogate objective.
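The GAE and PPO entries above name two computations that are easy to state concretely. What follows is a minimal sketch in plain Python, not the papers' reference code. The function names, default hyperparameters, and the assumption of a single trajectory with no termination inside it are illustrative choices.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. one extra bootstrap value at the end.
    Assumes no episode termination inside the trajectory (illustrative).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s).
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With lam=0 the estimate reduces to the one-step TD residual, and with lam=1 it becomes the discounted return minus the value baseline. The clipped objective stops rewarding a ratio once it has moved past the clip range in the direction the advantage favors.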

Value-Based

  • Deep Reinforcement Learning with Double Q-learning — van Hasselt et al., AAAI 2016. Fixes DQN overestimation with double Q-learning.

  • Prioritized Experience Replay — Schaul et al., ICLR 2016. Samples replay transitions with probability based on the magnitude of their TD error, using importance-sampling weights to correct the resulting bias.

  • Rainbow: Combining Improvements in Deep Reinforcement Learning — Hessel et al., AAAI 2018. Combines 6 DQN improvements into one agent.

  • A Distributional Perspective on Reinforcement Learning — Bellemare et al., ICML 2017. Learn the full return distribution (C51), not just the mean.
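The difference between the standard DQN target and the double Q-learning target listed above comes down to which network selects the next action. A minimal sketch for a single transition, assuming Q-values are given as plain lists indexed by action (the function and argument names are illustrative):

```python
def dqn_target(reward, done, next_q_target, gamma=0.99):
    """Standard DQN target: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

Because the maximizing action is chosen by one network and scored by the other, a single network's overestimate of one action is less likely to propagate into the target.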

Actor-Critic and Off-Policy

  • Asynchronous Methods for Deep Reinforcement Learning — Mnih et al., ICML 2016. A3C — parallel actor-critic training.

  • Continuous Control with Deep Reinforcement Learning — Lillicrap et al., ICLR 2016. (essential) DDPG — extends DQN to continuous actions with deterministic policy gradient.

  • Addressing Function Approximation Error in Actor-Critic Methods — Fujimoto et al., ICML 2018. (essential) TD3 — twin critics, delayed updates, target smoothing.

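  • Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor — Haarnoja et al., ICML 2018. (essential) SAC — maximum entropy RL with a stochastic actor. Automatic temperature tuning was introduced in the follow-up, Soft Actor-Critic Algorithms and Applications (Haarnoja et al., 2018).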
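The TD3 entry above lists twin critics and target smoothing, both of which show up in how the critic target is formed. The sketch below assumes a scalar action and takes the target actor and target critics as plain callables. These names, the scalar simplification, and the default hyperparameters are illustrative assumptions rather than the authors' code. Delayed policy updates are a scheduling detail and are not shown.

```python
import random


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """Critic target for one transition with a scalar action (illustrative)."""
    if done:
        return reward
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = max(-noise_clip, min(rng.gauss(0.0, noise_std), noise_clip))
    action = target_actor(next_state) + noise
    action = max(-max_action, min(action, max_action))
    # Clipped double Q-learning: take the smaller of the two target critics.
    next_q = min(target_q1(next_state, action), target_q2(next_state, action))
    return reward + gamma * next_q
```

Taking the minimum of the two critics biases the target downward, which counteracts the overestimation that a single critic tends to accumulate.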

Model-Based RL

  • World Models — Ha & Schmidhuber, 2018. VAE + RNN world model for learning in imagination. The NeurIPS 2018 version is titled Recurrent World Models Facilitate Policy Evolution.

  • When to Trust Your Model: Model-Based Policy Optimization — Janner et al., NeurIPS 2019. MBPO — principled short-horizon model rollouts for policy optimization.

  • Dream to Control: Learning Behaviors by Latent Imagination — Hafner et al., ICLR 2020. Dreamer — RSSM world model with imagination-based actor-critic.

  • Mastering Diverse Domains through World Models — Hafner et al., 2023. (essential) DreamerV3 — single hyperparameter set works across many domains.

  • Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model — Schrittwieser et al., Nature 2020. (essential) MuZero — learned model + MCTS, no observation prediction needed.

Offline RL

  • Off-Policy Deep Reinforcement Learning without Exploration — Fujimoto et al., ICML 2019. BCQ — first to clearly identify and address extrapolation error in offline RL.

  • Conservative Q-Learning for Offline Reinforcement Learning — Kumar et al., NeurIPS 2020. (essential) CQL — learns a pessimistic Q-function that is shown to lower-bound the value of the learned policy.

  • Offline Reinforcement Learning with Implicit Q-Learning — Kostrikov et al., ICLR 2022. IQL — avoids OOD action queries entirely via expectile regression.

  • Decision Transformer: Reinforcement Learning via Sequence Modeling — Chen et al., NeurIPS 2021. RL as sequence prediction with Transformers, conditioned on desired returns.
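The expectile regression mentioned in the IQL entry above is an asymmetric squared loss. The sketch below shows the loss and a toy gradient-descent fit of a single scalar to a list of samples. The helper names and the toy fitting loop are illustrative assumptions and are not taken from the paper's code.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    In IQL, diff would be Q(s, a) - V(s) for dataset actions.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(samples, tau=0.7, lr=0.1, steps=500):
    """Toy example: fit one scalar v to samples under the expectile loss."""
    v = sum(samples) / len(samples)
    for _ in range(steps):
        grad = 0.0
        for q in samples:
            diff = q - v
            weight = tau if diff > 0 else 1.0 - tau
            # d/dv of weight * (q - v)**2
            grad += -2.0 * weight * diff
        v -= lr * grad / len(samples)
    return v
```

With tau=0.5 the fit recovers the sample mean. As tau approaches 1 the fitted value moves toward the largest sample, which is how the value function can approximate a maximum over in-dataset actions without evaluating actions outside the dataset.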

Exploration

  • Curiosity-driven Exploration by Self-Supervised Prediction — Pathak et al., ICML 2017. ICM — intrinsic reward from prediction error in feature space.

  • Exploration by Random Network Distillation — Burda et al., ICLR 2019. RND — simple and effective exploration via random network prediction error.

  • Go-Explore: A New Approach for Hard-Exploration Problems — Ecoffet et al., 2019. Archive-based exploration for extremely sparse reward environments. The Nature 2021 version is titled First return, then explore.

Multi-Agent RL

  • Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments — Lowe et al., NeurIPS 2017. MADDPG — centralized training, decentralized execution.

  • QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning — Rashid et al., ICML 2018. Value decomposition for cooperative multi-agent RL.

  • The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games — Yu et al., NeurIPS 2022. MAPPO — simple PPO adaptation works remarkably well for multi-agent tasks.

RL for Real-World Applications

  • Sim-to-Real: Learning Agile Locomotion For Quadruped Robots — Tan et al., RSS 2018. Early demonstration of sim-to-real transfer for robot locomotion.

  • Learning Agile and Dynamic Motor Skills for Legged Robots — Hwangbo et al., Science Robotics 2019. Actuator network + RL for ANYmal locomotion.

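  • Learning Dexterous In-Hand Manipulation — OpenAI et al., IJRR 2020. Reorienting a block with a five-fingered robot hand, trained entirely in simulation with extensive domain randomization. The Rubik's cube result appears in the follow-up, Solving Rubik's Cube with a Robot Hand (OpenAI et al., 2019).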

Surveys and Tutorials

  • An Introduction to Deep Reinforcement Learning — François-Lavet et al., 2018. Comprehensive introduction to deep RL methods.

  • Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems — Levine et al., 2020. Tutorial-style survey of offline RL methods and their open problems.

  • Model-based Reinforcement Learning: A Survey — Moerland et al., 2023. Comprehensive survey of model-based approaches.