Reinforcement Learning Papers

Papers related to reinforcement learning (we mainly focus on the single-agent setting).

Because so many new reinforcement learning papers appear at conferences every year, we list only those we have read and found insightful.

We have added some ICLR22, ICML22, NeurIPS22, ICLR23, ICML23, NeurIPS23, ICLR24, and ICML24 papers on RL.

Model Free (Online) RL
Model Based (Online) RL
(Model Free) Offline RL
- Current methods
- Combined with Diffusion Models
Model Based Offline RL
Meta RL
Adversarial RL
Generalisation in RL
- Environments
- Methods
RL with Transformer/LLM
Tutorial and Lesson
ICLR22
ICML22
NeurIPS22
ICLR23
ICML23
NeurIPS23
ICLR24
ICML24

Model Free (Online) RL

Classic Methods

Title	Method	Conference	on/off policy	Action Space	Policy	Description
Human-level control through deep reinforcement learning, [other link]	DQN	Nature15	off	Discrete	based on value function	use a deep neural network to approximate the q function in q-learning and reach human-level performance on Atari games; two main tricks: a replay buffer to improve sample efficiency, and a target network decoupled from the behavior network
Deep reinforcement learning with double q-learning	Double DQN	AAAI16	off	Discrete	based on value function	show that the q function in DQN can overestimate action values; decouple action selection from action evaluation by using two neural networks
Dueling network architectures for deep reinforcement learning	Dueling DQN	ICML16	off	Discrete	based on value function	split one neural network into two streams that estimate the state value function and the advantage function, then combine them to obtain the q function
Prioritized Experience Replay	Priority Sampling	ICLR16	off	Discrete	based on value function	sample transitions from the replay buffer with different priorities (e.g., according to TD error)
Rainbow: Combining Improvements in Deep Reinforcement Learning	Rainbow	AAAI18	off	Discrete	based on value function	combine different improvements to DQN: Double DQN, Dueling DQN, Priority Sampling, Multi-step learning, Distributional RL, Noisy Nets
Policy Gradient Methods for Reinforcement Learning with Function Approximation	PG	NeurIPS99	on/off	Continuous or Discrete	function approximation	propose the Policy Gradient Theorem: how to calculate the gradient of the expected cumulative return with respect to the policy parameters
----	AC/A2C	----	on/off	Continuous or Discrete	parameterized neural network	AC: replace the return in PG with q function approximator to reduce variance; A2C: replace the q function in AC with advantage function to reduce variance
Asynchronous Methods for Deep Reinforcement Learning	A3C	ICML16	on/off	Continuous or Discrete	parameterized neural network	propose three tricks to improve performance: (i) use multiple agents that interact with the environment in parallel and asynchronously; (ii) share network parameters between the value function and the policy; (iii) modify the loss function (MSE of value function + pg loss + policy entropy)
Trust Region Policy Optimization	TRPO	ICML15	on	Continuous or Discrete	parameterized neural network	introduce trust region to policy optimization for guaranteed monotonic improvement
Proximal Policy Optimization Algorithms	PPO	arxiv17	on	Continuous or Discrete	parameterized neural network	replace the hard constraint of TRPO with a clipped surrogate objective that bounds the probability ratio between the new and old policies
Deterministic Policy Gradient Algorithms	DPG	ICML14	off	Continuous	function approximation	consider deterministic policy for continuous action space and prove Deterministic Policy Gradient Theorem; use a stochastic behaviour policy for encouraging exploration
Continuous Control with Deep Reinforcement Learning	DDPG	ICLR16	off	Continuous	parameterized neural network	adapt the ideas of DQN to DPG: (i) deep neural network function approximators, (ii) replay buffer, (iii) target networks that slowly track the learned networks (soft updates)
Addressing Function Approximation Error in Actor-Critic Methods	TD3	ICML18	off	Continuous	parameterized neural network	adapt the ideas of Double DQN to DDPG: taking the minimum value between a pair of critics to limit overestimation
Reinforcement Learning with Deep Energy-Based Policies	SQL	ICML17	off	mainly Continuous	parameterized neural network	consider max-entropy RL and propose soft q iteration as well as soft q learning
Soft Actor-Critic Algorithms and Applications, Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, [appendix]	SAC	ICML18	off	mainly Continuous	parameterized neural network	build on the theoretical analysis of SQL and extend soft q iteration (soft q evaluation + soft q improvement); reparameterize the policy and use two parameterized value functions; propose SAC
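
The descriptions of DQN, Double DQN, and TD3 in the table above differ mainly in how the bootstrapped target value is formed. The following is a minimal sketch of those three targets in plain Python, not code from any of the papers. The function names, the argument layout (lists of per-action q values for the next state), and the `done` flag are assumptions made for illustration.

```python
from typing import Sequence


def dqn_target(reward: float, gamma: float,
               next_q_target: Sequence[float], done: bool) -> float:
    """DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward: float, gamma: float,
                      next_q_online: Sequence[float],
                      next_q_target: Sequence[float], done: bool) -> float:
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def clipped_double_q_target(reward: float, gamma: float,
                            next_q1: float, next_q2: float, done: bool) -> float:
    """TD3-style: take the minimum of two critics' estimates of the next value."""
    if done:
        return reward
    return reward + gamma * min(next_q1, next_q2)


if __name__ == "__main__":
    # Hypothetical q values for a next state with three discrete actions.
    q_online = [1.0, 2.0, 0.5]
    q_target = [3.0, 1.5, 0.2]
    print(dqn_target(1.0, 0.5, q_target, False))                   # 2.5
    print(double_dqn_target(1.0, 0.5, q_online, q_target, False))  # 1.75
```

In this toy example the Double DQN target is lower than the DQN target because the action is chosen by one network and scored by the other, which is the mechanism the table describes for limiting overestimation.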

Exploration

Title	Method	Conference	Description
Curiosity-driven Exploration by Self-supervised Prediction	ICM	ICML17	propose that curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills when rewards are sparse; formulate curiosity as the error in an agent’s ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model
Curiosity-Driven Exploration via Latent Bayesian Surprise	LBS	AAAI22	apply Bayesian surprise in a latent space representing the agent’s current understanding of the dynamics of the system
Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning	AIRS	ICML23	select shaping function from a predefined set based on the estimated task return in real-time, providing reliable exploration incentives and alleviating the biased objective problem; develop a toolkit that provides high-quality implementations of various intrinsic reward modules based on PyTorch
Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments	Curiosity in Hindsight	ICML23	consider exploration in stochastic environments; learn representations of the future that capture precisely the unpredictable aspects of each outcome and use them as additional input for predictions, so that intrinsic rewards only reflect the predictable aspects of world dynamics
Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration		NeurIPS23 spotlight
MIMEx: Intrinsic Rewards from Masked Input Modeling	MIMEx	NeurIPS23	propose that the mask distribution can be flexibly tuned to control the difficulty of the underlying conditional prediction task
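
Several rows in the table above treat prediction error as an intrinsic reward. As a rough illustration of the ICM-style idea (curiosity as the error in predicting the next state's features), here is a minimal sketch in plain Python. It assumes the feature encoder and forward model already exist and simply takes their outputs as lists. The names `curiosity_reward`, `total_reward`, `eta`, and `beta` are hypothetical, and the weighted sum in `total_reward` is an assumed way of mixing rewards, not a detail taken from the papers.

```python
from typing import Sequence


def curiosity_reward(predicted_features: Sequence[float],
                     next_features: Sequence[float],
                     eta: float = 1.0) -> float:
    """Intrinsic reward = (eta / 2) * squared error of the feature prediction."""
    if len(predicted_features) != len(next_features):
        raise ValueError("feature vectors must have the same length")
    squared_error = sum((p - t) ** 2
                        for p, t in zip(predicted_features, next_features))
    return 0.5 * eta * squared_error


def total_reward(extrinsic: float, intrinsic: float, beta: float = 0.1) -> float:
    """Combine task reward and curiosity bonus (beta is an assumed weight)."""
    return extrinsic + beta * intrinsic


if __name__ == "__main__":
    # A poorly predicted transition earns a larger bonus than a well-predicted one.
    surprising = curiosity_reward([1.0, 2.0], [1.0, 4.0])
    familiar = curiosity_reward([0.5, 0.5], [0.5, 0.5])
    print(surprising, familiar)  # 2.0 0.0
```

With sparse extrinsic rewards, the bonus is the only non-zero learning signal for most transitions, which is why the agent is driven toward states its model predicts poorly.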

Representation Learning

Note: representation learning with MBRL is listed in the World Models part.

Title	Method	Conference	Description
CURL: Contrastive Unsupervised Representations for Reinforcement Learning	CURL	ICML20	extract high-level features from raw pixels using contrastive learning and perform off-policy control on top of the extracted features
Learning Invariant Representations for Reinforcement Learning without Reconstruction	DBC	ICLR21	propose using Bisimulation to learn robust latent representations which encode only the task-relevant information from observations
Reinforcement Learning with Prototypical Representations	Proto-RL	ICML21	pre-train task-agnostic representations and prototypes on environments without downstream task information
Understanding the World Through Action	----	CoRL21	discuss how self-supervised reinforcement learning combined with offline RL can enable scalable representation learning
Flow-based Recurrent Belief State Learning for POMDPs	FORBES	ICML22	incorporate normalizing flows into the variational inference to learn general continuous belief states for POMDPs
Contrastive Learning as Goal-Conditioned Reinforcement Learning	Contrastive RL	NeurIPS22	show (contrastive) representation learning methods can be cast as RL algorithms in their own right
Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?	----	NeurIPS22	conduct an extensive comparison of various self-supervised losses under the existing joint learning framework for pixel-based reinforcement learning in many environments from different benchmarks, including one real-world environment
Reinforcement Learning with Automated Auxiliary Loss Search	A2LS	NeurIPS22	propose to automatically search top-performing auxiliary loss functions for learning better representations in RL; define a general auxiliary loss space of size 7.5 × 10^20 based on the collected trajectory data and explore the space with an efficient evolutionary search strategy
Mask-based Latent Reconstruction for Reinforcement Learning	MLR	NeurIPS22	propose an effective self-supervised method to predict complete state representations in the latent space from the observations with spatially and temporally masked pixels
Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training	VIP	ICLR23 Spotlight	cast representation learning from human videos as an offline goal-conditioned reinforcement learning problem; derive a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos
Latent Variable Representation for Reinforcement Learning	----	ICLR23	provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration
Spectral Decomposition Representation for Reinforcement Learning		ICLR23
Become a Proficient Player with Limited Data through Watching Pure Videos	FICC	ICLR23	consider the setting where the pre-training data are action-free videos; introduce a two-phase training pipeline; pre-training phase: implicitly extract the hidden action embedding from videos and pre-train the visual representation and the environment dynamics network based on vector quantization; downstream tasks: fine-tune with a small amount of task data based on the learned models
Bootstrapped Representations in Reinforcement Learning	----	ICML23	provide a theoretical characterization of the state representation learnt by temporal difference learning; find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting
[Representation-Driven Reinforcement