Awesome Exploration Methods in Reinforcement Learning

Updated on 2024.06.12

This is a collection of research papers on Exploration methods in Reinforcement Learning (ERL). The repository will be continuously updated to track the frontier of ERL. Feel free to follow and star!
Balancing exploration and exploitation is one of the central problems in reinforcement learning. To give readers an intuitive feel for exploration, we provide a visualization of a typical hard-exploration environment in MiniGrid below. In this task, reaching the goal often requires a sequence of dozens or even hundreds of steps, during which the agent must thoroughly explore the state-action space to learn the skills needed to reach the goal.

<p align="center"> <img src="./assets/minigrid_hard_exploration.png" alt="minigrid_hard_exploration" width="40%" height="40%" /><br> <em style="display: inline-block;">A typical hard-exploration environment: MiniGrid-ObstructedMaze-Full-v0.</em> </p>

A Taxonomy of Exploration RL Methods
Papers
Contributing

A Taxonomy of Exploration RL Methods

<details open> <summary>(Click to Collapse)</summary>

In general, we can divide the reinforcement learning process into two phases: the collect phase and the train phase. In the collect phase, the agent chooses actions based on the current policy and interacts with the environment to collect useful experience. In the train phase, the agent uses the collected experience to update the current policy and obtain a better-performing policy.
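The two phases can be sketched as a minimal loop. This is an illustrative toy only: the five-state chain environment, the tabular Q-learning update, the epsilon-greedy policy, and all names (`chain_step`, `make_policy`, `collect`, `train`) are assumptions made for this example and are not taken from any paper listed here.

```python
import random

N_STATES = 5  # hypothetical chain: states 0..4, reward only on reaching state 4

def chain_step(state, action):
    """Toy environment: action 0 moves left, action 1 moves right."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def make_policy(q, epsilon=0.2):
    """Current policy: epsilon-greedy over the Q-table, ties broken at random."""
    def policy(state):
        if random.random() < epsilon:
            return random.randrange(len(q[state]))
        best = max(q[state])
        return random.choice([a for a, v in enumerate(q[state]) if v == best])
    return policy

def collect(policy, n_steps):
    """Collect phase: act with the current policy and store transitions."""
    buffer, state = [], 0
    for _ in range(n_steps):
        action = policy(state)
        next_state, reward, done = chain_step(state, action)
        buffer.append((state, action, reward, next_state, done))
        state = 0 if done else next_state  # restart the episode at state 0
    return buffer

def train(q, buffer, lr=0.5, gamma=0.9):
    """Train phase: one tabular Q-learning sweep over the collected data."""
    for s, a, r, s2, done in buffer:
        target = r if done else r + gamma * max(q[s2])
        q[s][a] += lr * (target - q[s][a])
    return q

random.seed(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(50):                              # alternate the two phases
    buffer = collect(make_policy(q), n_steps=100)
    q = train(q, buffer)
```

In this sketch the exploration component lives entirely in `make_policy`, that is, in the collect phase; a method that shapes the learning signal instead would modify `train`.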

According to the phase in which the exploration component is explicitly applied, we divide the methods in Exploration RL into two main categories: Augmented Collecting Strategy and Augmented Training Strategy.

Augmented Collecting Strategy represents a variety of different exploration strategies commonly used in the collect phase, which we further divide into four categories:
- Action Selection Perturbation
- Action Selection Guidance
- State Selection Guidance
- Parameter Space Perturbation
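As a rough illustration of the first category, Action Selection Perturbation, the sketch below shows two common ways of perturbing the action chosen by a policy during collection: Boltzmann (softmax) sampling for discrete actions and additive Gaussian noise for continuous actions. These two examples are my own choice and are not named in the list above; the function names and default parameters are hypothetical.

```python
import math
import random

def boltzmann_action(q_values, temperature=1.0, rng=random):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]

def gaussian_noise_action(mean_action, sigma=0.1, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to a continuous action, then clip to bounds."""
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in mean_action]

random.seed(0)
discrete = boltzmann_action([0.0, 1.0, 0.0], temperature=0.5)
continuous = gaussian_noise_action([0.0, 0.95, -0.95], sigma=0.2)
```

A high temperature (or large `sigma`) pushes behavior toward uniform random actions, while a low value approaches the greedy action.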
Augmented Training Strategy represents a variety of different exploration strategies commonly used in the train phase, which we further divide into seven categories:
- Count Based
- Prediction Based
- Information Theory Based
- Entropy Augmented
- Bayesian Posterior Based
- Goal Based
- (Expert) Demo Data
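To make the first category, Count Based, more concrete, here is a minimal sketch of a count-based intrinsic bonus for discrete (hashable) states. It assumes the commonly used form `beta / sqrt(N(s))`; the class name `CountBonus` and the coefficient `beta` are hypothetical, and individual papers differ in how they count or approximate visits.

```python
import math
from collections import defaultdict

class CountBonus:
    """Intrinsic reward that shrinks as a state is visited more often."""

    def __init__(self, beta=0.1):
        self.beta = beta                 # assumed bonus scale
        self.counts = defaultdict(int)   # visit count N(s) per state

    def __call__(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

bonus = CountBonus(beta=1.0)
first = bonus("s0")   # first visit to s0
again = bonus("s0")   # second visit to s0, smaller bonus
novel = bonus("s1")   # first visit to a different state

# In the train phase the bonus would be added to the environment reward,
# e.g. total_reward = extrinsic_reward + bonus(state).
```

Rarely visited states keep a large bonus, so a policy trained on the augmented reward is pulled toward them.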

Note that these categories may overlap, and an algorithm may belong to several of them. For other detailed surveys of exploration methods in RL, see Tianpei Yang et al. and Susan Amin et al.

<center> <figure> <img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="./assets/erl_taxonomy.png" width=100% height=100%> <br> <figcaption align = "center"><b>A non-exhaustive but useful taxonomy of methods in Exploration RL. Example methods for each category are shown in the blue areas above. </b></figcaption> </figure> </center>

Here are the links to the papers that appeared in the taxonomy:

[1] Go-Explore: Adrien Ecoffet et al, 2021
[2] NoisyNet: Meire Fortunato et al, 2018
[3] DQN-PixelCNN: Marc G. Bellemare et al, 2016
[4] #Exploration: Haoran Tang et al, 2017
[5] EX2: Justin Fu et al, 2017
[6] ICM: Deepak Pathak et al, 2018
[7] RND: Yuri Burda et al, 2018
[8] NGU: Adrià Puigdomènech Badia et al, 2020
[9] Agent57: Adrià Puigdomènech Badia et al, 2020
[10] VIME: Rein Houthooft et al, 2016
[11] EMI: Wang et al, 2019
[12] DIAYN: Benjamin Eysenbach et al, 2019
[13] SAC: Tuomas Haarnoja et al, 2018
[14] BootstrappedDQN: Ian Osband et al, 2016
[15] PSRL: Ian Osband et al, 2013
[16] HER: Marcin Andrychowicz et al, 2017
[17] DQfD: Todd Hester et al, 2018
[18] R2D3: Caglar Gulcehre et al, 2019

</details>

Papers

format:
- [title](paper link) (presentation type, openreview score [if the score is public])
  - author1, author2, author3, ...
  - Key: key problems and insights
  - ExpEnv: experiment environments

ICLR 2024

<details open> <summary>(Click to Collapse)</summary>

Unlocking the Power of Representations in Long-term Novelty-based Exploration
- Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, Bilal Piot
- Key: Robust Exploration via Clustering-based Online Density Estimation
- ExpEnv: Atari, DM-HARD-8
A Theoretical Explanation of Deep RL Performance in Stochastic Environments
- Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan
- Key: Stochastic Environments, effective horizon, RL theory, instance-dependent bounds, empirical validation of theory
- ExpEnv: BRIDGE
DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization
- Guowei Xu, Ruijie Zheng, Yongyuan Liang, Xiyao Wang, Zhecheng Yuan, Tianying Ji, Yu Luo, Xiaoyu Liu, Jiaxin Yuan, Pu Hua, Shuzhen Li, Yanjie Ze, Hal Daumé III, Furong Huang, Huazhe Xu
- Key: Visual RL, Dormant Ratio Minimization, Exploration
- ExpEnv: DeepMind Control Suite, MetaWorld, and Adroit
METRA: Scalable Unsupervised RL with Metric-Aware Abstraction
- Seohong Park, Oleh Rybkin, Sergey Levine
- Key: unsupervised RL, metric-aware abstraction, scalable exploration
- ExpEnv: state-based Ant and HalfCheetah, Kitchen
Text2Reward: Reward Shaping with Language Models for Reinforcement Learning
- Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, Tao Yu
- Key: reward shaping, language models, text-based reward shaping
- ExpEnv: MuJoCo, ManiSkill2, MetaWorld
Pre-Training Goal-based Models for Sample-Efficient Reinforcement Learning
- Haoqi Yuan, Zhancun Mu, Feiyang Xie, Zongqing Lu
- Key: goal-based models, pre-training, sample efficiency
- ExpEnv: Kitchen, Minecraft
Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning
- Hyungho Na, Yunkyeong Seo, Il-chul Moon
- Key: episodic memory, cooperative multi-agent, efficient utilization
- ExpEnv: StarCraft II and Google Research Football
Simple Hierarchical Planning with Diffusion
- Chang Chen, Fei Deng, Kenji Kawaguchi, Caglar Gulcehre, Sungjin Ahn
- Key: hierarchical planning, diffusion, exploration
- ExpEnv: Maze2D and AntMaze
Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks
- Ziping Xu, Zifan Xu, Runxuan Jiang, Peter Stone, Ambuj Tewari
- Key: myopic exploration, multitask reinforcement learning, diverse tasks
- ExpEnv: synthetic robotic control environment
PAE: Reinforcement Learning from External Knowledge for Efficient Exploration
- Zhe Wu, Haofei Lu, Junliang Xing, You Wu, Renye Yan, Yaozhong Gan, Yuanchun Shi
- Key: external knowledge, efficient exploration, reinforcement learning
- ExpEnv: BabyAI and MiniHack
In-context Exploration-Exploitation for Reinforcement Learning
- Zhenwen Dai, Federico Tomasi, Sina Ghiassian
- Key: in-context exploration-exploitation, reinforcement learning, exploration-exploitation trade-off
- ExpEnv: Dark Room, Dark Key-to-Door, Dark Room (Biased).
Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining
- Licong Lin, Yu Bai, Song Mei
- Key: transformers, decision makers, in-context reinforcement learning
- ExpEnv: Linear bandit, Bernoulli bandits.
Learning to Act without Actions
- Dominik Schmidt, Minqi Jiang
- Key: recovering latent action information, video, pre-training
- ExpEnv: Procgen
Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning
- Mingde Zhao, Safa Alver, Harm van Seijen, Romain Laroche, Doina Precup, Yoshua Bengio
- Key: spatio-temporal abstractions, hierarchical planning, task/goal decomposition
- ExpEnv: MiniGrid-BabyAI

</details>

NeurIPS 2023

<details open> <summary>(Click to Collapse)</summary>

Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration
- Zhihan Liu, Miao Lu, Wei Xiong, Han Zhong, Hao Hu, Shenao Zhang, Sirui Zheng, Zhuoran Yang, Zhaoran Wang
- Key: a single objective that integrates the estimation and planning components, balancing exploration and exploitation automatically, sublinear regret
- ExpEnv: MuJoCo with sparse reward
On the Importance of Exploration for Generalization in Reinforcement Learning
- Yiding Jiang, J Zico Kolter, Roberta Raileanu
- Key: exploration, generalization, Exploration via Distributional Ensemble
- ExpEnv: tabular contextual MDP, Procgen and Crafter
Monte Carlo Tree Search with Boltzmann Exploration
- Michael Painter, Mohamed Baioumy, Nick Hawes, Bruno Lacerda
- Key: Boltzmann exploration with MCTS, optimal actions for the maximum entropy objective do not necessarily correspond to optimal actions for the original objective, two improved algorithms.
- ExpEnv: the Frozen Lake environment, the Sailing Problem, Go
Breadcrumbs to the Goal: Supervised Goal Selection from Human-in-the-Loop Feedback
- Marcel Torne Villasevil, Max Balsells I Pamies, Zihan Wang, Samedh Desai, Tao Chen, Pulkit Agrawal, Abhishek Gupta
- Key: human-in-the-loop feedback, bifurcating human feedback and policy learning
- ExpEnv: Bandu, Block Stacking, Kitchen, and Pusher, Four rooms and Maze
MIMEx: Intrinsic Rewards from Masked Input Modeling
- Toru Lin, Allan Jabri
- Key: pseudo-likelihood estimation with different mask distributions
- ExpEnv: PixMC-Sparse, DeepMind Control suite
Accelerating Exploration with Unlabeled Prior Data
- Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, Sergey Levine
- Key: prior data without reward labels, learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards
- ExpEnv: AntMaze domain, Adroit hand manipulation domain, and a visual simulated robotic manipulation domain.
On the Convergence and Sample Complexity Analysis of Deep Q-Networks with ε-Greedy Exploration
- Shuai Zhang, Hongkang Li, Meng Wang, Miao Liu, Pin-Yu Chen, Songtao Lu, Sijia Liu, Keerthiram Murugesan, Subhajit Chaudhury
- Key: ε-greedy exploration, convergence, sample complexity
- ExpEnv: Numerical Experiments
Pitfall of Optimism: Distributional Reinforcement Learning by Randomizing Risk Criterion
- Taehyun Cho, Seungyub Han, Heesoo Lee, Kyungjae Lee, Jungwoo Lee
- Key: distributional reinforcement learning, randomizing risk criterion, optimistic exploration
- ExpEnv: Atari 55 games.
CQM: Curriculum Reinforcement Learning with a Quantized World Model
- Seungjae Lee, Daesol Cho, Jonghae Park, H. Jin Kim
- Key: curriculum reinforcement learning, quantized world model
- ExpEnv: PointNMaze
Safe Exploration in Reinforcement Learning: A Generalized Formulation and Algorithms
- Akifumi Wachi, Wataru Hashimoto, Xun Shen, Kazumune Hashimoto
- Key: safe exploration, generalized formulation, safe exploration algorithms, Meta-Algorithm for Safe Exploration
- ExpEnv: grid-world and Safety Gym
Successor-Predecessor Intrinsic Exploration
- Changmin Yu, Neil Burgess, Maneesh Sahani, Samuel J. Gershman
- Key: retrospective