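Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning

Code for the paper Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning. Please refer to our project page for more demonstrations and up-to-date related resources.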

Updates

2023-10-09: We released the code.
2023-09-20: We released the Text2Reward paper and website.

Dependencies

To set up the environment, run the following commands in your shell:

# Set up conda
conda create -n text2reward python=3.7
conda activate text2reward
# Set up the ManiSkill2 environment
cd ManiSkill2
pip install -e .
pip install stable-baselines3==1.8.0 wandb tensorboard
cd ..
cd run_maniskill
bash download_data.sh
# Set up the MetaWorld environment
cd ..
cd Metaworld
pip install -e .
# Set up code generation
pip install langchain chromadb==0.4.0

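Troubleshooting

If you have not installed mujoco yet, please follow the instructions here to install it. Afterwards, run the following commands to confirm that the installation succeeded: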

$ python3
>>> import mujoco_py
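
The same check can be extended to the other packages installed above. The following is a minimal sketch, not part of the repository: the helper name check_importable is hypothetical, and the module names mani_skill2, metaworld, and stable_baselines3 are assumptions based on the install steps. Note that it only tests whether Python can locate each package; it does not import it, so it will not catch a mujoco_py build failure the way the interactive import does.

```python
import importlib.util


def check_importable(module_names):
    """Map each top-level module name to True if Python can locate it."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


if __name__ == "__main__":
    # Module names are assumptions inferred from the setup commands.
    expected = ["mujoco_py", "mani_skill2", "metaworld", "stable_baselines3"]
    for name, found in check_importable(expected).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Any package reported as MISSING points to the corresponding pip install step in the setup commands.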

If you encounter any of the following errors when running ManiSkill2, we recommend reading the documentation here.
- RuntimeError: vk::Instance::enumeratePhysicalDevices: ErrorInitializationFailed
- Some required Vulkan extension is not present. You may not use the renderer to render, however, CPU resources will be still available.
- Segmentation fault (core dumped)

Usage

Reproducing results

To reproduce our experimental results, you can run the following scripts:

ManiSkill2:

bash run_oracle.sh
bash run_zero_shot.sh
bash run_few_shot.sh

It is normal to see the following warnings:

[svulkan2] [error] GLFW error: X11: The DISPLAY environment variable is missing
[svulkan2] [warning] Continue without GLFW.

MetaWorld:

bash run_oracle.sh
bash run_zero_shot.sh

Generating new reward code

First, add the following environment variable to your .bashrc (or .zshrc, etc.).

export PYTHONPATH=$PYTHONPATH:~/path/to/text2reward

Then navigate to the text2reward/code_generation/single_flow directory and run the following scripts:

# Generate reward code for ManiSkill2
bash run_maniskill_zeroshot.sh
bash run_maniskill_fewshot.sh
# Generate reward code for MetaWorld
bash run_metaworld_zeroshot.sh
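
If the generation scripts fail to import project modules, one thing worth checking is whether the PYTHONPATH export has taken effect in the current shell. The sketch below is an illustration only: the helper is_on_path is hypothetical, and ~/path/to/text2reward is the same placeholder used in the export line, to be replaced with your actual checkout location.

```python
import os
import sys


def is_on_path(directory, entries):
    """Return True if `directory` matches any entry after normalization."""
    target = os.path.realpath(os.path.expanduser(directory))
    return any(
        os.path.realpath(os.path.expanduser(entry)) == target
        for entry in entries
        if entry  # skip empty entries, which stand for the current directory
    )


if __name__ == "__main__":
    # Placeholder path copied from the export line; substitute your own.
    print(is_on_path("~/path/to/text2reward", sys.path))
```

If this prints False in a new terminal, the shell startup file has probably not been re-read since the export line was added.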

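Running new experiments

By default, the run_oracle.sh scripts above use the expert-written rewards provided by the environment, while the run_zero_shot.sh and run_few_shot.sh scripts use the rewards generated in our experiments. To run a new experiment with a reward you provide, follow the bash scripts above and change the --reward_path argument to the path of your own reward.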
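
For readers unfamiliar with the term, the sketch below illustrates what a dense reward looks like in general: it gives a graded signal at every step instead of only at success. It is a generic example, not the file format expected by --reward_path and not a reward from our experiments. The function name, arguments, scaling factor, and thresholds are all hypothetical; consult the reward files referenced in the provided bash scripts for the actual format.

```python
import math


def reach_dense_reward(gripper_pos, target_pos,
                       success_threshold=0.02, success_bonus=1.0):
    """Hypothetical dense reward for a reaching task.

    The shaped term rises smoothly toward 1 as the gripper approaches
    the target; a fixed bonus is added once it is within the threshold.
    """
    distance = math.sqrt(
        sum((g - t) ** 2 for g, t in zip(gripper_pos, target_pos))
    )
    reward = 1.0 - math.tanh(5.0 * distance)  # shaped distance term
    if distance < success_threshold:
        reward += success_bonus  # sparse success bonus on top
    return reward


if __name__ == "__main__":
    near = reach_dense_reward((0.1, 0.0, 0.0), (0.0, 0.0, 0.0))
    far = reach_dense_reward((0.5, 0.0, 0.0), (0.0, 0.0, 0.0))
    print(near > far)
```

Because the shaped term decreases monotonically with distance, positions closer to the target always score higher, which is the property that makes such a reward informative early in training.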

Citation

If you find our work helpful, please cite us:

@article{text2reward,
  title={Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning},
  author={Xie, Tianbao and Zhao, Siheng and Wu, Chen Henry and Liu, Yitao and Luo, Qian and Zhong, Victor and Yang, Yanchao and Yu, Tao},
  journal={arXiv preprint arXiv:2309.11489},
  year={2023}
}