⚡ SheepRL 🐑

<p align="center"> <img src="https://yellow-cdn.veclightyear.com/835a84d5/34821b1c-7d81-42bb-9391-1c22d032240c.svg" style="width:40%"> </p> <div align="center"> <table> <tr> <td><img src="https://github.com/Eclectic-Sheep/sheeprl/assets/18405289/6efd09f0-df91-4da0-971d-92e0213b8835" width="200px"></td> <td><img src="https://github.com/Eclectic-Sheep/sheeprl/assets/18405289/dbba57db-6ef5-4db4-9c53-d7b5f303033a" width="200px"></td> <td><img src="https://github.com/Eclectic-Sheep/sheeprl/assets/18405289/3f38e5eb-aadd-4402-a698-695d1f99c048" width="200px"></td> <td><img src="https://github.com/Eclectic-Sheep/sheeprl/assets/18405289/93749119-fe61-44f1-94bb-fdb89c1869b5" width="200px"></td> </tr> </table> </div> <div align="center"> <table> <thead> <tr> <th>Environment</th> <th>Total frames</th> <th>Training time</th> <th>Test reward</th> <th>Paper reward</th> <th>GPU</th> </tr> </thead> <tbody> <tr> <td>Crafter</td> <td>1M</td> <td>1d 3h</td> <td>12.1</td> <td>11.7</td> <td>1-V100</td> </tr> <tr> <td>Atari-MsPacman</td> <td>100K</td> <td>14h</td> <td>1542</td> <td>1327</td> <td>1-3080</td> </tr> <tr> <td>Atari-Boxing</td> <td>100K</td> <td>14h</td> <td>84</td> <td>78</td> <td>1-3080</td> </tr> <tr> <td>DOA++ (w/o optimizations)<sup>1</sup></td> <td>7M</td> <td>18d 22h</td> <td>2726/3328<sup>2</sup></td> <td>N/A</td> <td>1-3080</td> </tr> <tr> <td>Minecraft-Nav (w/o optimizations)</td> <td>8M</td> <td>16d 4h</td> <td>27% >= 70<br>14% >= 100</td> <td>N/A</td> <td>1-V100</td> </tr> </tbody> </table> </div>

1. For comparison: 1M frames took 2d 7h before the optimizations and 1d 5h after them.
2. Best leaderboard score in DIAMBRA (November 7, 2023).

Benchmarks

The training times of our implementations compared with Stable Baselines3 are shown below:

<div align="center"> <table> <thead> <tr> <th colspan="2"></th> <th>SheepRL v0.4.0</th> <th>SheepRL v0.4.9</th> <th>SheepRL v0.5.2<br />(Numpy buffers)</th> <th>SheepRL v0.5.5<br />(Numpy buffers)</th> <th>StableBaselines3<sup>1</sup></th> </tr> </thead> <tbody> <tr> <td rowspan="2"><b>PPO</b></td> <td><i>1 device</i></td> <td>192.31s ± 1.11</td> <td>138.3s ± 0.16</td> <td>80.81s ± 0.68</td> <td>81.27s ± 0.47</td> <td>77.21s ± 0.36</td> </tr> <tr> <td><i>2 devices</i></td> <td>85.42s ± 2.27</td> <td>59.53s ± 0.78</td> <td>46.09s ± 0.59</td> <td>36.88s ± 0.30</td> <td>no data</td> </tr> <tr> <td rowspan="2"><b>A2C</b></td> <td><i>1 device</i></td> <td>no data</td> <td>no data</td> <td>no data</td> <td>84.76s ± 0.37</td> <td>84.22s ± 0.99</td> </tr> <tr> <td><i>2 devices</i></td> <td>no data</td> <td>no data</td> <td>no data</td> <td>28.95s ± 0.75</td> <td>no data</td> </tr> <tr> <td rowspan="2"><b>SAC</b></td> <td><i>1 device</i></td> <td>421.37s ± 5.27</td> <td>363.74s ± 3.44</td> <td>318.06s ± 4.46</td> <td>320.21s ± 6.29</td> <td>336.06s ± 12.26</td> </tr> <tr> <td><i>2 devices</i></td> <td>264.29s ± 1.81</td> <td>238.88s ± 4.97</td> <td>210.07s ± 27</td> <td>225.95s ± 3.65</td> <td>no data</td> </tr> <tr> <td><b>Dreamer V1</b></td> <td><i>1 device</i></td> <td>4201.23s</td> <td>no data</td> <td>2921.38s</td> <td>2207.13s</td> <td>no data</td> </tr> <tr> <td><b>Dreamer V2</b></td> <td><i>1 device</i></td> <td>1874.62s</td> <td>no data</td> <td>1148.1s</td> <td>906.42s</td> <td>no data</td> </tr> <tr> <td><b>Dreamer V3</b></td> <td><i>1 device</i></td> <td>2022.99s</td> <td>no data</td> <td>1378.01s</td> <td>1589.30s</td> <td>no data</td> </tr> </tbody> </table> </div>
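The table can also be read as speedups rather than raw seconds. The snippet below is a small worked example of that arithmetic; the `speedup` helper is a name chosen here for illustration, and the inputs are the mean PPO times from the table.

```python
def speedup(baseline_s: float, improved_s: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    if improved_s <= 0:
        raise ValueError("improved_s must be positive")
    return baseline_s / improved_s


# Mean wall-clock times in seconds, copied from the PPO rows of the table.
ppo_v040_1dev, ppo_v055_1dev, ppo_v055_2dev = 192.31, 81.27, 36.88

print(f"v0.4.0 -> v0.5.5 on 1 device: {speedup(ppo_v040_1dev, ppo_v055_1dev):.2f}x")
print(f"1 -> 2 devices on v0.5.5: {speedup(ppo_v055_1dev, ppo_v055_2dev):.2f}x")
```

Ratios of means ignore the reported standard deviations, so treat them as rough indicators only.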

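[!NOTE]

All experiments were run on Lightning Studio with 4 CPUs. Except for the Dreamer benchmarks, every benchmark was run 5 times, and we report the mean and standard deviation across the runs. We disabled the test function, logging, and checkpointing. In addition, the models were not registered with MLFlow.

The Dreamer benchmarks were run once, with logging and checkpointing enabled but without running the test function.

1. The StableBaselines3 version is v2.2.1; install the package with `pip install stable-baselines3==2.2.1`.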
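For readers who want to report their own timings in the same "mean ± standard deviation" form, here is a minimal sketch. It assumes the sample standard deviation (the benchmark notes do not say which estimator was used), `summarize_runs` is a hypothetical helper, and the timings are made up for illustration.

```python
import statistics


def summarize_runs(durations_s):
    """Return (mean, sample standard deviation) of repeated run times."""
    if len(durations_s) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(durations_s), statistics.stdev(durations_s)


# Made-up timings for five runs, in seconds; these are NOT measured values.
example_runs = [80.1, 81.0, 80.6, 81.4, 80.9]
mean_s, std_s = summarize_runs(example_runs)
print(f"{mean_s:.2f}s ± {std_s:.2f}")
```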

What

An easy-to-use framework for reinforcement learning in PyTorch, accelerated with Lightning Fabric. The algorithms sheeprl provides out of the box are:

| Algorithm | Coupled | Decoupled | Recurrent | Vector obs | Pixel obs | Status |
|---|---|---|---|---|---|---|
| A2C | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: |
| A3C | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :construction: |
| PPO | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| PPO Recurrent | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| SAC | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: |
| SAC-AE | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DroQ | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: |
| Dreamer-V1 | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Dreamer-V2 | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Dreamer-V3 | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Plan2Explore (Dreamer V1) | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Plan2Explore (Dreamer V2) | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Plan2Explore (Dreamer V3) | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |

More algorithms are coming soon! If you have a specific request, open a PR :sheep:

The action types supported by sheeprl agents are the following:

| Algorithm | Continuous | Discrete | Multi-Discrete |
|---|---|---|---|
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| A3C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| PPO | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| PPO Recurrent | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| SAC | :heavy_check_mark: | :x: | :x: |
| SAC-AE | :heavy_check_mark: | :x: | :x: |
| DroQ | :heavy_check_mark: | :x: | :x: |
| Dreamer-V1 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Dreamer-V2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Dreamer-V3 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Plan2Explore (Dreamer V1) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Plan2Explore (Dreamer V2) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Plan2Explore (Dreamer V3) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |

The environments supported by sheeprl are the following:

| Environment | Installation command | More info | Status |
|---|---|---|---|
| Classic Control | `pip install sheeprl` | | :heavy_check_mark: |
| Box2D | `pip install sheeprl[box2d]` | Install `swig` first with `pip install swig` | :heavy_check_mark: |
| Mujoco (Gymnasium) | `pip install sheeprl[mujoco]` | how_to/mujoco | :heavy_check_mark: |
| Atari | `pip install sheeprl[atari]` | how_to/atari | :heavy_check_mark: |
| DeepMind Control | `pip install sheeprl[dmc]` | how_to/dmc | :heavy_check_mark: |
| MineRL | `pip install sheeprl[minerl]` | how_to/minerl | :heavy_check_mark: |
| MineDojo | `pip install sheeprl[minedojo]` | how_to/minedojo | :heavy_check_mark: |
| DIAMBRA | `pip install sheeprl[diambra]` | how_to/diambra | :heavy_check_mark: |
| Crafter | `pip install sheeprl[crafter]` | https://github.com/danijar/crafter | :heavy_check_mark: |
| Super Mario Bros | `pip install sheeprl[supermario]` | https://github.com/Kautenja/gym-super-mario-bros/tree/master | :heavy_check_mark: |

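Why

We want to provide a reinforcement learning framework that is both simple and scalable, thanks to Lightning Fabric.

Moreover, in many RL repositories the algorithm is tightly coupled with the environment, which makes it harder to extend them beyond the gym interface. We want to provide a framework that makes it easy to decouple the RL algorithm from the environment, so that it can be used with any environment.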

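How to use

Installation

There are three options for installing SheepRL:

- Install the latest version directly from the PyPi index
- Clone the repo and install the local version
- pip-install the framework using the GitHub clone URL

Instructions for the three methods follow.

Install SheepRL from PyPi

You can install the latest version of SheepRL with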

pip install sheeprl

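[!NOTE]

To install optional dependencies, you can run for example `pip install sheeprl[atari,box2d,dev,mujoco,test]`

For a detailed list of all the optional dependencies you can install, see the What section

Clone and install the local version

First, clone the repo with: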

git clone https://github.com/Eclectic-Sheep/sheeprl.git
cd sheeprl

From inside the newly created folder, run

pip install .

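[!NOTE]

To install optional dependencies, you can run for example `pip install .[atari,box2d,dev,mujoco,test]`

Install the framework from the GitHub repo

If you haven't already done so, create an environment with venv or conda.

The example uses Python's standard venv module and assumes macOS or Linux.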

# create a virtual environment
python3 -m venv .venv

# activate the environment
source .venv/bin/activate

# if you do not want to install extras such as mujoco or atari, run
pip install "sheeprl @ git+https://github.com/Eclectic-Sheep/sheeprl.git"

# or, to install with atari and mujoco environment support, run
pip install "sheeprl[atari,mujoco,dev] @ git+https://github.com/Eclectic-Sheep/sheeprl.git"

# or, to install with box2d environment support, run
pip install swig
pip install "sheeprl[box2d] @ git+https://github.com/Eclectic-Sheep/sheeprl.git"

# or, to install with minedojo environment support, run
pip install "sheeprl[minedojo,dev] @ git+https://github.com/Eclectic-Sheep/sheeprl.git"

# or, to install with minerl environment support, run
pip install "sheeprl[minerl,dev] @ git+https://github.com/Eclectic-Sheep/sheeprl.git"

# or, to install with diambra environment support, run
pip install "sheeprl[diambra,dev] @ git+https://github.com/Eclectic-Sheep/sheeprl.git"

# or, to install with Super Mario Bros environment support, run
pip install "sheeprl[supermario,dev] @ git+https://github.com/Eclectic-Sheep/sheeprl.git"

# or, to install all the extras, run
pip install swig
pip install "sheeprl[box2d,atari,mujoco,minerl,supermario,dev,test] @ git+https://github.com/Eclectic-Sheep/sheeprl.git"

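Additional notes: installing on M-series Macs

[!WARNING]

If you are on an M-series Mac and hit an error attributed to box2dpy during installation, you need to install SWIG by following the instructions below.

Installing SWIG with homebrew is recommended to support Gym.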

# if needed, install homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# then, run
brew install swig

# then attempt to pip install with your preferred method, for example
pip install "sheeprl[atari,box2d,mujoco,dev,test] @ git+https://github.com/Eclectic-Sheep/sheeprl.git"

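Additional notes: MineRL and MineDojo

[!NOTE]

If you want to install minedojo or minerl environment support, Java JDK 8 is required: you can install it by following the instructions at this link.

[!NOTE]

The MineRL and MineDojo environments have conflicting requirements, so do not install them together with `pip install sheeprl[minerl,minedojo]`. Instead, install them separately with `pip install sheeprl[minerl]` or `pip install sheeprl[minedojo]` before running an experiment with the MineRL or MineDojo environment, respectively.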
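One way to catch the conflict early is to check, before launching an experiment, whether both packages ended up in the same Python environment. The sketch below is not part of SheepRL; the function names are hypothetical, and it only assumes that the packages are importable as `minerl` and `minedojo`.

```python
from importlib.util import find_spec


def installed_minecraft_backends(spec_finder=find_spec):
    """Return which of the two Minecraft packages can be imported here."""
    candidates = ("minerl", "minedojo")
    return [name for name in candidates if spec_finder(name) is not None]


def check_minecraft_setup(spec_finder=find_spec):
    """Raise if both packages share one environment, since their requirements conflict."""
    found = installed_minecraft_backends(spec_finder)
    if len(found) == 2:
        raise RuntimeError(
            "Both minerl and minedojo are installed; use a separate environment for each."
        )
    return found


if __name__ == "__main__":
    print("Minecraft backends found:", check_minecraft_setup())
```

The `spec_finder` parameter exists only so the check can be exercised without installing either package.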

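Run experiments with SheepRL

Now you can use one of the existing algorithms or create your own. For example, to train a PPO agent on the CartPole environment using only vector-like observations, just run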

python sheeprl.py exp=ppo env=gym env.id=CartPole-v1

if you installed SheepRL from a cloned repo, or

sheeprl exp=ppo env=gym env.id=CartPole-v1

if you installed SheepRL from PyPi.

Similarly, you can list all the available algorithms with

python sheeprl/available_agents.py

if you installed SheepRL from a cloned repo, or

sheeprl-agents

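if you installed SheepRL from PyPi.

That's all it takes to train an agent with SheepRL! 🎉

Before you start using the SheepRL framework, it is strongly recommended that you read the following how-to documents:

- How to run experiments
- How to modify the default configs
- How to work with steps
- How to select observations

Additional useful documents can be found in the howto folder; they give guidance on using the framework properly.

:chart_with_upwards_trend: Check your results

Once you have trained an agent, a new folder called logs is created, containing the logs of the training. You can visualize them with TensorBoard: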

tensorboard --logdir logs

https://github.com/Eclectic-Sheep/sheeprl/assets/7341604/46ad4acd-180d-449d-b46a-25b4a1f038d9
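If TensorBoard shows nothing, a quick check is whether any event files were written at all. The sketch below assumes the default `logs` folder mentioned above and relies only on TensorBoard's usual `events.out.tfevents.*` file naming; `find_event_files` is a hypothetical helper, and the layout of subfolders inside `logs` is not assumed.

```python
from pathlib import Path


def find_event_files(log_dir="logs"):
    """Return TensorBoard event files under log_dir, sorted by path."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    # TensorBoard writers name their files "events.out.tfevents.<suffix>".
    return sorted(p for p in root.rglob("events.out.tfevents.*") if p.is_file())


if __name__ == "__main__":
    for path in find_event_files():
        print(path.parent, "->", path.name)
```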

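:nerd_face: More about running an algorithm

What you ran is the PPO algorithm with the default configuration. You can also change the configuration by passing arguments to the script.

For example, in the default configuration the number of parallel environments is 4. Let's change it to 8 by passing the `env.num_envs` argument: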

sheeprl exp=ppo env=gym env.id=CartPole-v1 env.num_envs=8

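All the available arguments, with their descriptions, are listed in the sheeprl/configs directory. You can find more information about the hierarchy of configs here.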
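To make the dotted `key=value` syntax concrete, here is a simplified illustration of how an override such as `env.num_envs=8` addresses one key inside a nested configuration. It is not SheepRL's actual configuration code: `parse_value` and `apply_overrides` are hypothetical helpers, the default values other than `num_envs=4` are invented, and group selections such as `exp=ppo` are deliberately not modelled.

```python
import copy


def parse_value(text):
    """Convert a CLI string to bool, int, or float when it looks like one."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(config, overrides):
    """Return a deep copy of config with dotted key=value overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})  # create missing groups on the way
        node[leaf] = parse_value(raw_value)
    return result


# Hypothetical defaults; only num_envs=4 is taken from the text above.
defaults = {"env": {"id": "CartPole-v1", "num_envs": 4}}
print(apply_overrides(defaults, ["env.num_envs=8"]))
```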

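Running with Lightning Fabric

To run an algorithm with Lightning Fabric, you need to specify the Fabric parameters through the CLI. For example, to run the PPO algorithm with the default 4 parallel environments on 2 devices, you can run: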

sheeprl fabric.accelerator=cpu fabric.strategy=ddp fabric.devices=2 exp=ppo env=gym env.id=CartPole-v1

You can check the available parameters for Lightning Fabric here.
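When launching many runs from a script, it can help to assemble the command from a mapping instead of editing a long string. The sketch below rebuilds the command shown above and prints it; `build_command` and the `RUN_TRAINING` switch are hypothetical, and the actual launch is guarded so nothing runs unless you opt in and the `sheeprl` entry point is on your PATH.

```python
import shlex
import shutil
import subprocess


def build_command(overrides, entry_point="sheeprl"):
    """Turn a mapping of overrides into an argument list for the CLI."""
    return [entry_point] + [f"{key}={value}" for key, value in overrides.items()]


overrides = {
    "fabric.accelerator": "cpu",
    "fabric.strategy": "ddp",
    "fabric.devices": 2,
    "exp": "ppo",
    "env": "gym",
    "env.id": "CartPole-v1",
}
command = build_command(overrides)
print(shlex.join(command))

# Launch only when explicitly requested and the CLI is actually on PATH.
RUN_TRAINING = False
if RUN_TRAINING and shutil.which(command[0]) is not None:
    subprocess.run(command, check=True)
```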

Evaluate your agents

You can easily evaluate a trained agent from a checkpoint: the training configuration is retrieved automatically.

sheeprl-eval checkpoint_path=/path/to/checkpoint.ckpt fabric.accelerator=gpu env.capture_video=True

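For more information, see the corresponding how-to.

:book: Repository structure

The repository is structured as follows:

- algos: contains the implementations of the algorithms. Each algorithm lives in its own folder and (possibly) contains the following files:
  - `<algorithm>.py`: contains the implementation of the algorithm.
  - `<algorithm>_decoupled.py`: contains the implementation of the decoupled version of the algorithm, if present.
  - agent: optional, contains the implementation of the agent.
  - loss.py: contains the implementation of the loss functions of the algorithm.
  - utils.py: contains utility functions for the algorithm.
- configs: contains the default configs of the algorithms.
- data: contains the implementation of the data buffers.
- envs: contains the implementation of the environment wrappers.
- models: contains the implementation of some standard models (building blocks), such as the multi-layer perceptron (MLP) or a simple convolutional network (NatureCNN).
- utils: contains utility functions for the framework.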

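Coupled vs. decoupled

In the coupled version of an algorithm, the agent interacts with the environment and executes the training loop.

In the decoupled version, one process is only responsible for interacting with the environment, while all the other processes are responsible for executing the training loop. The two kinds of processes communicate through distributed collectives, using the abstraction provided by Fabric's TorchCollective.

Coupled

The algorithm is implemented in the `<algorithm>.py` file.

There are 2 functions in this script:

- `main()`: initializes all the components of the algorithm and runs the interaction with the environment. Once enough data has been collected, it runs the training loop by calling the `train()` function.
- `train()`: runs the training loop. It samples a batch of data from the buffer, computes the loss, and updates the agent's parameters.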
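The division of work between `main()` and `train()` can be sketched with a toy loop. This is not SheepRL code: the environment, policy, and "loss" are stand-ins, and parameters such as `rollout_steps` are names invented for the illustration.

```python
import random


def train(buffer, batch_size, rng):
    """Sample a batch and return a stand-in 'loss' (here: the mean reward)."""
    batch = rng.sample(buffer, batch_size)
    return sum(reward for _, reward in batch) / batch_size


def main(total_steps=20, rollout_steps=5, batch_size=4, seed=0):
    """Collect data and call train() each time a rollout is complete."""
    rng = random.Random(seed)
    buffer, losses = [], []
    for step in range(1, total_steps + 1):
        action = rng.choice([0, 1])    # stand-in for sampling from the policy
        reward = float(action)         # stand-in for stepping the environment
        buffer.append((action, reward))
        if step % rollout_steps == 0:  # enough data has been collected
            losses.append(train(buffer, batch_size, rng))
            buffer.clear()             # on-policy style: discard the used data
    return losses


if __name__ == "__main__":
    print(main())
```

With the defaults above, `train()` is called once per completed rollout, so four times in total.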

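Decoupled

The decoupled version of an algorithm is implemented in the `<algorithm>_decoupled.py` file.

There are 3 functions in this script:

- `main()`: initializes all the components of the algorithm and the collectives used for communication between the player and the trainers, then calls the `player()` and `trainer()` functions.
- `player()`: runs the interaction with the environment. It samples an action from the policy network, executes it in the environment, and stores the transition in the buffer. After a predefined number of interactions with the environment, the player randomly splits the collected data into nearly equal chunks and sends one to each trainer. It then waits for the trainers to finish the agent update.
- `trainer()`: runs the training loop. It receives a chunk of data from the player, computes the loss, and updates the agent's parameters. After the agent has been updated, the first trainer sends the updated agent weights back to the player, which can then interact with the environment again.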
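The player's "shuffle, then split into nearly equal chunks" step is the part most easily gotten wrong, so here is a toy sketch of one player/trainer exchange. It runs sequentially in a single process instead of using distributed collectives, all function names are hypothetical, and the "update" is a stand-in for a real gradient step.

```python
import random


def split_for_trainers(transitions, num_trainers, rng):
    """Shuffle transitions and split them into num_trainers nearly equal chunks."""
    if num_trainers <= 0:
        raise ValueError("num_trainers must be positive")
    shuffled = list(transitions)
    rng.shuffle(shuffled)
    base, extra = divmod(len(shuffled), num_trainers)
    chunks, start = [], 0
    for rank in range(num_trainers):
        size = base + (1 if rank < extra else 0)  # first `extra` chunks get one more
        chunks.append(shuffled[start:start + size])
        start += size
    return chunks


def trainer(chunk, weight, lr=0.1):
    """Stand-in update: move a single scalar weight toward the chunk mean."""
    target = sum(chunk) / len(chunk)
    return weight + lr * (target - weight)


def player_round(weight, num_trainers=3, steps=10, seed=0):
    """One exchange: collect data, split it, let every trainer update."""
    rng = random.Random(seed)
    transitions = [rng.random() for _ in range(steps)]  # stand-in for env interaction
    chunks = split_for_trainers(transitions, num_trainers, rng)
    updated = [trainer(chunk, weight) for chunk in chunks]
    # Mirror the description: the first trainer's weights go back to the player.
    return updated[0], [len(chunk) for chunk in chunks]


if __name__ == "__main__":
    print(player_round(0.0))
```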

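Algorithm implementations

You can check the README.md file in each algorithm's folder for details about the implementation.

All algorithms are kept as simple as possible, in the style of CleanRL. To improve flexibility and clarity, however, we tried to abstract away anything that is not directly related to the algorithm's training loop.

For example, we decided to create a models folder containing ready-made models that can be composed to create the model of the agent.

For each algorithm, the loss functions are kept in a separate module, so that their implementation is clear and they can easily be reused in the decoupled or recurrent version of the algorithm.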
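As an illustration of what a standalone loss function looks like, here is the standard PPO clipped surrogate objective written as a pure function. It is not copied from SheepRL's `loss.py`: the name `clipped_policy_loss` is hypothetical, and plain Python floats are used instead of PyTorch tensors to keep the sketch dependency-free.

```python
import math


def clipped_policy_loss(new_logprobs, old_logprobs, advantages, clip_coef=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize."""
    losses = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio pi_new / pi_old
        clipped_ratio = min(max(ratio, 1.0 - clip_coef), 1.0 + clip_coef)
        # Maximizing min(ratio * A, clip(ratio) * A) equals minimizing its negative.
        losses.append(-min(ratio * adv, clipped_ratio * adv))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # With identical old and new log-probabilities the ratio is 1 everywhere.
    print(clipped_policy_loss([0.0, 0.0], [0.0, 0.0], [1.0, 3.0]))
```

Because the function depends only on its arguments, the same code can be called from a coupled, decoupled, or recurrent training loop.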

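:card_index_dividers: Buffers

For the buffer implementation, we chose a wrapper around a dictionary of Numpy arrays.

To provide a simple way of using numpy memory-mapped arrays, we implemented sheeprl.utils.memmap.MemmapArray, a container that handles memory-mapped arrays. This flexibility keeps the implementation very simple: the ReplayBuffer, SequentialReplayBuffer, EpisodeBuffer, and EnvIndependentReplayBuffer classes cover all the buffers needed for on-policy and off-policy algorithms.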
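To give a feel for the "dictionary of Numpy arrays" idea, here is a toy ring buffer. It is not SheepRL's `ReplayBuffer` and does not use `MemmapArray`: `ToyDictBuffer` and its parameters are hypothetical, and the optional disk backing uses plain `numpy.memmap` as a stand-in.

```python
import os

import numpy as np


class ToyDictBuffer:
    """Hypothetical ring buffer storing each key in its own NumPy array."""

    def __init__(self, capacity, specs, memmap_dir=None):
        # specs maps a key to (per-step shape, dtype), e.g. {"obs": ((4,), np.float32)}
        self.capacity, self.pos, self.full = capacity, 0, False
        self.data = {}
        for key, (shape, dtype) in specs.items():
            full_shape = (capacity, *shape)
            if memmap_dir is None:
                self.data[key] = np.zeros(full_shape, dtype=dtype)
            else:
                # Back the array with a file so large buffers need not fit in RAM.
                path = os.path.join(memmap_dir, f"{key}.memmap")
                self.data[key] = np.memmap(path, dtype=dtype, mode="w+", shape=full_shape)

    def __len__(self):
        return self.capacity if self.full else self.pos

    def add(self, step):
        """Write one step, overwriting the oldest entry once the buffer is full."""
        for key, value in step.items():
            self.data[key][self.pos] = value
        self.pos = (self.pos + 1) % self.capacity
        self.full = self.full or self.pos == 0

    def sample(self, batch_size, rng):
        """Uniformly sample stored steps (with replacement) for every key."""
        idx = rng.integers(0, len(self), size=batch_size)
        return {key: np.asarray(arr[idx]) for key, arr in self.data.items()}


if __name__ == "__main__":
    buf = ToyDictBuffer(3, {"obs": ((2,), np.float32), "reward": ((1,), np.float32)})
    for i in range(5):
        buf.add({"obs": [i, i], "reward": [float(i)]})
    print(len(buf), buf.sample(2, np.random.default_rng(0))["obs"].shape)
```

Keeping every key in its own array means a new observation type only requires a new entry in `specs`, without touching the sampling logic.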