`ffcv` ImageNet Training

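A minimal, single-file PyTorch ImageNet training script designed to be easy to modify. Run `train_imagenet.py` to get...

...high accuracy on ImageNet
...with as many lines of code as the PyTorch ImageNet example
...in one tenth of the time.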

Results

Train models more efficiently, either with 8 GPUs in parallel or by training 8 ResNet-18s at once.
<img src="https://yellow-cdn.veclightyear.com/ab5030c0/103021f8-231b-4198-8807-c46e476eba04.svg" width='830px'/>

See the benchmark setup at https://docs.ffcv.io/benchmarks.html.

Citation

If you use this setup in your research, please cite:

```
@misc{leclerc2022ffcv,
    author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
    title = {ffcv},
    year = {2022},
    howpublished = {\url{https://github.com/libffcv/ffcv/}},
    note = {commit xxxxxxx}
}
```

(Make sure to replace xxxxxxx above with the hash of the commit you used!)

Configurations

The configuration files corresponding to the results above are:

| Config link | top_1 | top_5 | # Epochs | Time (mins) | Architecture | Setup |
|:---|---:|---:|---:|---:|:---|:---|
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_88_epochs.yaml'>Link</a> | 0.784 | 0.941 | 88 | 77.2 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_56_epochs.yaml'>Link</a> | 0.780 | 0.937 | 56 | 49.4 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_40_epochs.yaml'>Link</a> | 0.772 | 0.932 | 40 | 35.6 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_32_epochs.yaml'>Link</a> | 0.766 | 0.927 | 32 | 28.7 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_24_epochs.yaml'>Link</a> | 0.756 | 0.921 | 24 | 21.7 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_16_epochs.yaml'>Link</a> | 0.738 | 0.908 | 16 | 14.9 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_88_epochs.yaml'>Link</a> | 0.724 | 0.903 | 88 | 187.3 | ResNet-18 | 1 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_56_epochs.yaml'>Link</a> | 0.713 | 0.899 | 56 | 119.4 | ResNet-18 | 1 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_40_epochs.yaml'>Link</a> | 0.706 | 0.894 | 40 | 85.5 | ResNet-18 | 1 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_32_epochs.yaml'>Link</a> | 0.700 | 0.889 | 32 | 68.9 | ResNet-18 | 1 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_24_epochs.yaml'>Link</a> | 0.688 | 0.881 | 24 | 51.6 | ResNet-18 | 1 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_16_epochs.yaml'>Link</a> | 0.669 | 0.868 | 16 | 35.0 | ResNet-18 | 1 x A100 |
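
To compare the configurations by cost rather than wall-clock time alone, the table's figures can be converted into minutes per epoch and total GPU-minutes. The sketch below uses only values copied from the table. The GPU-minutes column is an illustrative derived quantity (wall-clock minutes multiplied by the GPU count in the Setup column), not a benchmark reported by the authors, and the function name `summarize` is hypothetical.

```python
# (architecture, epochs, top_1, wall-clock minutes, number of GPUs),
# copied from the configuration table.
CONFIGS = [
    ("ResNet-50", 88, 0.784, 77.2, 8),
    ("ResNet-50", 56, 0.780, 49.4, 8),
    ("ResNet-50", 40, 0.772, 35.6, 8),
    ("ResNet-50", 32, 0.766, 28.7, 8),
    ("ResNet-50", 24, 0.756, 21.7, 8),
    ("ResNet-50", 16, 0.738, 14.9, 8),
    ("ResNet-18", 88, 0.724, 187.3, 1),
    ("ResNet-18", 56, 0.713, 119.4, 1),
    ("ResNet-18", 40, 0.706, 85.5, 1),
    ("ResNet-18", 32, 0.700, 68.9, 1),
    ("ResNet-18", 24, 0.688, 51.6, 1),
    ("ResNet-18", 16, 0.669, 35.0, 1),
]


def summarize(configs):
    """Derive per-epoch time and GPU-minutes for each configuration."""
    rows = []
    for arch, epochs, top1, minutes, gpus in configs:
        rows.append({
            "arch": arch,
            "epochs": epochs,
            "top_1": top1,
            "min_per_epoch": minutes / epochs,  # wall-clock minutes per epoch
            "gpu_minutes": minutes * gpus,      # illustrative total GPU time
        })
    return rows


if __name__ == "__main__":
    for r in summarize(CONFIGS):
        print(f'{r["arch"]:9s} {r["epochs"]:3d} epochs  top_1={r["top_1"]:.3f}  '
              f'{r["min_per_epoch"]:.2f} min/epoch  {r["gpu_minutes"]:.1f} GPU-min')
```

Read this way, each ResNet-50 run finishes sooner than the ResNet-18 run with the same epoch count, but occupies eight GPUs while doing so.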

Training Models

First, use pip to install the requirements file in this directory:

```bash
pip install -r requirements.txt
```

Then, generate an ImageNet dataset. The following commands create the dataset used for the results above (`IMAGENET_DIR` should point to a PyTorch-style ImageNet dataset):

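```bash
# Environment variables required by the script:
export IMAGENET_DIR=/path/to/pytorch/format/imagenet/directory/
export WRITE_DIR=/your/path/here/

# Starting from the root of the Git repository:
cd examples;
# Serialize the images with the following parameters:
# - maximum side length of 500 pixels
# - 50% of images JPEG-encoded
# - JPEG quality of 90
./write_imagenet.sh 500 0.50 90
```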
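
Because the dataset-writing step is slow to redo, it can help to confirm that `IMAGENET_DIR` looks right before starting it. The following is a minimal pre-flight sketch, not part of this repository. It assumes that "PyTorch-style" means `train/` and `val/` subdirectories, each holding one folder per class (the torchvision `ImageFolder` convention), and the helper name `check_imagenet_layout` is hypothetical.

```python
import os
from pathlib import Path


def check_imagenet_layout(root):
    """Return a list of problems; an empty list means the layout looks usable.

    Assumes (not confirmed by this README) that the dataset root contains
    `train/` and `val/` directories with one subdirectory per class.
    """
    problems = []
    root = Path(root)
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
            continue
        # Each class is expected to be a subdirectory of the split.
        has_class_dirs = any(p.is_dir() for p in split_dir.iterdir())
        if not has_class_dirs:
            problems.append(f"no class subdirectories in {split_dir}")
    return problems


if __name__ == "__main__":
    imagenet_dir = os.environ.get("IMAGENET_DIR")
    if imagenet_dir is None:
        print("IMAGENET_DIR is not set")
    else:
        for message in check_imagenet_layout(imagenet_dir) or ["layout looks OK"]:
            print(message)
```

The check only inspects directory structure; it does not open or validate any image files.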

Then, choose a configuration from the [configuration table](#configurations). With the path to the config file in hand, train as follows:

```bash
# 8-GPU training (ResNet-18 training uses only 1)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Set the visible GPUs according to the `world_size` config parameter
# Adjust `data.in_memory` and `data.num_workers` to suit your machine
python train_imagenet.py --config-file rn50_configs/<your config file>.yaml \
    --data.train_dataset=/path/to/train/dataset.ffcv \
    --data.val_dataset=/path/to/val/dataset.ffcv \
    --data.num_workers=12 --data.in_memory=1 \
    --logging.folder=/your/path/here