从分析合成的自顶向下视觉注意力

这是 AbSViT 的官方代码库,来自以下论文:

从分析合成的自顶向下视觉注意力,CVPR 2023 Baifeng Shi, Trevor Darrell, 和 Xin Wang 加州大学伯克利分校,微软研究院

待办事项

在视觉语言数据集上微调

环境

从官方网站安装 PyTorch 1.7.0+ 和 torchvision 0.8.1+。

requirements.txt 列出了所有依赖项:

pip install -r requirements.txt

此外,还请安装 magickwand 库:

apt-get install libmagickwand-dev

演示

ImageNet 演示: demo/demo.ipynb 给出了在 ImageNet 单目标和多目标图像上可视化 AbSViT 注意力图的示例。由于模型仅在单目标识别上训练,自顶向下的注意力相当弱。

VQA 演示: vision_language/demo/visualize_attention.ipynb 给出了 AbSViT 的自顶向下注意力如何适应同一图像上不同问题的示例。

模型库

名称	ImageNet	ImageNet-C (↓)	PASCAL VOC	Cityscapes	ADE20K	权重
ViT-Ti	72.5	71.1	-	-	-	模型
AbSViT-Ti	74.1	66.7	-	-	-	模型
ViT-S	80.1	54.6	-	-	-	模型
AbSViT-S	80.7	51.6	-	-	-	模型
ViT-B	80.8	49.3	80.1	75.3	45.2	模型
AbSViT-B	81.0	48.3	81.3	76.8	47.2	模型

图像分类评估

例如,要在 ImageNet 上评估 AbSViT_small,运行

python main.py --model absvit_small_patch16_224 --data-path path/to/imagenet --eval --resume path/to/checkpoint

要在鲁棒性基准上评估,请添加 --inc_path /path/to/imagenet-c、--ina_path /path/to/imagenet-a、--inr_path /path/to/imagenet-r 或 --insk_path /path/to/imagenet-sketch 之一来测试 ImageNet-C、ImageNet-A、ImageNet-R 或 ImageNet-Sketch。

如果要测试对抗性攻击下的准确性,请添加 --fgsm_test 或 --pgd_test。

语义分割评估

请参阅 segmentation 以获取说明。

训练

以 AbSViT_small 为例。我们使用单节点 8 个 GPU 进行训练:

python -m torch.distributed.launch --nproc_per_node=8 --master_port 12345  main.py --model absvit_small_patch16_224 --data-path path/to/imagenet  --output_dir output/here  --num_workers 8 --batch-size 128 --warmup-epochs 10

要训练不同的模型架构,请更改参数 --model。我们提供 ViT_{tiny, small, base} 和 AbSViT_{tiny, small, base} 的选择。

视觉语言数据集微调

请参阅 vision_language 以获取说明。

链接

此代码库基于 "Visual Attention Emerges from Recurrent Sparse Reconstruction" 和 "Towards Robust Vision Transformer" 的官方代码构建。

引用

如果您发现此代码有帮助,请考虑引用我们的工作:

@inproceedings{shi2023top,
  title={Top-Down Visual Attention from Analysis by Synthesis},
  author={Shi, Baifeng and Darrell, Trevor and Wang, Xin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2102--2112},
  year={2023}
}