ImageBind: One Embedding Space To Bind Them All

FAIR, Meta AI

Rohit Girdhar*, Alaaeldin El-Nouby*, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra*

To appear at CVPR 2023 (Highlighted paper)

[Paper] [Blog] [Demo] [Supplementary Video] [BibTex]

PyTorch implementation and pretrained models for ImageBind. For details, see the paper: ImageBind: One Embedding Space To Bind Them All.

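ImageBind learns a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications "out of the box", including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation.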
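
To make cross-modal retrieval and arithmetic composition concrete, here is a minimal sketch that sums two embeddings and retrieves the nearest gallery item by cosine similarity. The 3-dimensional vectors, the gallery labels, and the helper names (`normalize`, `cosine`, `retrieve`) are hypothetical toy values chosen for illustration; they are not ImageBind outputs or part of its API. Real embeddings would come from the model.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, gallery):
    """Return the gallery label whose embedding is most similar to the query."""
    return max(gallery, key=lambda label: cosine(query, gallery[label]))

# Toy embeddings (hypothetical values, not produced by ImageBind).
image_of_beach = [1.0, 0.0, 0.0]
audio_of_rain = [0.0, 1.0, 0.0]

gallery = {
    "sunny beach": [0.9, 0.1, 0.0],
    "rainy street": [0.1, 0.9, 0.1],
    "rainy beach": [0.7, 0.7, 0.0],
}

# Compose the two modalities by summing their unit-normalized embeddings.
combined = [
    x + y for x, y in zip(normalize(image_of_beach), normalize(audio_of_rain))
]

print(retrieve(image_of_beach, gallery))  # query with the image alone
print(retrieve(combined, gallery))        # query with image + audio
```

With these toy values the image-only query lands on the beach item, while the summed query moves toward the item that shares both concepts. This works only because all modalities live in one embedding space.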


ImageBind model

Emergent zero-shot classification performance.

<table style="margin: auto">
  <tr>
    <th>Model</th>
    <th>IN1k</th>
    <th>K400</th>
    <th>NYU-D</th>
    <th>ESC</th>
    <th>LLVIP</th>
    <th>Ego4D</th>
    <th>Download</th>
  </tr>
  <tr>
    <td>imagebind_huge</td>
    <td align="right">77.7</td>
    <td align="right">50.0</td>
    <td align="right">54.0</td>
    <td align="right">66.9</td>
    <td align="right">63.4</td>
    <td align="right">25.0</td>
    <td><a href="https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth">checkpoint</a></td>
  </tr>
</table>
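
Zero-shot classification in a joint embedding space is usually done by comparing an input's embedding with the text embeddings of the candidate class names and picking the closest one. The sketch below illustrates that decision rule under stated assumptions: the 2-dimensional embeddings, the class labels, and the function names `softmax` and `zero_shot_classify` are hypothetical, and the exact evaluation protocol behind the numbers above is described in the paper, not here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(input_embedding, class_embeddings):
    """Pick the class whose text embedding has the largest dot product.

    class_embeddings maps a class label to its (hypothetical) text embedding.
    Returns the predicted label and a label -> probability mapping.
    """
    labels = list(class_embeddings)
    scores = [
        sum(x * y for x, y in zip(input_embedding, class_embeddings[label]))
        for label in labels
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))

# Toy values (assumptions for illustration only).
class_embeddings = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}
depth_embedding = [0.2, 0.8]  # stands in for an embedding of a non-text input

label, probs = zero_shot_classify(depth_embedding, class_embeddings)
print(label, probs)
```

No classifier is trained for the new modality here; the class set is defined entirely by the text prompts.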

Usage

Install PyTorch 1.13+ and other third-party dependencies.

conda create --name imagebind python=3.10 -y
conda activate imagebind

pip install .
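
After installing, it can be useful to confirm that the environment satisfies the PyTorch 1.13+ requirement. The helper below is a small sketch for that check; `meets_minimum` is a hypothetical name, not part of ImageBind or PyTorch, and it compares only the major and minor version numbers.

```python
import re

def meets_minimum(version, minimum=(1, 13)):
    """Return True if a version string such as '2.0.1+cu118' is at least `minimum`.

    Only the leading 'major.minor' part is compared; local suffixes are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum

print(meets_minimum("1.13.1+cu117"))
print(meets_minimum("1.12.0"))
```

In a real environment you would pass `torch.__version__` as the argument.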

For Windows users, you might need to install soundfile for reading/writing audio files. (Thanks @congyue1977)

pip install soundfile

Extract and compare features across modalities (e.g. image, text, and audio).

from imagebind import data
import torch
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

text_list=["A dog.", "A car", "A bird"]
image_paths=[".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths=[".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Text: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Audio x Text: ",
    torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Vision x Audio: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)

# Expected output:
#
# Vision x Text:
# tensor([[9.9761e-01, 2.3694e-03, 1.8612e-05],
#         [3.3836e-05, 9.9994e-01, 2.4118e-05],
#         [4.7997e-05, 1.3496e-02, 9.8646e-01]])
#
# Audio x Text:
# tensor([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]])
#
# Vision x Audio:
# tensor([[0.8070, 0.1088, 0.0842],
#         [0.1036, 0.7884, 0.1079],
#         [0.0018, 0.0022, 0.9960]])
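
Each row of these matrices is a softmax over one input's similarities to every candidate in the other modality, so the largest entry in a row marks the best match. As a small illustration, the sketch below takes the first matrix from the expected output and reads off the best text for each image. The short labels and the helper `best_matches` are hypothetical names introduced here; they simply follow the order of the text prompts.

```python
# Short labels in the same order as the text prompts (hypothetical names).
labels = ["dog", "car", "bird"]

# First matrix of the expected output: rows are images, columns are texts.
vision_x_text = [
    [9.9761e-01, 2.3694e-03, 1.8612e-05],
    [3.3836e-05, 9.9994e-01, 2.4118e-05],
    [4.7997e-05, 1.3496e-02, 9.8646e-01],
]

def best_matches(prob_rows, column_labels):
    """For each row, return (label, probability) of the highest-scoring column."""
    matches = []
    for row in prob_rows:
        idx = max(range(len(row)), key=row.__getitem__)
        matches.append((column_labels[idx], row[idx]))
    return matches

for image_idx, (label, prob) in enumerate(best_matches(vision_x_text, labels)):
    print(f"image {image_idx}: best text match is {label!r} (p={prob:.4f})")
```

A diagonal-dominant matrix, as here, means every image was paired with its own caption.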

Model card

Please see the model card for details.

License

ImageBind code and model weights are released under the CC-BY-NC 4.0 license. See LICENSE for additional details.

Contributing

See contributing and the code of conduct.

Citing ImageBind

If you find this repository useful, please consider giving it a star :star: and citing it:

@inproceedings{girdhar2023imagebind,
  title={ImageBind: One Embedding Space To Bind Them All},
  author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
  booktitle={CVPR},
  year={2023}
}