ViT-L-16-HTxt-Recap-CLIP

Project Overview: ViT-L-16-HTxt-Recap-CLIP

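ViT-L-16-HTxt-Recap-CLIP is a contrastive image-text model intended for zero-shot image classification. Zero-shot here means the model can assign an image to one of a set of candidate labels, supplied as text, without any labeled training examples for those specific classes. The model was trained on the Recap-DataComp-1B dataset.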

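Model Details

Model type: contrastive image-text, zero-shot image classification.
Source code: the project's code is available on GitHub.
Paper: the research behind the project is described in "What If We Recaption Billions of Web Images with LLaMA-3?" (arXiv:2406.08478).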

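Model Usage

The model can be used through the OpenCLIP library. The following example loads the model and tokenizer, then scores one image against four candidate labels: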

import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Scale the similarities by 100 and apply softmax over the candidate labels
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # expected to be close to [[0., 0., 0., 1.0]]
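
The scoring step in the example above (normalize both embeddings, take dot products, scale by 100, apply softmax) can be illustrated without loading the model. The following is a minimal sketch in plain Python, not the library's implementation: the three-dimensional vectors are made-up stand-ins for real embeddings, and the helper names `normalize`, `softmax`, and `zero_shot_probs` are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Cosine similarity between one image and each label, scaled, then softmax."""
    img = normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        txt = normalize(text_vec)
        logits.append(scale * sum(a * b for a, b in zip(img, txt)))
    return softmax(logits)

labels = ["a diagram", "a dog", "a cat", "a beignet"]

# Toy embeddings (assumption: invented for illustration, not model outputs).
image_vec = [0.9, 0.1, 0.4]
text_vecs = [
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.9],
    [0.3, 0.3, 0.3],
    [0.8, 0.1, 0.5],
]

probs = zero_shot_probs(image_vec, text_vecs)
best = labels[max(range(len(labels)), key=probs.__getitem__)]
print(best)
```

Because the similarities are multiplied by 100 before the softmax, even a modest gap in cosine similarity pushes nearly all of the probability onto the closest label, which is why the real example prints a distribution close to one-hot.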

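Bias, Risks, and Limitations

The training data is an image-text dataset whose captions were generated with LLaVA-1.5-LLaMA3-8B, so it may carry the biases and inaccuracies inherent in web-crawled data. Users should keep these biases, risks, and limitations in mind when using the model. For further details and a description of the dataset, see the dataset card.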

Citation

If you use this work, please cite the following paper:

@article{li2024recaption,
      title={What If We Recaption Billions of Web Images with LLaMA-3?}, 
      author={Xianhang Li and Haoqin Tu and Mude Hui and Zeyu Wang and Bingchen Zhao and Junfei Xiao and Sucheng Ren and Jieru Mei and Qing Liu and Huangjie Zheng and Yuyin Zhou and Cihang Xie},
      journal={arXiv preprint arXiv:2406.08478},
      year={2024}
}