VideoLLaMA2

<p align="center"> <img src="https://yellow-cdn.veclightyear.com/835a84d5/efd17735-0e5d-4732-bb4f-d00629e3a21f.png" width="150" style="margin-bottom: 0.2;"/> </p> <h3 align="center"><a href="https://arxiv.org/abs/2406.07476" style="color:#9C276A"> VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs</a></h3> <h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 </h5> <h5 align="center">

</h5>

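<details open><summary>💡 Some other multimodal-LLM projects from our team may interest you ✨. </summary><p>

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding <br> Hang Zhang, Xin Li, Lidong Bing <br> <br>

VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding <br> Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing <br> <br>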

</p></details>

<div align="center"><video src="https://github.com/DAMO-NLP-SG/VideoLLaMA2/assets/18526640/e0e7951c-f392-42ed-afad-b2c7984d3e38" width="800"></div>

## 📰 News

* **[2024.07.30]** Released checkpoints of [VideoLLaMA2-8x7B-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-8x7B-Base) and [VideoLLaMA2-8x7B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-8x7B).
* **[2024.06.25]** 🔥🔥 As of June 25, our [VideoLLaMA2-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-7B-16F) is the **top-ranked** ~7B-sized VideoLLM on the [MLVU Leaderboard](https://github.com/JUNJIE99/MLVU?tab=readme-ov-file#trophy-mini-leaderboard).
* **[2024.06.18]** 🔥🔥 As of June 18, our [VideoLLaMA2-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-7B-16F) is the **top-ranked** ~7B-sized VideoLLM on the [VideoMME Leaderboard](https://video-mme.github.io/home_page.html#leaderboard).
* **[2024.06.17]** 👋👋 Updated the technical report with the latest results and previously missing references. If you have work closely related to VideoLLaMA 2 that is not mentioned in the paper, feel free to let us know.
* **[2024.06.14]** 🔥🔥 The [online demo](https://huggingface.co/spaces/lixin4ever/VideoLLaMA2) is available.
* **[2024.06.03]** Released the training, evaluation, and serving code of VideoLLaMA 2.

<img src="https://github.com/DAMO-NLP-SG/VideoLLaMA2/assets/18526640/b9faf24f-bdd2-4728-9385-acea17ea086d" width="800" />

## 🛠️ Requirements and Installation

Basic dependencies:

* Python >= 3.8
* PyTorch >= 2.2.0
* CUDA version >= 11.8
* transformers >= 4.41.2 (for the Mistral tokenizer)
* tokenizers >= 0.19.1 (for the Mistral tokenizer)
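
As a quick sanity check, the minimum versions listed above can be compared against the current Python environment. The sketch below is illustrative and is not part of the repository: the helper names `parse_version` and `check_requirements` are hypothetical, and it only covers the Python packages (not the CUDA toolkit), using `importlib.metadata` from the standard library.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
MINIMUMS = {"torch": "2.2.0", "transformers": "4.41.2", "tokenizers": "0.19.1"}


def parse_version(text):
    """Turn a string such as '2.2.0+cu118' into (2, 2, 0).

    Local build tags and non-numeric suffixes are ignored, and the
    result is padded with zeros to at least three components.
    """
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return {package: (installed_version_or_None, required, ok)}."""
    report = {}
    for name, required in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = (installed is not None
              and parse_version(installed) >= parse_version(required))
        report[name] = (installed, required, ok)
    return report


if __name__ == "__main__":
    for name, (installed, required, ok) in check_requirements().items():
        status = "OK" if ok else "MISSING or TOO OLD"
        print(f"{name}: installed={installed}, required>={required} [{status}]")
```

Running the script prints one line per package; anything flagged as missing or too old should be resolved before installing `flash-attn`.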

[Online Mode] Install the required packages (better for development):

git clone https://github.com/DAMO-NLP-SG/VideoLLaMA2
cd VideoLLaMA2
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation

[Offline Mode] Install VideoLLaMA2 as a Python package (better for direct use):

git clone https://github.com/DAMO-NLP-SG/VideoLLaMA2
cd VideoLLaMA2
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install flash-attn==2.5.8 --no-build-isolation

## 🚀 Main Results

Multi-Choice Video QA & Video Captioning

Open-Ended Video QA

## :earth_americas: Model Zoo

| Model Name | Model Type | Visual Encoder | Language Decoder | # Training Frames |
|:-----------|:-----------|:---------------|:-----------------|:-----------------:|
| VideoLLaMA2-7B-Base | Base | clip-vit-large-patch14-336 | Mistral-7B-Instruct-v0.2 | 8 |
| VideoLLaMA2-7B | Chat | clip-vit-large-patch14-336 | Mistral-7B-Instruct-v0.2 | 8 |
| VideoLLaMA2-7B-16F-Base | Base | clip-vit-large-patch14-336 | Mistral-7B-Instruct-v0.2 | 16 |
| VideoLLaMA2-7B-16F | Chat | clip-vit-large-patch14-336 | Mistral-7B-Instruct-v0.2 | 16 |
| VideoLLaMA2-8x7B-Base | Base | clip-vit-large-patch14-336 | Mixtral-8x7B-Instruct-v0.1 | 8 |
| VideoLLaMA2-8x7B | Chat | clip-vit-large-patch14-336 | Mixtral-8x7B-Instruct-v0.1 | 8 |
| VideoLLaMA2-72B-Base | Base | clip-vit-large-patch14-336 | Qwen2-72B-Instruct | 8 |
| VideoLLaMA2-72B | Chat | clip-vit-large-patch14-336 | Qwen2-72B-Instruct | 8 |
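
When scripting against several checkpoints, it can help to keep the table above in a small lookup structure. The following is only a sketch: `MODEL_ZOO`, `repo_id`, and `find_models` are hypothetical names, and the `DAMO-NLP-SG/<model name>` identifier pattern is an assumption extrapolated from the links that appear elsewhere in this README, so verify each identifier before downloading.

```python
# (model type, language decoder, training frames), transcribed from the table above.
# All listed models share the clip-vit-large-patch14-336 visual encoder.
MODEL_ZOO = {
    "VideoLLaMA2-7B-Base": ("base", "Mistral-7B-Instruct-v0.2", 8),
    "VideoLLaMA2-7B": ("chat", "Mistral-7B-Instruct-v0.2", 8),
    "VideoLLaMA2-7B-16F-Base": ("base", "Mistral-7B-Instruct-v0.2", 16),
    "VideoLLaMA2-7B-16F": ("chat", "Mistral-7B-Instruct-v0.2", 16),
    "VideoLLaMA2-8x7B-Base": ("base", "Mixtral-8x7B-Instruct-v0.1", 8),
    "VideoLLaMA2-8x7B": ("chat", "Mixtral-8x7B-Instruct-v0.1", 8),
    "VideoLLaMA2-72B-Base": ("base", "Qwen2-72B-Instruct", 8),
    "VideoLLaMA2-72B": ("chat", "Qwen2-72B-Instruct", 8),
}


def repo_id(name, org="DAMO-NLP-SG"):
    """Build a hub identifier; the '<org>/<name>' pattern is an assumption."""
    if name not in MODEL_ZOO:
        raise KeyError(f"unknown model: {name}")
    return f"{org}/{name}"


def find_models(model_type=None, frames=None):
    """List model names matching the given type and/or training frame count."""
    return [
        name
        for name, (kind, _decoder, n_frames) in MODEL_ZOO.items()
        if (model_type is None or kind == model_type)
        and (frames is None or n_frames == frames)
    ]


if __name__ == "__main__":
    for name in find_models(model_type="chat"):
        print(repo_id(name))
```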

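## 🤗 Demo

We strongly recommend trying our online demo first.

To run the video-based LLM (Large Language Model) web demo on your own device, first make sure the required model checkpoints are available, then follow the steps below to launch it.

Single-model Version

Launch the gradio app directly (VideoLLaMA2-7B is used by default):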

python videollama2/serve/gradio_web_server_adhoc.py

Multi-model Version

Launch a global controller

cd /path/to/VideoLLaMA2
python -m videollama2.serve.controller --host 0.0.0.0 --port 10000

Launch a gradio web server

python -m videollama2.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload

Launch one or more model workers

#  export HF_ENDPOINT=https://hf-mirror.com  # If you cannot access Hugging Face, try uncommenting this line.
python -m videollama2.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path /PATH/TO/MODEL1
python -m videollama2.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40001 --worker http://localhost:40001 --model-path /PATH/TO/MODEL2
python -m videollama2.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40002 --worker http://localhost:40002 --model-path /PATH/TO/MODEL3
...
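
The worker commands above follow a regular pattern: each checkpoint gets its own port, and `--worker` repeats that port. A small generator can avoid copy-paste mistakes when serving many models. This is a sketch, not a script shipped with the repository; `worker_commands` is a hypothetical helper, and the default controller address, host, and base port simply mirror the values shown above.

```python
import shlex


def worker_commands(model_paths, controller="http://localhost:10000",
                    host="0.0.0.0", base_port=40000):
    """Build one model_worker command line per checkpoint, one port each."""
    commands = []
    for offset, model_path in enumerate(model_paths):
        port = base_port + offset
        args = [
            "python", "-m", "videollama2.serve.model_worker",
            "--host", host,
            "--controller", controller,
            "--port", str(port),
            "--worker", f"http://localhost:{port}",
            "--model-path", model_path,
        ]
        # shlex.quote keeps paths containing spaces safe for the shell.
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


if __name__ == "__main__":
    for command in worker_commands(["/PATH/TO/MODEL1", "/PATH/TO/MODEL2"]):
        print(command)
```

Each printed line can be run in its own terminal (or under a process manager) after the controller and the gradio web server are up.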

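## 🗝️ Training & Evaluation

Quick Start

To facilitate further development on top of our codebase, we provide a quick-start guide on how to train a customized VideoLLaMA2 with the VideoLLaVA dataset and evaluate the trained model on mainstream video-LLM benchmarks.

Training data structure: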

VideoLLaMA2
├── datasets
│   ├── videollava_pt
│   │   ├── llava_image/ # Available at: https://pan.baidu.com/s/17GYcE69FcJjjUM0e4Gad2w?pwd=9ga3 or https://drive.google.com/drive/folders/1QmFj2FcMAoWNCUyiUtdcW0-IOhLbOBcf?usp=drive_link
│   │   ├── valley/      # Available at: https://pan.baidu.com/s/1jluOimE7mmihEBfnpwwCew?pwd=jyjz or https://drive.google.com/drive/folders/1QmFj2FcMAoWNCUyiUtdcW0-IOhLbOBcf?usp=drive_link
│   │   └── valley_llavaimage.json # Available at: https://drive.google.com/file/d/1zGRyVSUMoczGq6cjQFmT0prH67bu2wXD/view; contains 703K video-text pairs and 558K image-text pairs
│   ├── videollava_sft
│   │   ├── llava_image_tune/  # Available at: https://pan.baidu.com/s/1l-jT6t_DlN5DTklwArsqGw?pwd=o6ko
│   │   ├── videochatgpt_tune/ # Available at: https://pan.baidu.com/s/10hJ_U7wVmYTUo75YHc_n8g?pwd=g1hf
│   │   └── videochatgpt_llavaimage_tune.json # Available at: https://drive.google.com/file/d/1zGRyVSUMoczGq6cjQFmT0prH67bu2wXD/view; contains 100K video-centric, 625K image-centric, and 40K text-only conversations
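
Before launching a long training run, it may be worth confirming that the layout above is actually in place. The sketch below is an illustration rather than part of the codebase: `EXPECTED` and `missing_entries` are hypothetical names, and the check only tests that each path exists, not that the downloads are complete.

```python
from pathlib import Path

# Relative paths transcribed from the training data structure above.
EXPECTED = [
    "datasets/videollava_pt/llava_image",
    "datasets/videollava_pt/valley",
    "datasets/videollava_pt/valley_llavaimage.json",
    "datasets/videollava_sft/llava_image_tune",
    "datasets/videollava_sft/videochatgpt_tune",
    "datasets/videollava_sft/videochatgpt_llavaimage_tune.json",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the repository root (the VideoLLaMA2 directory).
    missing = missing_entries(".")
    if missing:
        print("Missing dataset entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected dataset entries are present.")
```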

Commands:

# VideoLLaMA2-vllava pretraining
bash scripts/vllava/pretrain.sh
# VideoLLaMA2-vllava finetuning
bash scripts/vllava/finetune.sh

Evaluation data structure:

VideoLLaMA2
├── eval
│   ├── egoschema # Official website: https://github.com/egoschema/EgoSchema
│   │   ├── good_clips_git/ # Available at: https://drive.google.com/drive/folders/1SS0VVz8rML1e5gWq7D7VtP1oxE2UtmhQ
│   │   └── questions.json  # Available at: https://github.com/egoschema/EgoSchema/blob/main/questions.json
│   ├── mvbench # Official website: https://huggingface.co/datasets/OpenGVLab/MVBench
│   │   ├── video/
│   │   │   ├── clevrer/
│   │   │   └── ...
│   │   └── json/
│   │       ├── action_antonym.json
│   │       └── ...
│   ├── perception_test_mcqa # Official website: https://huggingface.co/datasets/OpenGVLab/MVBench
│   │   ├── videos/ # Available at: https://storage.googleapis.com/dm-perception-test/zip_data/test_videos.zip
│   │   └── mc_question_test.json # Download from: https://storage.googleapis.com/dm-perception-test/zip_data/mc_question_test_annotations.zip
│   ├── videomme # Official website: https://video-mme.github.io/home_page.html#leaderboard
│   │   ├── test-00000-of-00001.parquet
│   │   ├── videos/
│   │   └── subtitles/
│   ├── Activitynet_Zero_Shot_QA # Official website: https://github.com/MILVLG/activitynet-qa
│   │   ├── all_test/   # Available at: https://mbzuaiac-my.sharepoint.com/:u:/g/personal/hanoona_bangalath_mbzuai_ac_ae/EatOpE7j68tLm2XAd0u6b8ABGGdVAwLMN6rqlDGM_DwhVA?e=90WIuW
│   │   ├── test_q.json # Available at: https://github.com/MILVLG/activitynet-qa/tree/master/dataset
│   │   └── test_a.json # Available at: https://github.com/MILVLG/activitynet-qa/tree/master/dataset
│   ├── MSVD_Zero_Shot_QA # Official website: https://github.com/xudejing/video-question-answering
│   │   ├── videos/
│   │   ├── test_q.json
│   │   └── test_a.json
│   ├── videochatgpt_gen # Official website: https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main/quantitative_evaluation
│   │   ├── Test_Videos/ # Available at: https://mbzuaiac-my.sharepoint.com/:u:/g/personal/hanoona_bangalath_mbzuai_ac_ae/EatOpE7j68tLm2XAd0u6b8ABGGdVAwLMN6rqlDGM_DwhVA?e=90WIuW
│   │   ├── Test_Human_Annotated_Captions/ # Available at: https://mbzuaiac-my.sharepoint.com/personal/hanoona_bangalath_mbzuai_ac_ae/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FQuantitative%5FEvaluation%2Fbenchamarking%2FTest%5FHuman%5FAnnotated%5FCaptions%2Ezip&parent=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FQuantitative%5FEvaluation%2Fbenchamarking&ga=1
│   │   ├── generic_qa.json     # These three json files are available at: https://mbzuaiac-my.sharepoint.com/personal/hanoona_bangalath_mbzuai_ac_ae/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FQuantitative%5FEvaluation%2Fbenchamarking%2FBenchmarking%5FQA&ga=1
│   │   ├── temporal_qa.json
│   │   └── consistency_qa.json

Commands:

# mvbench evaluation
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/eval_video_qa_mvbench.sh
# activitynet-qa evaluation (requires setting the Azure OpenAI key/endpoint/deployname)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/eval_video_qa_mvbench.sh

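Data Format

If you want to train a video-language model on your own data, prepare the video/image SFT data as follows:

Suppose your data is organized like this:

VideoLLaMA2
├── datasets
│   ├── custom_sft
│   │   ├── images
│   │   ├── videos
│   │   └── custom.json

Then reorganize the annotated video/image SFT data into the following format: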

[
    {
        "id": 0,
        "video": "images/xxx.jpg",
        "conversations": [
            {
                "from": "human",
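                "value": "<image>\nWhat are the colors of the bus in the image?"
            },
            {
                "from": "gpt",
                "value": "The bus in the image is white and red."
            },
            ...
        ]
    },
    {
        "id": 1,
        "video": "videos/xxx.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nWhat are the main activities that take place in the video?"
            },
            {
                "from": "gpt",
                "value": "The main activities in the video are a man preparing camera equipment, a group of men riding a helicopter, and a man sailing a boat through the water."
            },
            ...
        ]
    },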
    ...
]
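
A light-weight check of `custom.json` can catch formatting slips before training starts. The sketch below encodes rules inferred from the example above, and they are assumptions rather than documented requirements: turns alternate between `human` and `gpt`, and the first turn starts with `<video>` for files under `videos/` and `<image>` otherwise. `validate_entry` and `validate_file` are hypothetical helper names.

```python
import json


def validate_entry(entry):
    """Return a list of problems found in one SFT record (empty if none)."""
    problems = []
    for key in ("id", "video", "conversations"):
        if key not in entry:
            problems.append(f"missing key: {key}")

    turns = entry.get("conversations", [])
    if not turns:
        problems.append("no conversation turns")
        return problems

    # Assumed rule: speakers alternate, starting with the human turn.
    for index, turn in enumerate(turns):
        expected = "human" if index % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {index}: expected speaker {expected!r}")

    # Assumed rule: the modality tag in the first turn matches the media folder.
    media_path = str(entry.get("video", ""))
    tag = "<video>" if media_path.startswith("videos/") else "<image>"
    if not turns[0].get("value", "").startswith(tag):
        problems.append(f"first turn should start with {tag}")
    return problems


def validate_file(path):
    """Map each record's id (or list position) to its non-empty problem list."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    report = {}
    for position, entry in enumerate(records):
        problems = validate_entry(entry)
        if problems:
            report[entry.get("id", position)] = problems
    return report
```

Calling `validate_file("datasets/custom_sft/custom.json")` returns an empty dictionary when every record passes these checks.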

Modify scripts/custom/finetune.sh:

...
--data_path datasets/custom_sft/custom.json
--data_folder datasets/custom_sft/
--pretrain_mm_mlp_adapter CONNECTOR_DOWNLOAD_PATH (e.g., DAMO-NLP-SG/VideoLLaMA2-7B-Base)
...

## 🤖 Inference

Video/Image inference:

import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

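    # Video inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is lying on the floor while the baby chick hops around it. The two animals interact playfully, and the video has a cute and heartwarming feel.

    # Image inference
    # Note: these assignments override the video settings above; keep only the block you need.
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses and is walking down a rain-soaked city street. The image feels vibrant and lively, with bright city lights reflecting off the wet pavement to create a visually appealing atmosphere. Her presence adds a sense of style and confidence to the scene as she navigates the bustling urban environment.

    model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'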
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()
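
To run several video and image prompts against a single loaded model, the same calls can be wrapped in a loop. This is a minimal sketch under stated assumptions: `guess_modal` and `run_batch` are hypothetical helpers, the extension sets are illustrative rather than an exhaustive list of supported formats, and `model_init` / `mm_infer` are called exactly as in the script above.

```python
from pathlib import Path

# Illustrative extension sets (an assumption, not a list from the repository).
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv"}


def guess_modal(path):
    """Pick 'image' or 'video' from the file extension."""
    ext = Path(path).suffix.lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in VIDEO_EXTS:
        return "video"
    raise ValueError(f"cannot infer modality from extension: {ext!r}")


def run_batch(requests, model_path="DAMO-NLP-SG/VideoLLaMA2-7B"):
    """Answer each (media_path, instruction) pair with one loaded model."""
    # Imported lazily so guess_modal works without the package installed.
    from videollama2 import model_init, mm_infer
    from videollama2.utils import disable_torch_init

    disable_torch_init()
    model, processor, tokenizer = model_init(model_path)

    outputs = []
    for media_path, instruct in requests:
        modal = guess_modal(media_path)
        outputs.append(
            mm_infer(processor[modal](media_path), instruct, model=model,
                     tokenizer=tokenizer, do_sample=False, modal=modal)
        )
    return outputs
```

For example, `run_batch([("assets/cat_and_chicken.mp4", "What animals are in the video?"), ("assets/sora.png", "What is the woman wearing?")])` loads the checkpoint once and returns one reply per request.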

## 📑 Citation

If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}