LLaVAR

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun


@misc{zhang2023llavar,
    title={LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding},
    author={Yanzhe Zhang and Ruiyi Zhang and Jiuxiang Gu and Yufan Zhou and Nedim Lipka and Diyi Yang and Tong Sun},
    year={2023},
    eprint={2306.17107},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

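[Update 08/01] Check out the ready-to-use model checkpoint and finetuning dataset provided by the community on Huggingface!

[Update 07/21] Released the metadata of the LAION images we used: pretraining/finetuning.

[Update 07/12] Released the OCR evaluation results/script on the MME benchmark. LLaVAR increases the OCR score of LLaVA from 50 to 80.

[Update 07/05] Data is available on Huggingface.

[Update 07/05] The model weight delta is released on Huggingface.

[Update 06/29] Initial release.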

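The main difference between our code and LLaVA's code is that we modified the training/testing/serving files to support Vicuna v1.1, which uses '</s>' as the separator instead of '###'.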
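
To make the separator difference concrete, here is a minimal sketch of how a two-turn conversation might be flattened under each style. It is a simplified illustration, not the code in LLaVA's conversation module: the function name `build_prompt` is hypothetical, the system prompt is omitted, and the role names ("Human"/"Assistant" versus "USER"/"ASSISTANT") are assumptions based on the usual Vicuna templates.

```python
def build_prompt(turns, style):
    """Flatten alternating user/assistant messages into one prompt string.

    turns: list of strings, starting with the user message.
    style: "v0" (every message ends with '###') or
           "v1.1" (user message ends with a space, assistant message with '</s>').
    """
    if style == "v0":
        roles, seps = ("Human", "Assistant"), ("###", "###")
    elif style == "v1.1":
        roles, seps = ("USER", "ASSISTANT"), (" ", "</s>")
    else:
        raise ValueError(f"unknown style: {style}")

    prompt = ""
    for i, text in enumerate(turns):
        # Even positions are user messages, odd positions are assistant messages.
        prompt += f"{roles[i % 2]}: {text}{seps[i % 2]}"
    return prompt


turns = ["What does the sign say?", "It says OPEN."]
print(build_prompt(turns, "v0"))
print(build_prompt(turns, "v1.1"))
```

Because '</s>' closes each assistant reply in the v1.1 style, splitting on it recovers complete user/assistant rounds, which is why the training, testing, and serving code all have to agree on the separator.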

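Environment Setup

Please follow LLaVA's instructions to prepare the environment and merge the model weights.

Model weight delta: Google Drive, Huggingface

The delta should be merged with LLaMA-13B.

After merging, add "v1" to the folder name and make sure the "llava_v1" conversation mode is used.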
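
Conceptually, a weight delta stores the difference between the released model and its base, so merging adds the two back together. The toy sketch below shows only that arithmetic on plain Python lists; the real merge should be done with LLaVA's own tooling. The function name `apply_delta`, the parameter names, and the rule that parameters missing from the base are copied unchanged are assumptions for illustration.

```python
def apply_delta(base, delta):
    """Return merged weights, assuming delta = target - base per parameter.

    base, delta: dicts mapping a parameter name to a flat list of floats.
    Parameters that exist only in `delta` (for example a newly added
    projection layer) have no base counterpart and are copied as-is.
    """
    target = {}
    for name, delta_values in delta.items():
        if name in base:
            base_values = base[name]
            if len(base_values) != len(delta_values):
                raise ValueError(f"shape mismatch for {name}")
            target[name] = [b + d for b, d in zip(base_values, delta_values)]
        else:
            target[name] = list(delta_values)
    return target


# Hypothetical parameter names, used only to exercise the function.
base = {"layer.weight": [1.0, 2.0]}
delta = {"layer.weight": [0.5, -1.0], "projector.weight": [3.0]}
print(apply_delta(base, delta))
```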

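Training Data (Huggingface)

Our image data has already been converted to the LLaVA pretraining/finetuning format (the images have "fake" file names that follow the CC3M and COCO naming formats). You can download them and merge them into the LLaVA training sets.

Our instructions, on the other hand, already include LLaVA's instructions.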
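
After merging the image folders, it can help to confirm that every image referenced by an instruction file is actually present. The sketch below assumes the LLaVA-style layout in which the instruction file is a JSON list of samples, each with an optional "image" field holding a path relative to the image folder. The function name `find_missing_images` is hypothetical.

```python
import json
import os


def find_missing_images(instruction_path, image_folder):
    """List image names referenced in the instruction file but absent on disk."""
    with open(instruction_path, encoding="utf-8") as f:
        samples = json.load(f)

    missing = set()
    for sample in samples:
        image = sample.get("image")  # text-only samples may have no image
        if image is None:
            continue
        if not os.path.exists(os.path.join(image_folder, image)):
            missing.add(image)
    return sorted(missing)
```

A call such as `find_missing_images("/path/to/chat_llavar.json", "/path/to/cc3m")` should return an empty list once the merge is complete.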

Pretraining images: Google Drive

Pretraining instructions (595K + 422K): Google Drive

Finetuning images: Google Drive

Finetuning instructions (158K + 16K): Google Drive

Finetuning instructions (158K + 20K): Google Drive

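Evaluation Data (Huggingface)

We collected 50 instruction-following questions and answers based on text-rich images from LAION, which can be used for GPT-4-based instruction-following evaluation.

Evaluation images: Google Drive

GPT-4 evaluation contexts (595K + 422K): File

GPT-4 evaluation rules: File

Questions: File

GPT-4 answers: File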

Training Script

You should merge our pretraining images into the cc3m folder.

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    /path/to/LLaVA/llava/train/train_mem.py \
    --model_name_or_path /path/to/models/vicuna_13b_v1_1 \
    --data_path /path/to/chat_llavar.json \
    --image_folder /path/to/cc3m \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir /path/to/checkpoint \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 4000 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --image_aspect_ratio 'pad' \
    --report_to wandb

You should merge our finetuning images into the coco2017 folder.

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    /path/to/LLaVA/llava/train/train_mem.py \
    --model_name_or_path /path/to/models/vicuna_13b_v1_1 \
    --data_path /path/to/llava_instruct_150k_llavar_16k.json \
    --image_folder /path/to/coco/images/train2017 \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter /path/to/mm_proj/llava-13b-pretrain.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir /path/to/checkpoint \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 8000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --image_aspect_ratio 'pad' \
    --report_to wandb
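
The two commands above imply different effective batch sizes, which is useful to know when picking `--save_steps` or adapting the setup to fewer GPUs. The sketch below works through the arithmetic. The sample counts are the rounded "K" figures from the data section, so the step counts are approximate, and the trainer may round partial batches differently. The helper names are hypothetical.

```python
import math


def effective_batch_size(num_gpus, per_device_batch, grad_accum_steps):
    """Number of samples contributing to one optimizer step."""
    return num_gpus * per_device_batch * grad_accum_steps


def steps_per_epoch(num_samples, batch_size):
    """Approximate optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


# Pretraining: 8 GPUs x batch 8 x 2 accumulation steps.
pretrain_batch = effective_batch_size(8, 8, 2)
# Finetuning: 8 GPUs x batch 4 x 1 accumulation step.
finetune_batch = effective_batch_size(8, 4, 1)

# Rounded dataset sizes taken from the data section (595K + 422K, 158K + 16K).
pretrain_steps = steps_per_epoch(595_000 + 422_000, pretrain_batch)
finetune_steps = steps_per_epoch(158_000 + 16_000, finetune_batch)

print(pretrain_batch, pretrain_steps)
print(finetune_batch, finetune_steps)
```

When running on fewer GPUs, raising `--gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.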

Evaluation Script

Instruction-following on COCO images.

python /path/to/LLaVA/llava/eval/model_vqa.py \
    --model-name /path/to/checkpoint \
    --question-file \
    /path/to/LLaVA/playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
    --image-folder \
    /path/to/coco2014/val2014 \
    --answers-file \
    /path/to/qa90-answer-file.jsonl \
    --conv-mode "llava_v1"
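
Before passing the answers file to a later scoring step, a quick completeness check can catch a run that stopped early. The sketch below assumes both files are JSON Lines and that each record carries a `question_id` field, as in LLaVA's question files. The helper names are hypothetical.

```python
import json


def read_jsonl(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def unanswered_questions(question_file, answers_file, key="question_id"):
    """Return the ids of questions that have no matching answer record."""
    answered = {record[key] for record in read_jsonl(answers_file)}
    return [q[key] for q in read_jsonl(question_file) if q[key] not in answered]
```

An empty list from `unanswered_questions(...)` means every question in the question file received an answer.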

Instruction-following on a given image URL.

python -m llava.eval.run_llava \
    --model-name /path/to/checkpoint \
    --image-file "https://cdn.shopify.com/s/files/1/0057/3728/3618/products/a-man-called-otto_ezrjr0pm_480x.progressive.jpg" \
    --query "Who starred in the movie?"

For text-based VQA (from MultimodalOCR): after cloning their repo and preparing the data, you can put ./MultimodalOCR/Eval_LLaVAR.py in /your/path/to/MultimodalOCR/models/LLaVA/ and add our model to /your/path/to/MultimodalOCR/eval.py for evaluation.
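
Free-form model responses rarely match a short ground-truth answer exactly, so text-based VQA is often scored by checking whether the ground truth appears inside the response. The sketch below illustrates that containment-style accuracy. It is an assumption about the kind of metric involved, not a copy of the MultimodalOCR scoring code, whose normalization may differ. The function names are hypothetical.

```python
def is_correct(response, ground_truths):
    """True if any ground-truth answer occurs in the response, ignoring case."""
    response = response.lower()
    return any(gt.lower() in response for gt in ground_truths)


def containment_accuracy(examples):
    """examples: list of (response, list_of_ground_truths) pairs."""
    if not examples:
        return 0.0
    hits = sum(is_correct(response, gts) for response, gts in examples)
    return hits / len(examples)


examples = [
    ("The poster title reads A Man Called Otto.", ["a man called otto"]),
    ("I cannot read the text in this image.", ["otto"]),
]
print(containment_accuracy(examples))
```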

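Acknowledgement

The code base is mainly from the LLaVA project. Our evaluation is also built on the MultimodalOCR project.

For a better language decoder, you can also keep an eye on the recent Vicuna model updates.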

@article{liu2023llava,
    author      = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
    title       = {Visual Instruction Tuning},
    publisher   = {arXiv:2304.08485},
    year        = {2023}
}

@misc{liu2023hidden,
    title={On the Hidden Mystery of OCR in Large Multimodal Models},
    author={Yuliang Liu and Zhang Li and Hongliang Li and Wenwen Yu and Yang Liu and Biao Yang and Mingxin Huang and Dezhi Peng and Mingyu Liu and Mingrui Chen and Chunyuan Li and Xucheng Yin and Cheng-lin Liu and Lianwen Jin and Xiang Bai},
    year={2023},
    eprint={2305.07895},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

@misc{vicuna2023,
    title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\%* ChatGPT Quality},
    url = {https://lmsys.org/blog/2023-03-30-vicuna/},
    author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
    month = {March},
    year = {2023}
}