<center><img src="https://yellow-cdn.veclightyear.com/0a4dffa0/744c3f7f-c925-468b-8d92-bf5083cc8592.png" style="width: 5%"> GraphGPT: Graph Instruction Tuning for Large Language Models</center>

<div align='center'> <a href='https://tjb-tech.github.io/'>Jiabin Tang</a>, <a href='http://yuh-yang.github.io'>Yuhao Yang</a>, <a href='#'>Wei Wei</a>, <a href='#'>Lei Shi</a>, <a href='#'>Suqi Cheng</a>, <a href='https://www.yindawei.com/'>Dawei Yin</a> and <a href='https://sites.google.com/view/chaoh/home'>Chao Huang*</a>. (*Corresponding author)

<strong><a href='https://sites.google.com/view/chaoh/home'>Data Intelligence Lab</a>@<a href='https://www.hku.hk/'>University of Hong Kong</a></strong>, Baidu Inc.

This repository hosts the code, data, and model weights of GraphGPT (SIGIR'24 full paper).

</div>

🎉 News

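[2024.03.26]🎯🎯📢📢 GraphGPT has been accepted to the SIGIR'24 full paper track (20.1% acceptance rate)! Congratulations to everyone on the GraphGPT team! 🎉🎉🎉
[2023.12.26]🎯🎯📢📢 We have updated the efficient and lightweight training code. With the updated scripts, two-stage instruction tuning can be run on two Nvidia 3090 GPUs (24 GB each). The deployment and fine-tuning steps are as follows: 🎄🎄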

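0. Environment update:

Lightweight training requires PyTorch 2.1+, so the corresponding libraries need to be updated:

# if you have already set up the environment for GraphGPT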
pip uninstall torch
pip uninstall torchvision
pip uninstall torchaudio
# CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# update pyg for PyTorch 2.1+
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html

# install lightning
pip install lightning
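
After reinstalling, it can help to confirm that the environment really meets the PyTorch 2.1+ requirement. The snippet below is a minimal sketch, not part of the GraphGPT repository: the helper names `parse_version`, `meets_minimum`, and `report` are made up for illustration, and it only reads installed package metadata through the standard library.

```python
from importlib import metadata
from itertools import takewhile


def parse_version(version):
    """Turn '2.1.0+cu118' into (2, 1, 0), ignoring the local build tag."""
    public = version.split("+", 1)[0]
    numbers = []
    for piece in public.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def meets_minimum(version, minimum=(2, 1)):
    """Return True when the version is at least the given (major, minor)."""
    return parse_version(version)[: len(minimum)] >= minimum


def report(packages=("torch", "torch_geometric", "lightning")):
    """Map each package name to its installed version, or None if absent."""
    status = {}
    for name in packages:
        try:
            status[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = None
    return status


if __name__ == "__main__":
    versions = report()
    for name, version in versions.items():
        print(name, version or "not installed")
    if versions["torch"] is not None:
        print("PyTorch 2.1+:", meets_minimum(versions["torch"]))
```

A `False` on the last line usually means the old `torch` wheel is still installed and the uninstall step should be repeated.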

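1. Update the graph data

Because of compatibility issues, if you are using the previously released graph data, we recommend downloading the updated version from the provided link: updated graph data.

2. Run the scripts

You can run the scripts as follows:

Stage 1: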

cd path/to/GraphGPT
sh ./scripts/tune_script/graphgpt_stage1.sh

Stage 2:

cd path/to/GraphGPT
sh ./scripts/tune_script/graphgpt_stage2.sh
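
If you prefer to launch both stages from Python, a small wrapper along these lines may be convenient. This is only a sketch under the assumption that the two scripts already contain the right paths; `stage_commands` and `run_stages` are hypothetical helpers, not functions shipped with GraphGPT.

```python
import subprocess
from pathlib import Path

# Script paths as given in this README, relative to the repository root.
STAGE_SCRIPTS = (
    "./scripts/tune_script/graphgpt_stage1.sh",
    "./scripts/tune_script/graphgpt_stage2.sh",
)


def stage_commands():
    """Build one `sh <script>` command per tuning stage, in order."""
    return [["sh", script] for script in STAGE_SCRIPTS]


def run_stages(repo_dir):
    """Run stage 1, then stage 2, from inside the repository directory."""
    repo = Path(repo_dir)
    for command in stage_commands():
        # check=True raises CalledProcessError on a non-zero exit code,
        # so stage 2 is never started after a failed stage 1.
        subprocess.run(command, cwd=repo, check=True)


if __name__ == "__main__":
    run_stages("path/to/GraphGPT")
```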

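[2023.12.14]📢📢 Thanks to the research community for its support. We have collected frequently asked questions (FAQs) about running the code and setting up the environment in the FAQ list below. Please take a look. Merry Christmas! 🎄🎄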

<details> <summary> <b>FAQ</b> </summary>

- If 'pretrain_graph_model_path' is undefined, please refer to issue [#7](https://github.com/HKUDS/GraphGPT/issues/7).
- If you run into problems with flash attention, simply comment out `replace_llama_attn_with_flash_attn()` on line 8 of https://github.com/HKUDS/GraphGPT/blob/main/graphgpt/train/train_mem.py. For more details, please refer to issue [#17](https://github.com/HKUDS/GraphGPT/issues/17).
- If you encounter package conflicts or environment setup problems (especially with fastchat), please refer to issue [#9](https://github.com/HKUDS/GraphGPT/issues/9) and issue [#11](https://github.com/HKUDS/GraphGPT/issues/11).
- If you hit the error `No module named 'graphgpt'`, please refer to issue [#56](https://github.com/HKUDS/GraphGPT/issues/56).

</details>

🎯🎯📢📢 We have made significant updates to the models and data used in GraphGPT on 🤗 Huggingface. We highly recommend referring to the table below for details:

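| 🤗 Huggingface Address | 🎯 Description |
| --- | --- |
| huggingface.co/Jiabin99/GraphGPT-7B-mix-all | The GraphGPT checkpoint based on Vicuna-7B-v1.5, tuned on the instruction data Arxiv-PubMed-mix-NC-LP |
| huggingface.co/Jiabin99/Arxiv-PubMed-GraphCLIP-GT | The checkpoint of the graph transformer (GT) pre-trained on Arxiv and PubMed using text-graph grounding |
| huggingface.co/datasets/Jiabin99/Arxiv-PubMed-mix-NC-LP | The mixed instruction dataset for node classification (NC) and link prediction (LP) on Arxiv and PubMed |
| huggingface.co/datasets/Jiabin99/GraphGPT-eval-instruction | All instruction datasets released for evaluation |
| huggingface.co/datasets/Jiabin99/All_pyg_graph_data | All of the graph data we use, merged together |
| huggingface.co/datasets/Jiabin99/graph-matching | The instruction data used in the graph matching stage |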
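
The resources listed above can also be fetched programmatically. The following is a hedged sketch that assumes the `huggingface_hub` package is installed (`pip install huggingface_hub`); the repository IDs are taken from the list above, while the `./hf_downloads` target directory and the helper names are arbitrary choices for illustration.

```python
# (repo_id, repo_type) pairs; addresses under huggingface.co/datasets/ are datasets.
RESOURCES = [
    ("Jiabin99/GraphGPT-7B-mix-all", "model"),
    ("Jiabin99/Arxiv-PubMed-GraphCLIP-GT", "model"),
    ("Jiabin99/Arxiv-PubMed-mix-NC-LP", "dataset"),
    ("Jiabin99/GraphGPT-eval-instruction", "dataset"),
    ("Jiabin99/All_pyg_graph_data", "dataset"),
    ("Jiabin99/graph-matching", "dataset"),
]


def local_dir_for(repo_id, root="./hf_downloads"):
    """Place each repository in its own folder named after the repo."""
    return f"{root.rstrip('/')}/{repo_id.split('/', 1)[1]}"


def download_all(root="./hf_downloads"):
    """Download every listed repository into `root` (hypothetical layout)."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id, repo_type in RESOURCES:
        snapshot_download(
            repo_id=repo_id,
            repo_type=repo_type,
            local_dir=local_dir_for(repo_id, root),
        )


if __name__ == "__main__":
    download_all()
```

The 7B checkpoint is large, so you may want to trim `RESOURCES` to the entries you actually need before calling `download_all`.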

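[2023.10.28]📢📢 For a Chinese-language explanation, please refer to this article.
[2023.10.26]🔥🔥 Released the instruction data we use.
[2023.10.26]🔥🔥 Released the checkpoints of GraphGPT and the pre-trained graph encoder.
[2023.10.23] 🚀🚀 The full paper of GraphGPT is available at https://arxiv.org/abs/2310.13023. Please check it out and give us more feedback!
[2023.10.15] 🚀🚀 Released the code of GraphGPT.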

👉 TODO

Explore the potential of GraphGPT on more graph learning tasks.
...

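Brief Introduction

We present the GraphGPT framework, which aligns LLMs with graph structural knowledge through a graph instruction tuning paradigm.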

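Structural Information Encoding with Text-Graph Grounding. To enhance the understanding of graph structural information by large language models, our framework emphasizes aligning the encoding of graph structures with the natural language space. This alignment aims to let language models comprehend and interpret the structural elements of a graph by drawing on their inherent language understanding capabilities. To this end, we introduce a text-graph grounding paradigm that generates prompts designed to preserve the graph's structural context for language models. This paradigm acts as a bridge between the semantic understanding of textual information and the structural relationships inherent in the graph.
Dual-Stage Graph Instruction Tuning. The dual-stage graph instruction tuning paradigm proposed in this work builds on instruction tuning, a concept recently introduced to improve the adaptability of language models to specific domains. In this paradigm, we aim to align the model's language capacity with the nuances of graph learning tasks, so that the language model generates more accurate and contextually appropriate responses for graph-structured data.
Chain-of-Thought (CoT) Distillation. When faced with diverse graph data, language models may encounter new or unfamiliar patterns and structures. Such distribution shift can make it hard to generate accurate and coherent responses, especially when the number of node classes varies across different types of graph data. To address this challenge and improve accuracy under distribution shift, it is essential to equip GraphGPT with step-by-step reasoning abilities. We therefore propose using the Chain-of-Thought (CoT) technique [47], which explicitly models the flow of thoughts and reasoning steps. By incorporating CoT, our language model improves the coherence and consistency of the generated text. It enables the model to follow a logical progression of ideas, strengthening its ability to understand and reason about the given graph data.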
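
To make the idea of aligning graph encodings with the natural language space more concrete, here is a small illustrative sketch. This README does not spell out the grounding objective, so the code assumes a CLIP-style symmetric contrastive loss, a common way to align two embedding spaces. It is not GraphGPT's actual implementation; the function names and the temperature value 0.07 are assumptions chosen for illustration.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the target index."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


def symmetric_contrastive_loss(graph_emb, text_emb, temperature=0.07):
    """Average of graph-to-text and text-to-graph contrastive losses.

    Row i of graph_emb and row i of text_emb are treated as a matching
    pair; every other combination in the batch is a negative.
    """
    g = [normalize(v) for v in graph_emb]
    t = [normalize(v) for v in text_emb]
    n = len(g)
    sim = [
        [sum(a * b for a, b in zip(g[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    graph_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    text_to_graph = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (graph_to_text + text_to_graph) / 2


if __name__ == "__main__":
    nodes = [[1.0, 0.0], [0.0, 1.0]]
    matching_text = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned:", symmetric_contrastive_loss(nodes, matching_text))
    print("misaligned:", symmetric_contrastive_loss(nodes, shuffled_text))
```

Under this objective the loss is small when each node embedding sits closest to the embedding of its own text and large otherwise, which is the behavior one would want from a grounding step.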

For more technical details, please refer to our paper and project website.

Getting Started

<a href='#代码结构'>1. Code Structure</a>
<a href='#环境准备'>2. Environment Preparation</a>
<a href='#训练GraphGPT'>3. Training GraphGPT</a>
- <a href='#准备预训练检查点'>3.1. Prepare Pre-trained Checkpoints</a>
- <a href='#自监督指令微调'>3.2. Self-Supervised Instruction Tuning</a>
- <a href='#提取训练好的投影器'>3.3. Extract the Trained Projector</a>
- <a href='#任务特定指令微调'>3.4. Task-Specific Instruction Tuning</a>
<a href='#评估GraphGPT'>4. Evaluating GraphGPT</a>
- <a href='#准备检查点和数据'>4.1. Prepare Checkpoints and Data</a>
- <a href='#运行评估'>4.2. Run Evaluation</a>

1. Code Structure <a href='#all_catelogue'>[Back to Top]</a>

.
├── README.md
├── assets
│   ├── demo_narrow.gif
│   ├── screenshot_cli.png
│   ├── screenshot_gui.png
│   ├── server_arch.png
│   └── vicuna_logo.jpeg
├── format.sh
├── graphgpt
│   ├── __init__.py
│   ├── constants.py
│   ├── conversation.py
│   ├── eval
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   ├── run_graphgpt.py
│   │   ├── run_graphgpt_LP.py
│   │   ├── run_vicuna.py
│   │   └── script
│   │       └── run_model_qa.yaml
│   ├── model
│   │   ├── GraphLlama.py
│   │   ├── __init__.py
│   │   ├── apply_delta.py
│   │   ├── apply_lora.py
│   │   ├── builder.py
│   │   ├── compression.py
│   │   ├── convert_fp16.py
│   │   ├── graph_layers
│   │   │   ├── __init__.py
│   │   │   ├── bpe_simple_vocab_16e6.txt.gz
│   │   │   ├── clip_graph.py
│   │   │   ├── graph_transformer.py
│   │   │   ├── mpnn.py
│   │   │   └── simple_tokenizer.py
│   │   ├── make_delta.py
│   │   ├── model_adapter.py
│   │   ├── model_registry.py
│   │   ├── monkey_patch_non_inplace.py
│   │   └── utils.py
│   ├── protocol
│   │   └── openai_api_protocol.py
│   ├── serve
│   │   ├── __init__.py
│   │   ├── api_provider.py
│   │   ├── bard_worker.py
│   │   ├── cacheflow_worker.py
│   │   ├── cli.py
│   │   ├── controller.py
│   │   ├── gateway
│   │   │   ├── README.md
│   │   │   └── nginx.conf
│   │   ├── gradio_block_arena_anony.py
│   │   ├── gradio_block_arena_named.py
│   │   ├── gradio_css.py
│   │   ├── gradio_patch.py
│   │   ├── gradio_web_server.py
│   │   ├── gradio_web_server_multi.py
│   │   ├── huggingface_api.py
│   │   ├── inference.py
│   │   ├── model_worker.py
│   │   ├── monitor
│   │   │   ├── basic_stats.py
│   │   │   ├── clean_battle_data.py
│   │   │   ├── elo_analysis.py
│   │   │   ├── hf_space_leaderboard_app.py
│   │   │   └── monitor.py
│   │   ├── openai_api_server.py
│   │   ├── register_worker.py
│   │   ├── test_message.py
│   │   └── test_throughput.py
│   ├── train
│   │   ├── graphchat_trainer.py
│   │   ├── llama_flash_attn_monkey_patch.py
│   │   ├── train_graph.py
│   │   ├── train_lora.py
│   │   └── train_mem.py
│   └── utils.py
├── playground
│   ├── inspect_conv.py
│   ├── test_embedding
│   │   ├── README.md
│   │   ├── test_classification.py
│   │   ├── test_semantic_search.py
│   │   └── test_sentence_similarity.py
│   └── test_openai_api
│       ├── anthropic_api.py
│       └── openai_api.py
├── pyproject.toml
├── scripts
│   ├── eval_script
│   │   └── graphgpt_eval.sh
│   ├── extract_graph_projector.py
│   ├── serving
│   │   ├── controller.yaml
│   │   └── model_worker.yaml
│   └── tune_script
│       ├── extract_projector.sh
│       ├── graphgpt_stage1.sh
│       └── graphgpt_stage2.sh
└── tests
    ├── test_openai_curl.sh
    ├── test_openai_langchain.py
    └── test_openai_sdk.py

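2. Environment Preparation <a href='#all_catelogue'>[Back to Top]</a>

Please first clone the repository and install the required environment by running the following commands: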

conda create -n graphgpt python=3.8

conda activate graphgpt

# Install Torch with CUDA 11.7
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
# Support for the Vicuna base model
pip3 install "fschat[model_worker,webui]"
# Install pyg and pyg-related packages
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
# Clone our GraphGPT
git clone https://github.com/HKUDS/GraphGPT.git
cd GraphGPT
# Install the required libraries
pip install -r requirements.txt

<span id='训练GraphGPT'/> 3. Training GraphGPT <a href='#all_catelogue'>[Back to Top]</a>

The tuning paradigm of GraphGPT consists of two stages: (1) self-supervised instruction tuning; (2) task-specific instruction tuning.

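3.1. Prepare Pre-trained Checkpoints <a href='#all_catelogue'>[Back to Top]</a>

GraphGPT is trained on top of the following excellent existing models. Please follow the instructions to prepare the checkpoints.

Vicuna: Prepare our base model Vicuna, an instruction-tuned chatbot that serves as the base model in our implementation. Please download its weights here. We generally use the 7B-parameter v1.1 and v1.5 models.
Graph Encoder: Used to encode graph structures. We employ a text-graph grounding approach to obtain the pre-trained graph transformer model, which you can download via graph transformer and place at [./GraphGPT]. We also provide the source code and a Cora data example for text-graph grounding at [./text-graph-grounding] for your reference.
Graph Data: A combination of all the pyg graph data used, containing node features, edge indices, and so on. You can download it via all_graph_data.pt and place it at [./GraphGPT/graph_data].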
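
Before launching training, a quick check that the downloaded files are where the scripts expect them can save a failed run. The sketch below is an illustration only. It assumes the graph transformer checkpoint is a directory named `clip_gt_arxiv` (the value used for `pretra_gnn` in the example scripts) placed at the repository root, and `missing_paths` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# Paths relative to the GraphGPT repository root.
# "clip_gt_arxiv" as a directory name is an assumption based on the
# `pretra_gnn=clip_gt_arxiv` setting in the example scripts.
EXPECTED = ("clip_gt_arxiv", "graph_data/all_graph_data.pt")


def missing_paths(repo_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist yet."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory that contains only the graph data.
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "graph_data"
        data_dir.mkdir()
        (data_dir / "all_graph_data.pt").touch()
        print("still missing:", missing_paths(tmp))
```

In a real checkout you would call `missing_paths(".")` from the repository root and download whatever it reports.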

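3.2. Self-Supervised Instruction Tuning <a href='#all_catelogue'>[Back to Top]</a>

Prepare data: Please download graph_matching.json, our instruction tuning data for the graph matching task.
Start tuning: After completing the steps above, you can start the first-stage tuning by filling in the blanks in graphgpt_stage1.sh. Here is an example:

# fill in the following paths to run the first stage of GraphGPT!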
model_path=../vicuna-7b-v1.5-16k
instruct_ds=./data/stage_1/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv
output_model=./checkpoints/stage_1

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
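
When adapting the stage 1 command to a different number of GPUs, it is useful to know how the flags combine. The sketch below works through the arithmetic for the values shown above (`--nproc_per_node=4`, `--per_device_train_batch_size 2`, `--gradient_accumulation_steps 1`, `--num_train_epochs 3`, `--warmup_ratio 0.03`). The dataset size of 10,000 samples is a made-up number for illustration, and the step counts are rough estimates rather than exact trainer behavior.

```python
import math


def effective_batch_size(num_gpus, per_device_batch, grad_accum):
    """Samples consumed per optimizer step across all processes."""
    return num_gpus * per_device_batch * grad_accum


def total_optimizer_steps(num_samples, batch, epochs):
    """Approximate number of optimizer steps over the whole run."""
    return math.ceil(num_samples / batch) * epochs


def warmup_steps(total_steps, warmup_ratio):
    """Number of warmup steps implied by a warmup ratio."""
    return math.ceil(total_steps * warmup_ratio)


if __name__ == "__main__":
    batch = effective_batch_size(num_gpus=4, per_device_batch=2, grad_accum=1)
    steps = total_optimizer_steps(num_samples=10_000, batch=batch, epochs=3)
    print("effective batch size:", batch)
    print("optimizer steps:", steps)
    print("warmup steps:", warmup_steps(steps, 0.03))
```

If you halve the GPU count, doubling `--gradient_accumulation_steps` keeps the effective batch size unchanged.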

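3.3. Extract the Trained Projector <a href='#all_catelogue'>[Back to Top]</a>

We can extract the projector trained in stage 1 by filling in the blanks in extract_projector.sh. Here is an example:

# fill in the following paths to extract the projector from the first tuning stage!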
src_model=./checkpoints/stage_1
output_proj=./checkpoints/stage_1_projector/stage_1_projector.bin

python3.8 ./scripts/extract_graph_projector.py \
  --model_name_or_path ${src_model} \
  --output ${output_proj}
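
Conceptually, extracting a projector means keeping only the projector's tensors from the full stage 1 checkpoint. The sketch below illustrates that idea on a plain dictionary. The `graph_projector` key substring and the parameter names in the example are assumptions made for illustration; consult `scripts/extract_graph_projector.py` for what the real script matches.

```python
def select_projector_weights(state_dict, keyword="graph_projector"):
    """Keep only the entries whose parameter name contains `keyword`."""
    return {name: value for name, value in state_dict.items() if keyword in name}


if __name__ == "__main__":
    # Stand-in for a checkpoint's state dict; values would be tensors in practice.
    fake_state_dict = {
        "model.layers.0.self_attn.q_proj.weight": "llm weight",
        "model.graph_projector.weight": "projector weight",
        "model.graph_projector.bias": "projector bias",
        "lm_head.weight": "llm head",
    }
    projector = select_projector_weights(fake_state_dict)
    print(sorted(projector))
```

The resulting dictionary is much smaller than the full checkpoint, which is why stage 2 can load it through `--pretrain_graph_mlp_adapter` on top of the base model.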

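3.4. Task-Specific Instruction Tuning <a href='#all_catelogue'>[Back to Top]</a>

Prepare data: Our task-specific instruction data can take many forms, e.g., standard or CoT (Chain-of-Thought) node classification, link prediction, or mixed data for multi-task learning. Please refer to task_specific.
Start tuning: After completing the steps above, you can start the second-stage tuning by filling in the blanks in graphgpt_stage2.sh. Here is an example:

# fill in the following paths to run the second stage of GraphGPT!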
model_path=../vicuna-7b-v1.5-16k
instruct_ds=./data/stage_2/data_all_mix.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv
tuned_proj=./checkpoints/stage_1_projector/stage_1_projector.bin
output_model=./checkpoints/stage_2
wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --pretrain_graph_mlp_adapter ${tuned_proj} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end True \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
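
The stage 2 command is nearly identical to the stage 1 command, so it is easy to miss what actually changes. The sketch below records a subset of the flags from the two example scripts as dictionaries and prints the differences; `changed_settings` is a hypothetical helper, and flags not listed here should be compared against the scripts themselves.

```python
# A subset of the flags from the two example scripts in this README.
STAGE_1 = {
    "data_path": "./data/stage_1/graph_matching.json",
    "pretrain_graph_mlp_adapter": None,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "save_steps": 2400,
    "learning_rate": 2e-3,
    "model_max_length": 2048,
    "dataloader_num_workers": None,
}
STAGE_2 = {
    "data_path": "./data/stage_2/data_all_mix.json",
    "pretrain_graph_mlp_adapter": "./checkpoints/stage_1_projector/stage_1_projector.bin",
    "num_train_epochs": 2,
    "per_device_train_batch_size": 1,
    "save_steps": 50000,
    "learning_rate": 2e-5,
    "model_max_length": 2048,
    "dataloader_num_workers": 4,
}


def changed_settings(first, second):
    """Map each differing flag to its (first, second) pair of values."""
    keys = sorted(set(first) | set(second))
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }


if __name__ == "__main__":
    for flag, (stage1, stage2) in changed_settings(STAGE_1, STAGE_2).items():
        print(f"--{flag}: {stage1} -> {stage2}")
```

The most consequential changes are the learning rate, which drops from 2e-3 to 2e-5, and the projector checkpoint from stage 1 being loaded as the starting point.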

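4. Evaluating GraphGPT

4.1. Prepare Checkpoints and Data

Checkpoints: You can evaluate GraphGPT with your own model or with our released checkpoints.
Data: We split test sets for the different graph datasets and build instruction data for evaluation. Please refer to evaluating.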

4.2. Run Evaluation

You can start evaluation by filling in the blanks in graphgpt_eval.sh. Here is an example:

# fill in the following paths to run the evaluation!
output_model=./checkpoints/stage_2
datapath=./data/eval/arxiv_nc.json
graph_data_path=./graph_data/all_graph_data.pt
res_path=./output_stage_2_arxiv_nc
start_id=0
end_id=20000
num_gpus=2

python3.8 ./graphgpt/eval/run_graphgpt.py \
    --model-name ${output_model} \
    --prompting_file ${datapath} \
    --graph_data_path ${graph_data_path} \
    --output_res_path ${res_path} \
    --start_id ${start_id} \
    --end_id ${end_id} \
    --num_gpus ${num_gpus}
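
The `start_id`, `end_id`, and `num_gpus` settings suggest that the evaluation samples are divided among GPUs. The sketch below shows one plausible way such a range could be split into contiguous shards. It is an illustration only: whether `end_id` is exclusive and how `run_graphgpt.py` actually distributes the work are assumptions, and `split_range` is a hypothetical helper.

```python
def split_range(start_id, end_id, num_gpus):
    """Split [start_id, end_id) into num_gpus contiguous, near-equal shards."""
    total = end_id - start_id
    base, extra = divmod(total, num_gpus)
    shards = []
    cursor = start_id
    for rank in range(num_gpus):
        # The first `extra` shards take one additional sample each.
        size = base + (1 if rank < extra else 0)
        shards.append((cursor, cursor + size))
        cursor += size
    return shards


if __name__ == "__main__":
    for rank, (lo, hi) in enumerate(split_range(0, 20000, 2)):
        print(f"GPU {rank}: samples {lo} to {hi - 1}")
```

With the values from the example above, each of the two GPUs would handle 10,000 samples.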

Contact

For any questions or feedback, feel free to contact Jiabin Tang.

Others

</div>

Citation

If you find GraphGPT useful in your research or applications, please cite:

@article{tang2023graphgpt,
title={GraphGPT: Graph Instruction Tuning for Large Language Models}, 
author={Jiabin Tang and Yuhao Yang and Wei Wei and Lei Shi and Lixin Su and Suqi Cheng and Dawei Yin and Chao Huang},
year={2023},
eprint={2310.13023},
archivePrefix={arXiv},
primaryClass={cs.CL}
}