llama-classification

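Project overview: text classification with LLaMA

This repository is a basic codebase for text classification with the LLaMA model. It covers setting up the development environment, preparing data, running inference, and running experiments, so that you can perform classification on different text datasets.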

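Development environment

The following hardware is recommended for development:

Device: Nvidia 1xV100 GPU
Device memory: 34G
Host memory: 252G

If you need more hardware details, ask through the repository's issue tracker.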

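Usage

Experimental setup

First, obtain the checkpoint files from the official LLaMA repository and place them in the project root so that the directory structure looks like this: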

checkpoints
├── llama
│   ├── 7B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   └── tokenizer.model
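
Before launching any evaluation it can help to confirm that the layout above is in place. The following is a minimal sketch rather than part of the repository: the helper name `missing_checkpoint_files` is hypothetical, and the file list simply mirrors the tree shown above.

```python
from pathlib import Path

# Relative paths copied from the directory tree in this README.
EXPECTED = [
    "llama/7B/checklist.chk",
    "llama/7B/consolidated.00.pth",
    "llama/7B/params.json",
    "llama/tokenizer.model",
]


def missing_checkpoint_files(root="checkpoints"):
    """Return the expected checkpoint files that are not present under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files()
    print("missing files:", missing if missing else "none")
```

Run it from the project root; an empty result means all four files were found.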

Set up the Python environment. Anaconda is recommended so that the CUDA version is isolated from the one installed locally:

conda create -y -n llama-classification python=3.8
conda activate llama-classification
conda install cudatoolkit=11.7 -y -c nvidia
conda list cudatoolkit # check the installed CUDA version (11.7)
pip install -r requirements.txt

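Methods

Direct method

The direct method compares the conditional probabilities p(y|x) of each label y given the input text x.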
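
As a minimal sketch of this decision rule, assume the model has already produced per-token log-probabilities for each label continuation given the input. The `predict_direct` helper and the numbers are hypothetical illustrations, not the repository's code or real model output.

```python
def predict_direct(label_token_logprobs):
    """Pick the label y with the highest log p(y|x).

    label_token_logprobs maps each label to the log-probabilities of its
    tokens, conditioned on the input text x and the preceding label tokens.
    """
    # log p(y|x) is the sum of the label's token log-probabilities.
    scores = {label: sum(lps) for label, lps in label_token_logprobs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical values for one ag_news-style input.
example = {
    "World": [-2.3],
    "Sports": [-0.4],
    "Business": [-3.1],
    "Sci/Tech": [-1.2, -0.7],  # a label may span several tokens
}
prediction, scores = predict_direct(example)
print(prediction)
```

Summing raw log-probabilities favors labels with fewer tokens; whether the repository normalizes by length is not stated here.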

Preprocess a dataset from huggingface with the following scripts; ag_news is used as the example:

python run_preprocess_direct_ag_news.py
python run_preprocess_direct_ag_news.py --sample=False --data_path=real/inputs_direct_ag_news.json # full evaluation

Run inference with LLaMA to compute the conditional probabilities and predict the class:

torchrun --nproc_per_node 1 run_evaluate_direct_llama.py \
   --data_path samples/inputs_direct_ag_news.json \
   --output_path samples/outputs_direct_ag_news.json \
   --ckpt_dir checkpoints/llama/7B \
   --tokenizer_path checkpoints/llama/tokenizer.model

Calibration improves the performance of the direct method. Run it with the following command:

torchrun --nproc_per_node 1 run_evaluate_direct_calibrate_llama.py \
   --direct_input_path samples/inputs_direct_ag_news.json \
   --direct_output_path samples/outputs_direct_ag_news.json \
   --output_path samples/outputs_direct_calibrate_ag_news.json \
   --ckpt_dir checkpoints/llama/7B \
   --tokenizer_path checkpoints/llama/tokenizer.model
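
This README does not say which calibration scheme the script uses. One common idea, shown here purely as an assumption, is to subtract each label's log-probability under a content-free input so that labels the model favors regardless of the text are penalized. The `calibrate` helper and all numbers are hypothetical.

```python
def calibrate(label_logprobs, prior_logprobs):
    """Subtract a per-label prior estimate from the direct scores.

    label_logprobs: log p(y|x) for the real input x.
    prior_logprobs: log p(y|x_null) for a content-free input (assumed scheme).
    """
    return {
        label: logprob - prior_logprobs[label]
        for label, logprob in label_logprobs.items()
    }


# Hypothetical scores: the model prefers "World" even without any content.
raw = {"World": -1.0, "Sports": -1.2}
prior = {"World": -0.5, "Sports": -1.5}

calibrated = calibrate(raw, prior)
print(max(raw, key=raw.get), "->", max(calibrated, key=calibrated.get))
```

In this toy case the uncalibrated prediction is "World", while the calibrated one is "Sports", because the bias toward "World" has been removed.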

Channel method

The channel method compares the conditional probabilities p(x|y) of the input text x given each label y.
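
The relationship between the two decision rules can be written out with Bayes' rule. This is a standard identity, not something specific to this codebase:

```latex
\[
\hat{y}_{\text{direct}} = \arg\max_{y}\, p(y \mid x),
\qquad
\hat{y}_{\text{channel}} = \arg\max_{y}\, p(x \mid y)
\]
\[
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} \;\propto\; p(x \mid y)\, p(y)
\]
```

For an exact distribution with a uniform label prior p(y), the two rules would pick the same label. In practice both quantities are estimated by the language model from differently ordered prompts, so their predictions can differ.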

Preprocess the data with the following scripts:

python run_preprocess_channel_ag_news.py
python run_preprocess_channel_ag_news.py --sample=False --data_path=real/inputs_channel_ag_news.json # full evaluation

Run inference with LLaMA to compute the conditional probabilities and predict the class:

torchrun --nproc_per_node 1 run_evaluate_channel_llama.py \
   --data_path samples/inputs_channel_ag_news.json \
   --output_path samples/outputs_channel_ag_news.json \
   --ckpt_dir checkpoints/llama/7B \
   --tokenizer_path checkpoints/llama/tokenizer.model

Pure generation method

Evaluate in generation mode, using the data preprocessed for the direct method:

torchrun --nproc_per_node 1 run_evaluate_generate_llama.py \
   --data_path samples/inputs_direct_ag_news.json \
   --output_path samples/outputs_generate_ag_news.json \
   --ckpt_dir checkpoints/llama/7B \
   --tokenizer_path checkpoints/llama/tokenizer.model
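
Generation mode produces free text, which then has to be mapped back to a class. How the evaluation script does this is not described here. The sketch below shows one simple possibility, prefix matching against label names; the `match_label` helper and the label verbalizations are assumptions.

```python
def match_label(generated_text, labels):
    """Return the first label the generation starts with, or None if no match."""
    text = generated_text.strip().lower()
    for label in labels:
        if text.startswith(label.lower()):
            return label
    return None


# The four ag_news classes; the exact wording used in prompts is assumed.
labels = ["World", "Sports", "Business", "Sci/Tech"]

print(match_label(" Sports\nThe team won again", labels))
print(match_label("I am not sure", labels))
```

Generations that match no label return `None`, and a real evaluation would need to decide whether to count them as errors.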

Experimental results

Some example results:

Dataset	Samples	k	Method	Accuracy	Inference time
ag_news	7600	1	Direct	0.7682	00:38:40
ag_news	7600	1	Direct + calibration	0.8567	00:38:40
ag_news	7600	1	Channel	0.7825	00:38:37
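
To put the table in perspective, the short calculation below derives throughput and the calibration gain from the reported numbers. It uses only values from the table; the `to_seconds` helper is an illustrative name.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


samples = 7600
elapsed = to_seconds("00:38:40")   # direct method inference time
throughput = samples / elapsed     # samples per second

gain = 0.8567 - 0.7682             # direct + calibration vs. direct

print(f"{elapsed} s, {throughput:.2f} samples/s, +{gain:.4f} accuracy")
```

The direct run works out to roughly 3.3 samples per second, and calibration adds about 8.9 accuracy points over the plain direct method.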

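To-do

Channel method: implemented
Experiment report
- Done: direct method, channel method
- Not done: generation method
Implement other calibration methods
Support other huggingface datasets
Implement LLM.int8
Evaluate other characteristics of the foundation model (LLaMA)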

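Final remarks

Thanks to the LLaMA project team for releasing the checkpoints and their efficient inference code; this project is based mainly on the official repository. Questions and suggestions are welcome as issues or pull requests, including feature requests, detailed implementation questions, and discussion of research directions.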

Citation

If you use this codebase in your research, please cite:

@software{Lee_Simple_Text_Classification_2023,
    author = {Lee, Seonghyeon},
    month = {3},
    title = {{Simple Text Classification Codebase using LLaMA}},
    version = {1.1.0},
    year = {2023}
}