MobileLLM

本仓库包含了我们在ICML 2024发表的论文"MobileLLM: 优化小于十亿参数的语言模型用于设备端场景"中介绍的MobileLLM的训练代码。

在这项工作中，我们全面考虑了多个设计因素，以获得高质量的、参数少于十亿的大语言模型。我们整合了(1) SwiGLU激活函数、(2)深而窄的架构、(3)嵌入共享、(4)分组查询注意力来构建MobileLLM。MobileLLM-125M/350M在零样本常识推理任务上比之前的125M/350M最先进模型分别提高了2.7%/4.3%的准确率。在我们更新的版本中，我们进一步证明了我们的设计理念可以有效扩展到更大的模型，MobileLLM-600M/1B/1.5B取得了最先进的结果。

引用

如果您发现我们的代码对您的研究有用，请考虑引用：

@article{liu2024mobilellm,
    title={MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases},
    author={Liu, Zechun and Zhao, Changsheng and Iandola, Forrest和Lai, Chen和Tian, Yuandong和Fedorov, Igor和Xiong, Yunyang和Chang, Ernie和Shi, Yangyang和Krishnamoorthi, Raghuraman和others},
    journal={arXiv preprint arXiv:2402.14905},
    year={2024}
}

运行

步骤1. 要求：

python 3.9, pytorch >= 2.0
pip install -r requirement.txt

步骤2. 数据预处理

将分词后的数据集划分或对您自己的数据集进行分词，并均匀分布到总训练节点数量上，每个节点由1x8个GPU组成。然后，将数据组织成以下结构：

basepath
- 1
  - xxx.jsonl
- 2
  - xxx.jsonl
- ...
- #nodes
  - xxx.jsonl

jsonl文件的每一行是一个分词数据的键值对{"token_ids": [1,2,3,4,...]}。

我们的训练代码与https://github.com/LLM360/amber-data-prep中的数据预处理方法兼容。

步骤3. 训练脚本

提供了pretrain.sh脚本，用于在1x8节点设置上使用torchrun启动训练。可以修改此脚本以调整--nnodes参数和其他设置，以适应不同的多节点配置，如使用slurm或torchx。脚本中的学习率适用于1x8节点，批量大小为32。如果您增加节点数量或批量大小，需要线性增加学习率。

运行步骤：

在pretrain.sh文件中，指定--train_data_local_path为步骤2中预处理的数据路径，并将--input_model_filename指定为./configs/{model_size}/。
运行bash pretrain.sh

其他

模型权重仍在法律审查中。如有任何问题，请随时发送电子邮件至(zechunliu at meta dot com)和(cszhao at meta dot com)

训练成本

使用32个NVIDIA A100 80G GPU在1T个token上训练MobileLLM需要以下天数。

125M	350M	600M	1B	1.5B
~3天	~6天	~8天	~12天	~18天

零样本常识推理任务结果

MobileLLM-125M

模型	boolq	piqa	siqa	hellaswag	winogrande	arc_easy	arc_challenge	obqa	平均
OPT-125M	41.3	25.2	57.5	62.0	41.9	31.1	31.2	50.8	42.6
GPT-neo-125M	40.7	24.8	61.3	62.5	41.9	29.7	31.6	50.7	42.9
Pythia-160M	40.0	25.3	59.5	62.0	41.5	29.9	31.2	50.9	42.5
MobileLLM-125M	43.9	27.1	60.2	65.3	42.4	38.9	39.5	53.1	46.3
MobileLLM-LS-125M	45.8	28.7	60.4	65.7	42.9	39.5	41.1	52.1	47.0

MobileLLM-350M

模型	boolq	piqa	siqa	hellaswag	winogrande	arc_easy	arc_challenge	obqa	平均
OPT-350M	41.9	25.7	54.0	64.8	42.6	36.2	33.3	52.4	43.9
Pythia-410M	47.1	30.3	55.3	67.2	43.1	40.1	36.2	53.4	46.6
MobileLLM-350M	53.8	33.5	62.4	68.6	44.7	49.6	40.0	57.6	51.3
MobileLLM-LS-350M	54.4	32.5	62.8	69.8	44.1	50.6	45.8	57.2	52.1

MobileLLM-600M

模型	boolq	piqa	siqa	hellaswag	winogrande	arc_easy	arc_challenge	obqa	平均
Qwen1.5-500M	54.7	32.1	46.9	68.9	46.0	48.8	37.7	55.0	48.8
BLOOM-560M	43.7	27.5	53.7	65.1	42.5	36.5	32.6	52.2	44.2
MobiLlama-800M	52.0	31.7	54.6	73.0	43.3	52.3	42.5	56.3	50.7
MobileLLM-600M	58.1	35.8	61.0	72.3	44.9	55.9	47.9	58.6	54.3

MobileLLM-1B

模型	boolq	piqa	siqa	hellaswag	winogrande	arc_easy	arc_challenge	obqa	平均
Pythia-1B	49.9	30.4	58.7	69.2	43.3	47.4	38.6	52.2	48.7
MobiLlama-1B	59.7	38.4	59.2	74.5	44.9	62.0	43.7	59.0	55.2
Falcon-1B	59.5	38.4	63.9	74.6	44.6	62.9	45.6	60.9	56.3
BLOOM-1.1B	47.6	27.3	58.6	67.0	42.4	42.2	36.6	53.8	46.9
TinyLlama-1.1B	59.2	37.1	58.1	72.9	43.9	59.1	44.7	58.8	54.2
MobileLLM-1B	63.0	39.0	66.7	74.4	45.0	61.4	46.8	62.3	57.3

MobileLLM-1.5B

模型	boolq	piqa	siqa	hellaswag	winogrande	arc_easy	arc_challenge	obqa	平均
GPT-neo-1.3B	51.3	33.0	61.8	70.9	43.7	48.6	41.2	54.5	50.6
OPT-1.3B	54.4	31.7	58.4	71.5	44.7	53.7	44.6	59.1	52.3
BLOOM-1.7B	50.9	31.2	61.7	70.0	43.2	47.2	36.2	56.1	49.6
Qwen1.5-1.8B	61.1	36.5	68.3	74.1	47.2	60.4	42.9	61.2	56.5
GPT-neo-2.7B	55.8	34.3	62.4	72.9	43.6	55.6	40.0	57.9	52.8
OPT-2.7B	56.6	34.6	61.8	74.5	45.6	60.2	48.2	59.6	55.1
Pythia-2.8B	59.4	38.9	66.1	73.8	44.5	59.6	45.0	59.4	55.8
BLOOM-3B	55.1	33.6	62.1	70.5	43.2	53.9	41.6	58.2	52.3
MobileLLM-1.5B	67.5	40.9	65.7	74.8	46.4	64.5	50.5	64.7	59.4