EAGLE

<img src="https://yellow-cdn.veclightyear.com/ab5030c0/c2f505d4-ea51-4cf2-b938-85e7aae6759c.png" alt="EAGLE" width="220" align="left"><div align="center"><h1> EAGLE</h1></div>

| <a href="https://arxiv.org/pdf/2401.15077.pdf">论文 (EAGLE)</a> | <a href="https://arxiv.org/pdf/2406.16858">论文 (EAGLE-2)</a> | <a href="https://sites.google.com/view/ eagle-llm">博客</a> | <a href="https://huggingface.co/spaces/yuhuili/EAGLE-2">演示</a> | <a href=""> <img src="https://yellow-cdn.veclightyear.com/ab5030c0/c08be02e-cc73-48d7-8a1b-02c259e568b7.svg" alt="版本"> </a> <a href="https://opensource.org/licenses/Apache-2.0"> <img src="https://yellow-cdn.veclightyear.com/ab5030c0/1fe2d9fa-1a47-40bf-9887-d36fa9acc4df.svg" alt="许可证"> </a> <a href="https://github.com/SafeAILab/EAGLE/issues"> <img src="https://yellow-cdn.veclightyear.com/ab5030c0/bda22bb8-a693-424d-be3f-062333d9ee6b.svg" alt="维护"> </a> <a href="https://github.com/SafeAILab/EAGLE/pulls"> <img src="https://yellow-cdn.veclightyear.com/ab5030c0/bf429610-d166-49e9-b0e1-8a3269fd3ce4.svg?style=flat" alt="欢迎贡献"> </a>

EAGLE（更高效语言模型外推算法）是一种新的基准，用于快速解码大型语言模型（LLMs），并保证性能不会下降。这种方法通过外推LLMs的次顶层上下文特征向量，显著提高了生成效率。

EAGLE 特点：
- 经<a href="https://github.com/hemingkx/Spec-Bench/blob/main/Leaderboard.md">第三方</a>评估认证为目前最快的推测方法。
- 在<a href="https://github.com/pytorch-labs/gpt-fast">gpt-fast</a>上实现2倍速度提升。
- 比普通解码快3倍（13B）。
- 比<a href="https://lmsys.org/blog/2023-11-21-lookahead-decoding/">Lookahead</a>快2倍（13B）。
- 比<a href="https://sites.google.com/view/medusa-llm">Medusa</a>快1.6倍（13B）。
- 可证明地保持与普通解码在生成文本分布上的一致性。
- 可在8块RTX 3090 GPU上训练（1-2天内）和测试。即使GPU资源有限也能负担得起。
- 可与其他并行技术结合，如vLLM、DeepSpeed、Mamba、FlashAttention、量化和硬件优化。

EAGLE-2使用草稿模型的置信度分数来近似接受率，动态调整草稿树结构，进一步提升性能。

EAGLE-2 特点：
- 比普通解码快4倍（13B）。
- 比EAGLE-1快1.4倍（13B）。

使用EAGLE-2，在2块RTX 3060 GPU上的推理速度可以比在A100 GPU上的普通自回归解码更快。

更新

2024.8.8：我们现在支持Qwen-2。

2024.6.27：EAGLE-2发布。

2024.2.25：EAGLE被<a href="https://github.com/hemingkx/Spec-Bench/blob/main/Leaderboard.md">第三方</a>评估认证为最快的推测方法。

2024.1.17：我们现在支持Mixtral-8x7B-Instruct。

2023.12.8：EAGLE v1.0发布。

待办事项

支持非贪婪推理（可证明地保持文本分布）。
支持更多LLMs，如Mixtral 8x7B。
支持LLaMA-3。
支持Qwen-2。

默认主分支是EAGLE-2的实现。如要使用EAGLE-1，请切换到v1分支。

设置与安装

git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt

EAGLE权重

注意： 当Qwen2为目标模型时，请使用bf16精度而非fp16以避免数值溢出。Qwen2的草稿模型训练数据集为ShareGPT，已删除非英语数据。因此，如果要在非英语数据（如中文）上使用，请使用相应数据进行训练。

与EAGLE相比，EAGLE-2无需额外训练，使用相同的权重。

基础模型	EAGLE在Hugging Face上	EAGLE参数数量	基础模型	EAGLE在Hugging Face上	EAGLE参数数量
Vicuna-7B-v1.3	yuhuili/EAGLE-Vicuna-7B-v1.3	0.24B	LLaMA2-Chat 7B	yuhuili/EAGLE-llama2-chat-7B	0.24B
Vicuna-13B-v1.3	yuhuili/EAGLE-Vicuna-13B-v1.3	0.37B	LLaMA2-Chat 13B	yuhuili/EAGLE-llama2-chat-13B	0.37B
Vicuna-33B-v1.3	yuhuili/EAGLE-Vicuna-33B-v1.3	0.56B	LLaMA2-Chat 70B	yuhuili/EAGLE-llama2-chat-70B	0.99B
Mixtral-8x7B-Instruct-v0.1	yuhuili/EAGLE-mixtral-instruct-8x7B	0.28B
LLaMA3-Instruct 8B	yuhuili/EAGLE-LLaMA3-Instruct-8B	0.25B	LLaMA3-Instruct 70B	yuhuili/EAGLE-LLaMA3-Instruct-70B	0.99B
Qwen2-7B-Instruct	yuhuili/EAGLE-Qwen2-7B-Instruct	0.26B	Qwen2-72B-Instruct	yuhuili/EAGLE-Qwen2-72B-Instruct	1.05B

推理

我们提供的推理代码自动分配模型权重（在多个GPU上加载模型），使您可以运行超出单个GPU内存的模型。

使用界面

我们提供了一个建议的网页界面，您可以通过运行以下命令来使用。模型完全加载后，终端会输出一个URL，您可以在浏览器中输入该URL进行访问。

python -m eagle.application.webui --ea-model-path [EAGLE权重路径]\ 
		--base-model-path [原始模型路径]\
		--model-type [vicuna\llama2\llama3]\
        --total-token [整数]

total-token是草稿token的数量。对于较小的模型和高级GPU，这个值可以设置得更大。根据具体设备和模型进行调整可以达到更好的效果。如果设置为-1，EAGLE-2将自动配置此参数。

使用代码

您可以使用我们提供的"eagenerate"来加速生成，就像使用Hugging Face的'generate'一样。以下是一个示例。

from eagle.model.ea_model import EaModel
from fastchat.model import get_conversation_template
model = EaModel.from_pretrained(
    base_model_path=base_model_path,
    ea_model_path=EAGLE_model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    total_token=-1
)
model.eval()
your_message="Hello"
conv = get_conversation_template("vicuna")
conv.append_message(conv.roles[0], your_message)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids=model.tokenizer([prompt]).input_ids
input_ids = torch.as_tensor(input_ids).cuda()
output_ids=model.eagenerate(input_ids,temperature=0.5,max_new_tokens=512)
output=model.tokenizer.decode(output_ids[0])

注意：Vicuna、LLaMA2-Chat和LLaMA3-Instruct都是聊天模型。您需要使用正确的聊天模板，否则会导致模型输出异常并影响EAGLE的性能。

训练

生成训练数据

您可以运行以下命令来生成训练数据。

python -m eagle.ge_data.allocation --outdir [数据路径]

训练自回归头

accelerate launch -m --mixed_precision=bf16 eagle.train.main --tmpdir [数据路径]\
--cpdir [检查点路径] --configpath [配置文件路径]

eagle/train提供了配置文件的示例。

您也可以使用DeepSpeed进行训练。

cd eagle/train
deepspeed main_deepspeed.py --deepspeed_config ds_config.json

在自定义模型上进行推理

如果原始LLM结构与LLaMA和Mixtral不同，您可以按以下方式使用EAGLE：

从Transformers库中复制modeling_basemodelname.py，然后进行修改以利用预分配的kv_cache来提高基础模型的速度。您可以参考model/modeling_llama_kv.py获取指导，需要修改的地方用# [MODIFIED]标注。这些修改非常少。

评估

您可以使用以下命令在MT-bench上测试EAGLE的速度。

python -m eagle.evaluation.gen_ea_answer_vicuna(或gen_ea_answer_vicuna_llama2chat)\
		 --ea-model-path [EAGLE权重路径]\ 
		 --base-model-path [原始模型路径]\

如果您需要具体的加速比，还需要运行以下命令来获取普通自回归的速度。

python -m eagle.evaluation.gen_baseline_answer_vicuna\
		(或gen_ea_answer_vicuna_llama2chat)\
		 --ea-model-path [EAGLE权重路径]\ 
		 --base-model-path [原始模型路径]\

上述两个命令each会生成一个.jsonl文件，记录生成结果和wall time。然后，您可以使用evaluation/speed.py来计算速度比。

🌟 我们的贡献者

衷心感谢所有贡献者。

贡献者

参考文献

有关技术细节和完整实验结果，请查看EAGLE论文和EAGLE-2论文。

@inproceedings{li2024eagle, 
	author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang}, 
	title = {EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty}, 
	booktitle = {International Conference on Machine Learning},
	year = {2024}
}
@misc{li2024eagle2fasterinferencelanguage,
      title={EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees}, 
      author={Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
      year={2024},
      eprint={2406.16858},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.16858}, 
}