Phi-3-medium-128k-instruct-quantized.w4a16

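Introduction to Phi-3-medium-128k-instruct-quantized.w4a16

Model Overview

Phi-3-medium-128k-instruct-quantized.w4a16 is a text generation model based on the Phi-3 architecture. It takes text as input and produces text as output. Its weights are quantized to the INT4 data type, and it is intended for commercial and research use, in particular for assistant-style chat in English.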

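Model Optimizations

The model was obtained by quantizing the weights to INT4. This lowers the number of bits per weight (from 16 to 4 in the W4A16 scheme, where activations stay at 16 bits) and so reduces disk size and GPU memory requirements. The cost is a small drop in accuracy: the average score on the OpenLLM benchmark falls from 74.46 for the unquantized model to 72.38.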
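As a rough illustration of the saving, the sketch below estimates weight storage at 16 and 4 bits per parameter. The 14-billion parameter count is an assumption taken from the public description of Phi-3-medium, not from this page, and the helper name `weight_gigabytes` is hypothetical. The estimate ignores quantization scales and any layers kept at higher precision (such as `lm_head`), so a real checkpoint will be somewhat larger.

```python
def weight_gigabytes(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 14e9  # assumption: approximate Phi-3-medium parameter count

fp16_gb = weight_gigabytes(NUM_PARAMS, 16)
int4_gb = weight_gigabytes(NUM_PARAMS, 4)
reduction = 1 - int4_gb / fp16_gb

print(f"16-bit weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights:  ~{int4_gb:.0f} GB")
print(f"reduction:      ~{reduction:.0%}")
```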

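Intended Use

Phi-3-medium-128k-instruct-quantized.w4a16 is intended for commercial and research use cases that require English text generation, and it is designed primarily for assistant-style chat. It must not be used in any manner that violates applicable laws or regulations, and it is not intended for languages other than English.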

Deployment

Using vLLM

The model can be deployed efficiently with the vLLM backend. A basic example:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16"

# Sampling settings used for generation
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you? Please respond in pirate speak."},
]

# Render the chat messages into one prompt string with the model's chat template
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Shard the model across two GPUs
llm = LLM(model=model_id, tensor_parallel_size=2)

outputs = llm.generate(prompts, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving; see the vLLM documentation for details.
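For reference, a minimal sketch of calling such a server from Python follows. It assumes the server was started separately (in recent vLLM versions, for example with `vllm serve neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16 --tensor-parallel-size 2`) and that it listens on `http://localhost:8000`. The helper names `build_chat_request` and `send_chat_request` are illustrative, and only the standard library is used.

```python
import json
import urllib.request

MODEL_ID = "neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16"
BASE_URL = "http://localhost:8000/v1"  # assumption: default host and port


def build_chat_request(user_message: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.6,
        "top_p": 0.9,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("Who are you? Please respond in pirate speak.")
print(json.dumps(payload, indent=2))
# With a server running: print(send_chat_request(payload))
```

The request is only built and printed here; the commented-out last line is the part that needs a live server.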

Using transformers

The following example shows how to run the model in Transformers with the generate() function:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you? Please respond in pirate speak."},
]

# Tokenize the chat and move the input IDs to the model's device
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Stop on the EOS token or on Phi-3's end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|end|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Drop the prompt tokens and decode only the newly generated ones
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
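Both examples sample with `temperature=0.6` and `top_p=0.9`. The following is a simplified, library-independent sketch of how those two settings reshape a next-token distribution. It is not the code that vLLM or Transformers run internally, and the function name and toy logits are made up for illustration.

```python
import math


def temperature_top_p(logits: dict[str, float], temperature: float, top_p: float) -> dict[str, float]:
    """Return renormalized probabilities after temperature scaling and nucleus filtering."""
    # Temperature scaling: dividing logits by T < 1 sharpens the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Nucleus (top-p): keep the most likely tokens until their mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize over the kept tokens so they sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


toy_logits = {"Arr": 2.0, "Ahoy": 1.5, "Hello": 0.5, "Greetings": -1.0}
filtered = temperature_top_p(toy_logits, temperature=0.6, top_p=0.9)
for token, prob in filtered.items():
    print(f"{token}: {prob:.3f}")
```

A token is then drawn at random from the filtered distribution, so low-probability tokens outside the nucleus can never be selected.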

Creation

Phi-3-medium-128k-instruct-quantized.w4a16 was created with the llm-compressor library, as follows:

from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from datasets import load_dataset

model_id = "microsoft/Phi-3-medium-128k-instruct"

# Calibration settings
num_samples = 512
max_seq_len = 4096

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap each calibration sample in an instruction-style prompt
preprocess_fn = lambda example: {"text": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n{text}".format_map(example)}

dataset_name = "neuralmagic/LLM_compression_calibration"
dataset = load_dataset(dataset_name, split="train")
ds = dataset.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# Tokenized copies of the samples (oneshot below receives ds directly)
examples = [
    tokenizer(
        example["text"], padding=False, max_length=max_seq_len, truncation=True,
    ) for example in ds
]

# GPTQ recipe: 4-bit weights, 16-bit activations, lm_head left unquantized
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1,
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

# Run one-shot calibration and quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

model.save_pretrained("Phi-3-medium-128k-instruct-quantized.w4a16")
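To make the `W4A16` scheme more concrete, here is a toy sketch of symmetric 4-bit weight quantization with one scale per group of weights. It uses plain round-to-nearest, which is simpler than GPTQ (GPTQ additionally adjusts the remaining weights to compensate for rounding error, using the calibration data). The group size of 4, the function names, and the sample weights are illustrative assumptions rather than values taken from llm-compressor.

```python
def quantize_group(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric INT4 round-to-nearest: integers in [-8, 7] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q: list[int], scale: float) -> list[float]:
    """Map the stored integers back to floating-point weights."""
    return [value * scale for value in q]


def quantize_row(row: list[float], group_size: int = 4) -> list[tuple[list[int], float]]:
    """Quantize a weight row group by group (group_size is illustrative)."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g) for g in groups]


row = [0.12, -0.40, 0.33, 0.05, 1.50, -0.70, 0.20, -1.10]
quantized = quantize_row(row)

restored = []
for q, scale in quantized:
    restored.extend(dequantize_group(q, scale))

max_error = max(abs(a - b) for a, b in zip(row, restored))
print("restored:", [round(v, 3) for v in restored])
print("max abs error:", round(max_error, 4))
```

Because each group has its own scale, the first group (small weights) is reconstructed much more precisely than it would be with a single scale shared across the whole row.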

Evaluation

Phi-3-medium-128k-instruct-quantized.w4a16 was evaluated on the OpenLLM leaderboard tasks with the following command:

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16",dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto

Accuracy

Open LLM Leaderboard evaluation scores:

Benchmark	Phi-3-medium-128k-instruct	Phi-3-medium-128k-instruct-quantized.w4a16	Recovery
MMLU (5-shot)	75.63	75.54	99.89%
ARC Challenge (25-shot)	67.57	67.06	94.25%
GSM-8K (5-shot, strict-match)	83.32	82.18	98.64%
Hellaswag (10-shot)	84.36	84.04	99.62%
Winogrande (5-shot)	75.45	72.85	96.55%
TruthfulQA (0-shot)	53.54	52.64	98.31%
Average	74.46	72.39	97.21%
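
The recovery column appears to be the quantized score expressed as a percentage of the baseline score. A small sketch of that calculation, using two rows copied from the table (the helper name `recovery_percent` is illustrative):

```python
def recovery_percent(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the unquantized baseline."""
    return quantized / baseline * 100


rows = {
    "Hellaswag (10-shot)": (84.36, 84.04),
    "Winogrande (5-shot)": (75.45, 72.85),
}

for name, (baseline, quantized) in rows.items():
    print(f"{name}: {recovery_percent(baseline, quantized):.2f}%")
```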