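Intel® Neural Compressor

<h3>An open-source Python library that supports popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, and ONNX Runtime)</h3>

Intel® Neural Compressor aims to provide popular model compression techniques, such as quantization, pruning (sparsity), distillation, and neural architecture search, on mainstream frameworks such as TensorFlow, PyTorch, and ONNX Runtime, as well as on Intel extensions such as Intel Extension for TensorFlow and Intel Extension for PyTorch. In particular, the tool provides the following key features, typical examples, and open collaborations: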

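Support for a wide range of Intel hardware with extensive testing, including Intel Gaudi AI Accelerators, Intel Core Ultra Processors, Intel Xeon Scalable Processors, Intel Xeon CPU Max Series, Intel Data Center GPU Flex Series, and Intel Data Center GPU Max Series; support for AMD CPUs, ARM CPUs, and NVidia GPUs through ONNX Runtime with limited testing; support for NVidia GPUs with some weight-only quantization (WOQ) algorithms such as AutoRound and HQQ.
Validation of popular LLMs such as Llama 2, Falcon, GPT-J, Bloom, and OPT, and of more than 10,000 broad models such as Stable Diffusion, BERT-Large, and ResNet50 from popular model hubs such as Hugging Face, Torch Vision, and ONNX Model Zoo, using automatic accuracy-driven quantization strategies.
Collaboration with cloud marketplaces such as Google Cloud Platform, Amazon Web Services, and Azure, software platforms such as Alibaba Cloud, Tencent TACO, and Microsoft Olive, and the open AI ecosystem such as Hugging Face, PyTorch, ONNX, ONNX Runtime, and Lightning AI.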

Installation

Install Framework

Install torch for CPU

pip install torch --index-url https://download.pytorch.org/whl/cpu

Use a Docker image with torch pre-installed for HPU

https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#bare-metal-fresh-os-single-click

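Note: There is a version mapping between Intel Neural Compressor and the Gaudi Software Stack. Refer to this table and make sure to use a matched combination.

Install torch/intel_extension_for_pytorch for Intel GPU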

https://intel.github.io/intel-extension-for-pytorch/index.html#installation

Install torch for other platforms

https://pytorch.org/get-started/locally

Install tensorflow

pip install tensorflow

Install from PyPI

# Install 2.X API + Framework extension API + PyTorch dependency
pip install neural-compressor[pt]
# Install 2.X API + Framework extension API + TensorFlow dependency
pip install neural-compressor[tf]

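Note: Further installation methods can be found in the Installation Guide. Check out our FAQ for more details.

Getting Started

Set up the environment: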

pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision
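To confirm that the packages from the command above are visible to the current interpreter, a minimal sketch using only the standard-library `importlib.metadata` module might look like this. The helper names `installed_version` and `report` are illustrative and are not part of Intel Neural Compressor.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages):
    """Map each distribution name to its installed version (None when not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # Distribution names taken from the pip command above
    names = ["neural-compressor", "transformers", "torch", "torchvision"]
    for name, version in report(names).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any entry reported as `NOT INSTALLED` points to a failed or skipped install step; comparing the printed versions against the minimums in the pip command is left to the reader.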

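After successfully installing these packages, try your first quantization program.

FP8 Quantization

The following example code demonstrates FP8 quantization, which is supported by the Intel Gaudi2 AI Accelerator.

To try it on Intel Gaudi2, a Docker image with the Gaudi Software Stack is recommended; refer to the following script for environment setup. More details can be found in the Gaudi Guide.

# Run a container with an interactive shell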
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest

Run the example:

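from neural_compressor.torch.quantization import (
    FP8Config,
    prepare,
    convert,
)
import torch
import torchvision.models as models


def calib_func(model):
    # User-defined calibration. This placeholder (an assumption for illustration)
    # runs one random batch; replace it with representative samples placed on
    # the same device as the model.
    with torch.no_grad():
        model(torch.randn(1, 3, 224, 224))


model = models.resnet18()
qconfig = FP8Config(fp8_config="E4M3")
# Prepare the model for calibration
model = prepare(model, qconfig)
calib_func(model)
# Convert the calibrated model to FP8
model = convert(model)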
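The `E4M3` setting in the configuration above names an 8-bit floating-point layout with 1 sign bit, 4 exponent bits, and 3 mantissa bits. To give a feel for how coarse that grid is, the sketch below rounds a Python float to the nearest E4M3 value in pure Python. It assumes the OCP FP8 E4M3 variant (exponent bias 7, largest finite value 448, saturation on overflow); the range a given accelerator actually uses may differ. The function `quantize_e4m3` is a hypothetical illustration and is not part of Intel Neural Compressor.

```python
import math

E4M3_MAX = 448.0     # largest finite value (assumed OCP E4M3 variant)
E4M3_MIN_EXP = -6    # exponent of the smallest normal number, 2**-6
MANTISSA_BITS = 3


def quantize_e4m3(x):
    """Round x to the nearest value on the assumed E4M3 grid, saturating at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)            # saturate instead of overflowing
    _, e = math.frexp(mag)                 # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)         # subnormals share the minimum exponent
    step = 2.0 ** (exp - MANTISSA_BITS)    # spacing of representable values here
    return sign * round(mag / step) * step  # round() breaks ties to even


if __name__ == "__main__":
    for value in (0.3, 1.0, 300.0, 1000.0):
        print(value, "->", quantize_e4m3(value))
```

Under these assumptions 0.3 maps to 0.3125 and 300 maps to 288, which shows why a calibration step is needed to choose scales that keep tensors inside the well-resolved part of the range.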

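Weight-Only Large Language Model Loading (LLMs)

The following example code demonstrates weight-only large language model loading on the Intel Gaudi2 AI Accelerator.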

import torch
from neural_compressor.torch.quantization import load

model_name = "TheBloke/Llama-2-7B-GPTQ"
model = load(
    model_name_or_path=model_name,
    format="huggingface",
    device="hpu",
    torch_dtype=torch.bfloat16,
)

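Note:

On the first load, Intel Neural Compressor converts the model format from auto-gptq to hpu format and saves hpu_model.safetensors to the local cache directory for subsequent loads. As a result, the first load may take some time.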
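One way to observe the effect of the cached file is to time consecutive loads. The helper below is a generic, standard-library-only sketch: `time_calls` is a hypothetical name, it accepts any zero-argument callable, and nothing in it is specific to Intel Neural Compressor.

```python
import time


def time_calls(fn, repeats=2):
    """Call `fn` `repeats` times; return (last_result, [seconds taken by each call])."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
    return result, durations


if __name__ == "__main__":
    # Stand-in workload; on a Gaudi machine one could instead pass a lambda
    # that wraps the load(...) call from the example above.
    _, seconds = time_calls(lambda: sum(range(100_000)))
    print([f"{s:.4f}s" for s in seconds])
```

If the conversion result is reused as described, the second measured duration should be noticeably shorter than the first. Loading a large model twice in one process can be memory-intensive, so running the script twice and comparing the first measurement of each run is a gentler alternative.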

Documentation

<table class="docutils"> <thead> <tr> <th colspan="8">Overview</th> </tr> </thead> <tbody> <tr> <td colspan="2" align="center"><a href="./docs/source/3x/design.md#architecture">Architecture</a></td> <td colspan="2" align="center"><a href="./docs/source/3x/design.md#workflows">Workflow</a></td> <td colspan="2" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td> <td colspan="1" align="center"><a href="./docs/source/3x/llm_recipes.md">LLM Recipes</a></td> <td colspan="1" align="center"><a href="./examples/3.x_api/README.md">Examples</a></td> </tr> </tbody> <thead> <tr> <th colspan="8">PyTorch Extension APIs</th> </tr> </thead> <tbody> <tr> <td colspan="2" align="center"><a href="./docs/source/3x/PyTorch.md">Overview</a></td> <td colspan="2" align="center"><a href="./docs/source/3x/PT_DynamicQuant.md">Dynamic Quantization</a></td> <td colspan="2" align="center"><a href="./docs/source/3x/PT_StaticQuant.md">Static Quantization</a></td> <td colspan="2" align="center"><a href="./docs/source/3x/PT_SmoothQuant.md">Smooth Quantization</a></td> </tr> <tr> <td colspan="2" align="center"><a href="./docs/source/3x/PT_WeightOnlyQuant.md">Weight-Only Quantization</a></td> <td colspan="2" align="center"><a href="./docs/source/3x/PT_FP8Quant.md">FP8 Quantization</a></td> <td colspan="2" align="center"><a href="./docs/source/3x/PT_MXQuant.md">MX Quantization</a></td> <td colspan="2" align="center"><a href="./docs/source/3x/PT_MixedPrecision.md">Mixed Precision</a></td> </tr> </tbody> <thead> <tr> <th colspan="8">TensorFlow Extension APIs</th> </tr> </thead> <tbody> <tr> <td colspan="3" align="center"><a href="./docs/source/3x/TensorFlow.md">Overview</a></td> <td colspan="3" align="center"><a href="./docs/source/3x/TF_Quant.md">Static Quantization</a></td> <td colspan="2" align="center"><a href="./docs/source/3x/TF_SQ.md">Smooth Quantization</a></td> </tr> </tbody> <thead> <tr> <th colspan="8">Other Modules</th> </tr> </thead> <tbody> <tr> <td colspan="4" align="center"><a href="./docs/source/3x/autotune.md">Auto Tune</a></td> <td colspan="4" align="center"><a href="./docs/source/3x/benchmark.md">Benchmark</a></td> 
</tr> </tbody> </table>