<a href="https://github.com/labteral/ernie#stickers-by-sticker-mule" alt="Stickers section"><img src="https://yellow-cdn.veclightyear.com/0a4dffa0/1eac4ce8-5b99-4577-9c60-6ac240603b97.png" alt="Ernie logo" width="150"/></a> <a href="https://pepy.tech/project/ernie/"><img alt="Downloads" src="https://img.shields.io/badge/dynamic/json?style=flat-square&maxAge=3600&label=下载量&query=$.total_downloads&url=https://analytics.pepy.tech/api/v2/projects/ernie"></a> <a href="https://pypi.python.org/pypi/ernie/"><img alt="PyPi" src="https://yellow-cdn.veclightyear.com/0a4dffa0/9e51eac6-64ed-4c71-97d8-5b1ec1ffd707.svg?style=flat-square"></a> <a href="https://github.com/labteral/ernie/releases"><img alt="GitHub release" src="https://yellow-cdn.veclightyear.com/0a4dffa0/334006ae-4170-438d-9be2-1ee473c0e370.svg?style=flat-square"></a> <a href="https://github.com/labteral/ernie/blob/master/LICENSE"><img alt="License" src="https://yellow-cdn.veclightyear.com/0a4dffa0/f50cebb4-412a-4f3b-8a50-4c32d64962f2.svg?style=flat-square"></a> <h3 align="center"> BERT's best friend. </h3> <a href="https://www.buymeacoffee.com/brunneis" target="_blank"><img src="https://yellow-cdn.veclightyear.com/0a4dffa0/33b39140-b754-4ae1-9774-64737ca3fabf.png" alt="Buy me a coffee" height="35px"></a>

Sponsored by <a href="http://stickermule.com/supports/ernie20-sponsorship"><img src="https://yellow-cdn.veclightyear.com/0a4dffa0/979a9358-b1dc-42b2-bc97-aaab96b809ae.png" alt="Sticker Mule logo" width="80px"/></a>

Installation

Ernie requires Python 3.6 or higher.

pip install ernie

Fine-Tuning

Sentence Classification

from ernie import SentenceClassifier, Models
import pandas as pd

tuples = [
    ("This is a positive example. I'm very happy today.", 1),
    ("This is a negative sentence. Everything went wrong at work today.", 0)
]
df = pd.DataFrame(tuples)

classifier = SentenceClassifier(
    model_name=Models.BertBaseUncased,
    max_length=64,
    labels_no=2
)
classifier.load_dataset(df, validation_split=0.2)
classifier.fine_tune(
    epochs=4,
    learning_rate=2e-5,
    training_batch_size=32,
    validation_batch_size=64
)

Prediction

Predict a single text

text = "Oh, that's great!"

# Returns a tuple with the prediction
probabilities = classifier.predict_one(text)

Predict multiple texts

texts = ["Oh, that's great!", "That's really bad"]

# Returns a generator of tuples with the predictions
probabilities = classifier.predict(texts)
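
Because `predict` returns a generator, the results can be consumed only once. The sketch below shows one way to turn such a generator into class labels. It assumes that each tuple holds one probability per class, ordered by label index; `to_labels` and `fake_predictions` are hypothetical helpers, and `fake_predictions` merely stands in for `classifier.predict(texts)` so that the snippet runs without a trained model.

```python
from typing import Iterable, Iterator, Tuple


def to_labels(probabilities: Iterable[Tuple[float, ...]]) -> Iterator[int]:
    """Yield the index of the highest probability in each prediction tuple."""
    for probs in probabilities:
        yield max(range(len(probs)), key=probs.__getitem__)


def fake_predictions() -> Iterator[Tuple[float, float]]:
    """Hypothetical stand-in for classifier.predict(texts)."""
    yield (0.1, 0.9)
    yield (0.8, 0.2)


# Materialize the labels in a list if they are needed more than once.
labels = list(to_labels(fake_predictions()))
```

With a real classifier, `to_labels(classifier.predict(texts))` would be the equivalent call under the same assumption about the tuple layout.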

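Prediction Strategies

If a text is longer, in tokens, than the max_length the model was fine-tuned with, it will be truncated. To avoid losing information, you can use a split strategy and aggregate the predictions in different ways.

Split Strategies

SentencesWithoutUrls: the text is split into sentences.
GroupedSentencesWithoutUrls: the text is split into groups of sentences whose length in tokens is close to max_length.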
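
To illustrate the idea behind grouped splitting, here is a minimal sketch that splits a text into sentences and greedily packs them into groups. It is not Ernie's implementation: `split_sentences` and `group_sentences` are hypothetical helpers, tokens are counted by whitespace rather than with the model's tokenizer, and URL removal is left out.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def group_sentences(sentences: List[str], max_length: int) -> List[str]:
    """Greedily pack consecutive sentences into groups of at most max_length tokens.

    Tokens are counted by whitespace splitting, which only approximates a
    subword tokenizer. A single sentence longer than max_length still forms
    its own group.
    """
    groups: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Close the current group when adding this sentence would overflow it.
        if current and current_len + n_tokens > max_length:
            groups.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        groups.append(" ".join(current))
    return groups


text = "I am happy. Work went well! Was it luck? Maybe not."
sentences = split_sentences(text)
groups = group_sentences(sentences, max_length=6)
```

Each group would then be classified separately, and the per-group predictions combined by an aggregation strategy.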

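Aggregation Strategies

Mean: the prediction for the text is the mean of the predictions of its splits.
MeanTopFiveBinaryClassification: the mean is computed over only the 5 highest predictions.
MeanTopTenBinaryClassification: the mean is computed over only the 10 highest predictions.
MeanTopFifteenBinaryClassification: the mean is computed over only the 15 highest predictions.
MeanTopTwentyBinaryClassification: the mean is computed over only the 20 highest predictions.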
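
The following sketch shows what these aggregations could look like in plain Python. It is an illustration under stated assumptions, not the library's code: `aggregate_mean`, `top_n`, and `class_index` are hypothetical names, and "highest predictions" is assumed to mean the splits with the highest probability for the positive class (index 1).

```python
from typing import List, Optional, Sequence, Tuple


def aggregate_mean(
    split_predictions: Sequence[Tuple[float, ...]],
    top_n: Optional[int] = None,
    class_index: int = 1,
) -> Tuple[float, ...]:
    """Average per-split predictions class by class.

    When top_n is given, only the top_n splits with the highest probability
    for class_index are averaged (assumed meaning of the MeanTop* strategies).
    """
    rows: List[Tuple[float, ...]] = list(split_predictions)
    if not rows:
        raise ValueError("at least one split prediction is required")
    if top_n is not None:
        rows = sorted(rows, key=lambda row: row[class_index], reverse=True)[:top_n]
    # zip(*rows) regroups the values by class so each class is averaged separately.
    return tuple(sum(column) / len(column) for column in zip(*rows))


splits = [(0.75, 0.25), (0.25, 0.75), (0.5, 0.5)]
mean_all = aggregate_mean(splits)
mean_top_two = aggregate_mean(splits, top_n=2)
```

Restricting the mean to the top splits lets a few strongly positive passages outweigh many neutral ones, which can matter for long texts.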

from ernie import SplitStrategies, AggregationStrategies

texts = ["Oh, that's great!", "That's really bad"]
probabilities = classifier.predict(
    texts,
    split_strategy=SplitStrategies.GroupedSentencesWithoutUrls,
    aggregation_strategy=AggregationStrategies.Mean
)

You can define custom strategies through the AggregationStrategy and SplitStrategy classes.

from ernie import SplitStrategy, AggregationStrategy

# Signature sketch: replace each `...` placeholder with a value of the noted type.
my_split_strategy = SplitStrategy(
    split_patterns=...,           # list
    remove_patterns=...,          # list
    remove_too_short_groups=...,  # bool
    group_splits=...              # bool
)
my_aggregation_strategy = AggregationStrategy(
    method=...,               # function
    max_items=...,            # int
    top_items=...,            # bool
    sorting_class_index=...   # int
)

Save and Restore a Fine-Tuned Model

Save model

classifier.dump('./model')

Load model

classifier = SentenceClassifier(model_path='./model')

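Interrupted Training

Since execution may be interrupted during training (especially if you are using Google Colab), you can choose to save the model after every newly trained epoch, so that training can be resumed without losing all progress.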

classifier = SentenceClassifier(
    model_name=Models.BertBaseUncased,
    max_length=64
)
classifier.load_dataset(df, validation_split=0.2)

for epoch in range(1, 5):
    if epoch == 3:
        raise Exception("Forced crash")

    classifier.fine_tune(epochs=1)
    classifier.dump(f'./my-model/{epoch}')

# In a new session after the crash, resume from the last epoch that was saved
last_training_epoch = 2

classifier = SentenceClassifier(model_path=f'./my-model/{last_training_epoch}')
classifier.load_dataset(df, validation_split=0.2)

for epoch in range(last_training_epoch + 1, 5):
    classifier.fine_tune(epochs=1)
    classifier.dump(f'./my-model/{epoch}')
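
Rather than hard-coding `last_training_epoch`, the value could be derived from the directories written so far. The sketch below assumes the `./my-model/{epoch}` layout used in the example above; `last_saved_epoch` is a hypothetical helper, and the demo runs in a temporary directory in place of `./my-model`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def last_saved_epoch(model_dir: str) -> Optional[int]:
    """Return the highest epoch number that has a subdirectory in model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        return None
    epochs = [
        int(path.name)
        for path in root.iterdir()
        if path.is_dir() and path.name.isdecimal()
    ]
    return max(epochs, default=None)


# Demo: simulate two epochs that were saved before a crash.
with tempfile.TemporaryDirectory() as tmp:
    for epoch in (1, 2):
        (Path(tmp) / str(epoch)).mkdir()
    resumed_from = last_saved_epoch(tmp)
```

Note that this only checks that a directory exists. If a crash can happen while a model is being written, the newest directory may be incomplete and the previous one would be the safer choice.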

Autosaving

Even if you do not explicitly dump the model, it is autosaved into ./ernie-autosave every time fine_tune completes successfully.

ernie-autosave/
└── model_family/
    └── timestamp/
        ├── config.json
        ├── special_tokens_map.json
        ├── tf_model.h5
        ├── tokenizer_config.json
        └── vocab.txt

You can easily clean up the autosaved models by calling clean_autosave when you finish a session or start a new one.

from ernie import clean_autosave
clean_autosave()
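
Before cleaning up, it may be useful to locate the most recent autosave. The sketch below assumes the `model_family/timestamp` layout shown in the tree above and picks the directory with the newest modification time, since the timestamp format is not specified here. `latest_autosave` is a hypothetical helper, and the demo builds a throwaway directory tree with made-up names.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_autosave(root: str = "./ernie-autosave") -> Optional[Path]:
    """Return the most recently modified <model_family>/<timestamp> directory."""
    candidates = [path for path in Path(root).glob("*/*") if path.is_dir()]
    return max(candidates, key=lambda path: path.stat().st_mtime, default=None)


# Demo: two fake autosaves with explicit modification times.
with tempfile.TemporaryDirectory() as tmp:
    older = Path(tmp) / "bert" / "run-a"
    newer = Path(tmp) / "bert" / "run-b"
    for directory, mtime in ((older, 1_000), (newer, 2_000)):
        directory.mkdir(parents=True)
        os.utime(directory, (mtime, mtime))
    latest_name = latest_autosave(tmp).name
```

If autosaved directories have the same layout as those written by `dump`, which is an assumption, the returned path could be tried as the `model_path` argument of `SentenceClassifier`.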

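Supported Models

You can access some of the official base model names through the Models class. However, when instantiating a SentenceClassifier you can also type a HuggingFace model name directly, such as bert-base-uncased or bert-base-chinese.

See all the available models at huggingface.co/models.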

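Additional Information

Accessing the model and tokenizer

Once the classifier has been instantiated, you can access the model and tokenizer objects directly: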

classifier.model
classifier.tokenizer

Keras `model.fit` arguments

You can pass arguments of the Keras model.fit method to the classifier.fine_tune method. For example:

classifier.fine_tune(class_weight={0: 0.2, 1: 0.8})
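
The weights in the example above are fixed by hand. One common heuristic, sketched below, derives them from the label counts so that rarer classes receive larger weights: `n_samples / (n_classes * class_count)`. `balanced_class_weights` is a hypothetical helper and not part of Ernie.

```python
from collections import Counter
from typing import Dict, Iterable


def balanced_class_weights(labels: Iterable[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / (len(counts) * count) for label, count in counts.items()}


# Three negative examples and one positive example.
class_weight = balanced_class_weights([0, 0, 0, 1])
```

The resulting dictionary has the same shape as the one passed to `fine_tune` above, so it could be supplied as `class_weight=class_weight`.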
