distilbert-base-uncased

Introduction to the DistilBERT Base Model

Model overview

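DistilBERT is a Transformer model: a smaller, faster distilled version of the BERT base model. It was pretrained on the same corpus as its "teacher" BERT in a self-supervised fashion, meaning it uses only raw, unlabeled text, with inputs and labels generated from that text by an automatic process. Specifically, DistilBERT was pretrained with the following three objectives: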

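Distillation loss: the model is trained to return the same probabilities as the BERT base model.
Masked language modeling (MLM): this is part of the original training loss of the BERT base model. Given a sentence, 15% of the input tokens are randomly masked, the entire masked sentence is run through the model, and the model has to predict the masked tokens.
Cosine embedding loss: the model is also trained to produce hidden states as close as possible to those of the BERT base model.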
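
The three objectives above can be sketched in plain Python on toy inputs. This is only an illustrative sketch, not the actual training code: the softmax temperature, the equal loss weights, the simplified masking (every selected token becomes [MASK]), and all function names are assumptions made for the example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities (temperature is an assumed knob)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mlm_loss(student_logits, target_index):
    """Cross-entropy for one masked position with a known true token."""
    return -math.log(softmax(student_logits)[target_index])

def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask each token with probability mask_prob (simplified scheme)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # position not scored by the MLM loss
    return masked, labels

def combined_loss(distill, mlm, cosine, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return weights[0] * distill + weights[1] * mlm + weights[2] * cosine
```

When student and teacher agree exactly, the distillation and cosine terms are both zero, so only the MLM term contributes to the combined loss.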

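Trained this way, DistilBERT learns the same inner representation of the English language as its teacher model, while being faster at inference and on downstream tasks.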

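Intended uses and limitations

The raw model can be used for masked language modeling, but it is mainly intended to be fine-tuned on a downstream task. Next sentence prediction, which BERT supports, is not among DistilBERT's pretraining objectives. If you are interested in a particular task, look for fine-tuned versions in the model hub. Note that the model is primarily aimed at tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For text generation, a model such as GPT2 is a better fit.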

How to use

You can use the model directly with a pipeline for masked language modeling:

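from transformers import pipeline
# Build a fill-mask pipeline backed by the pretrained checkpoint.
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
# Returns candidate fills for the [MASK] token, each with a score.
unmasker("Hello I'm a [MASK] model.")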

To get the features of a given text in PyTorch:

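from transformers import DistilBertTokenizer, DistilBertModel
# Load the tokenizer and the bare encoder (no task-specific head).
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
text = "Replace me by any text you'd like."
# Tokenize into PyTorch tensors (input_ids and attention_mask).
encoded_input = tokenizer(text, return_tensors='pt')
# output.last_hidden_state holds one feature vector per token.
output = model(**encoded_input)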

To get the features of a given text in TensorFlow:

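from transformers import DistilBertTokenizer, TFDistilBertModel
# Load the tokenizer and the TensorFlow version of the bare encoder.
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
text = "Replace me by any text you'd like."
# Tokenize into TensorFlow tensors (input_ids and attention_mask).
encoded_input = tokenizer(text, return_tensors='tf')
# output.last_hidden_state holds one feature vector per token.
output = model(encoded_input)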
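
The feature snippets above yield one vector per token. One common way to reduce these to a single sentence vector is masked mean pooling, which this card does not prescribe; the sketch below is an assumption-laden illustration on toy nested lists, and the name mean_pool is hypothetical.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention_mask entry is 1.

    hidden_states: list of per-token vectors (lists of floats).
    attention_mask: list of 0/1 flags, one per token (0 marks padding).
    """
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]

# Toy data: three tokens with 2-dimensional features; the last is padding.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
sentence_vector = mean_pool(toy_hidden, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```

With the PyTorch snippet, the inputs could presumably be obtained from output.last_hidden_state[0].detach().tolist() and encoded_input['attention_mask'][0].tolist(); in practice the same pooling would normally be done directly on tensors.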