xglm-564M

Project Introduction: XGLM-564M

What is XGLM-564M?

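XGLM-564M is a multilingual autoregressive language model with 564 million parameters. It was trained on a large multilingual corpus that covers 30 languages and totals 500 billion sub-tokens. The model was developed for few-shot learning across languages and was introduced in the paper "Few-shot Learning with Multilingual Language Models". Its implementation is available in an open-source repository.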
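
Few-shot learning here means conditioning the model on a handful of solved examples placed in the prompt, followed by an unsolved query. The sketch below illustrates that idea only. The function name `build_few_shot_prompt`, the newline template, and the separator are hypothetical choices made for illustration and are not taken from the paper or the model card.

```python
def build_few_shot_prompt(demonstrations, query, separator="\n\n"):
    """Concatenate solved examples with an unsolved query into one prompt."""
    # Each demonstration is a (text, answer) pair; this template is an assumption.
    blocks = [f"{text}\n{answer}" for text, answer in demonstrations]
    # The query goes last and has no answer, so the model is left to continue it.
    blocks.append(query)
    return separator.join(blocks)


demos = [
    ("I wanted to conserve energy.", "I shut off the light in the unoccupied room."),
]
prompt = build_few_shot_prompt(demos, "The flame on the candle went out.")
print(prompt)
```

With an empty list of demonstrations the function returns the query unchanged, which corresponds to the zero-shot setting.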

Training Data Statistics

The training data for XGLM-564M spans many languages. The figures below give the details:

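English (en): the largest share, with about 803.5 billion tokens, or 32.59% of the data.
Russian (ru): about 147.7 billion tokens, or 6.02% of the data.
Chinese (zh): about 132.7 billion tokens, or 4.83% of the data.
German (de), Spanish (es), French (fr), and other languages also make up a substantial part of the training data.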

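Languages with smaller shares include Finnish, Turkish, Arabic, and Vietnamese. In total there are 30 languages, distributed in a relatively balanced way to support multilingual training.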

Model Usage Information

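Detailed usage guidance is available in the model card released by the XGLM-564M development team, which is an important resource for users of the model.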

Example Application (COPA Task)

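The following example applies XGLM-564M to the Choice of Plausible Alternatives (COPA) task. COPA is a reasoning benchmark in which the model must pick the more plausible of two alternatives. The Python code below runs a zero-shot evaluation on COPA: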

import torch
import torch.nn.functional as F
from transformers import XGLMTokenizer, XGLMForCausalLM

tokenizer = XGLMTokenizer.from_pretrained("facebook/xglm-564M")
model = XGLMForCausalLM.from_pretrained("facebook/xglm-564M")

data_samples = {
    'en': [
        {
            "premise": "I wanted to conserve energy.",
            "choice1": "I swept the floor in the unoccupied room.",
            "choice2": "I shut off the light in the unoccupied room.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "The flame on the candle went out.",
            "choice1": "I blew on the wick.",
            "choice2": "I put a match to the wick.",
            "question": "cause",
            "label": "0"
        }
    ],
    'zh': [
        {
            "premise": "我想节约能源。",
            "choice1": "我在空着的房间里扫了地板。",
            "choice2": "我把空房间里的灯关了。",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "蜡烛上的火焰熄灭了。",
            "choice1": "我吹灭了灯芯。",
            "choice2": "我把一根火柴放在灯芯上。",
            "question": "cause",
            "label": "0"
        }
    ]
}

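def get_logprobs(prompt):
    """Return the log-probability of each token in `prompt` given the tokens before it."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # Logits at position i predict token i + 1, so the targets are the input shifted left by one.
    input_ids, output_ids = inputs["input_ids"], inputs["input_ids"][:, 1:]
    with torch.no_grad():  # inference only, so no gradients are needed
        outputs = model(**inputs)  # the loss is not used, so no labels are passed
    logits = outputs.logits
    # The index is one position shorter than the logits, so the last logit (no target) is ignored.
    logprobs = torch.gather(F.log_softmax(logits, dim=2), 2, output_ids.unsqueeze(2))
    return logprobs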

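def COPA_eval(prompt, alternative1, alternative2):
    # Score each alternative by the summed token log-probabilities of premise + alternative.
    lprob1 = get_logprobs(prompt + "\n" + alternative1).sum()
    lprob2 = get_logprobs(prompt + "\n" + alternative2).sum()
    # Return the 0-based index of the higher-scoring alternative.
    return 0 if lprob1 > lprob2 else 1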

for lang in data_samples:
    for idx, example in enumerate(data_samples[lang]):
        predict = COPA_eval(example["premise"], example["choice1"], example["choice2"])
        # Print the example id, the predicted index (int), and the gold label (string).
        print(f'{lang}-{idx}', predict, example['label'])
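
The loop above prints each prediction next to its gold label. One way to summarize such output is a plain accuracy score, sketched below. The helper `accuracy` is hypothetical and not part of the original example. It assumes gold labels are 0-based and stored as strings, as in `data_samples`, while predictions are integers. The prediction values in the usage lines are made up to show the call and are not real model output.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not predictions:
        return 0.0
    # Gold labels are strings ("0" or "1") and predictions are ints, so compare as ints.
    correct = sum(int(p) == int(g) for p, g in zip(predictions, labels))
    return correct / len(predictions)


# Hypothetical predictions, used only to illustrate the call.
example_predictions = [1, 1]
example_labels = ["1", "0"]
print(accuracy(example_predictions, example_labels))
```

Converting both sides to `int` avoids the silent mismatch that a direct comparison between `1` and `"1"` would produce.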