X-ALMA-13B-Pretrain

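X-ALMA-13B-Pretrain Project Overview

Background

X-ALMA-13B-Pretrain builds on ALMA-R. Using a modular architecture and a purpose-designed training recipe, it extends language support from the original 6 languages to 50. The project provides a multilingual pretrained base model, released publicly on Huggingface.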

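Datasets and Base Model

Training draws on several large datasets, including OSCAR-2301, NLLB, and OPUS-100. Together they cover 50 languages, among them English, German, Chinese, Korean, and Arabic. The model starts from the pretrained ALMA-13B model and is trained further on this data to improve translation quality.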

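Key Features

Multilingual support: X-ALMA-13B-Pretrain supports translation tasks across 50 languages, including Chinese, Japanese, and French.
Modular design: a plug-and-play architecture uses language-specific modules to improve translation quality, and users can load only the modules they need.
Adaptive rejection: an adaptive mechanism that improves translation accuracy and quality.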

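Model Release and Usage

All X-ALMA checkpoints are published on Huggingface. Users can pick the variant that fits their needs:

X-ALMA: the complete model with all modules.
X-ALMA-13B-Pretrain: the multilingual pretrained base model.
Language-group modules: each language group has its own download, X-ALMA-13B-Group1 through X-ALMA-13B-Group8, one per group of languages.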
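
The group numbering maps directly onto repository names. The following is a minimal sketch, assuming the `haoranxu/X-ALMA-13B-Group{n}` naming pattern that the loading snippets in this document use; `group_repo_id` is a hypothetical helper name, not part of any library.

```python
def group_repo_id(group_id: int, owner: str = "haoranxu") -> str:
    """Return the Huggingface repo id for one language-group checkpoint."""
    # Only groups 1 through 8 are listed, so reject anything else early
    if group_id not in range(1, 9):
        raise ValueError(f"group_id must be an integer from 1 to 8, got {group_id!r}")
    return f"{owner}/X-ALMA-13B-Group{int(group_id)}"


print(group_repo_id(6))  # prints haoranxu/X-ALMA-13B-Group6
```

Validating the number up front gives a clear error before any download is attempted.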

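Loading the Model

Method 1: Load the merged model (recommended)

This loads a base model with the language-specific module already merged in. It is simple to use, gives reliable results, and suits most users.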

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the group number (1-8) that covers your language
group_id = "GROUP_NUMBER"
model = AutoModelForCausalLM.from_pretrained(f"haoranxu/X-ALMA-13B-Group{group_id}", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(f"haoranxu/X-ALMA-13B-Group{group_id}", padding_side='left')

# Build the translation prompt and move the token ids to the model's device
prompt = "Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Generate with beam search; gradients are not needed at inference time
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20)
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
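
The prompt in the snippet above follows a fixed pattern, and the decoded output of a causal language model normally repeats that prompt before the translation. Here is a minimal sketch of two helpers built on that layout; `build_prompt` and `extract_translation` are hypothetical names for illustration, not part of transformers or the X-ALMA release.

```python
def build_prompt(src_lang: str, tgt_lang: str, text: str) -> str:
    """Build a translation prompt in the layout used by the snippet above."""
    return f"Translate this from {src_lang} to {tgt_lang}:\n{src_lang}: {text}\n{tgt_lang}:"


def extract_translation(decoded: str, tgt_lang: str) -> str:
    """Return the text after the last '<tgt_lang>:' marker, or the whole string if absent."""
    marker = f"{tgt_lang}:"
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


prompt = build_prompt("Chinese", "English", "我爱机器翻译。")
# Stand-in string for illustration only; this is not a real model output
decoded = prompt + " I love machine translation."
print(extract_translation(decoded, "English"))  # prints I love machine translation.
```

Splitting on the marker breaks down if the translation itself contains the marker text. Slicing `generated_ids[:, input_ids.shape[1]:]` before decoding is a more robust alternative when the tensors are at hand.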

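Method 2: Load the base model and a language-specific module (recommended)

This loads the base model and the relevant language-specific module separately. It suits users who work heavily with particular languages.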

from peft import PeftModel  # requires the peft package

# Load the base model, then attach the language-specific module for the chosen group
model = AutoModelForCausalLM.from_pretrained("haoranxu/X-ALMA-13B-Pretrain", torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, f"haoranxu/X-ALMA-13B-Group{group_id}")
tokenizer = AutoTokenizer.from_pretrained(f"haoranxu/X-ALMA-13B-Group{group_id}", padding_side='left')

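Method 3: Load all language modules (requires large GPU memory)

This suits users with ample compute resources who want to switch quickly between languages.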

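# modeling_xalma is not part of transformers; it is assumed to be a local module from the X-ALMA code
from modeling_xalma import XALMAForCausalLM
model = XALMAForCausalLM.from_pretrained("haoranxu/X-ALMA", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("haoranxu/X-ALMA", padding_side='left')

# Translate, reusing input_ids from Method 1; lang names the language being translated (here Chinese)
generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, lang="zh")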
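
With all modules loaded, switching languages comes down to changing the `lang` argument on each call. The sketch below loops over several prompts on that basis. `translate_by_lang` is a hypothetical helper; it assumes `model.generate` accepts `lang` as in the call above and that the tokenizer follows the usual transformers interface.

```python
def translate_by_lang(model, tokenizer, prompts_by_lang, num_beams=5, max_new_tokens=20):
    """Run one generate call per language code and collect the decoded outputs."""
    results = {}
    for lang, prompt in prompts_by_lang.items():
        # Tokenize and place the ids on the same device as the model
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        generated_ids = model.generate(
            input_ids=input_ids,
            num_beams=num_beams,
            max_new_tokens=max_new_tokens,
            lang=lang,  # assumed to pick the language module, as in the call above
        )
        # Keep the first (and only) decoded sequence for this prompt
        results[lang] = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return results
```

Passing a dictionary that maps language codes such as "zh" to prompt strings returns a dictionary of decoded outputs keyed by the same codes.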