Llama3-Chinese-Chat

❗️❗️❗️NOTICE: The main branch contains the instructions for Llama3-8B-Chinese-Chat-v2.1. If you want to use or reproduce our Llama3-8B-Chinese-Chat-v1, please refer to the v1 branch; if you want to use or reproduce our Llama3-8B-Chinese-Chat-v2, please refer to the v2 branch.

❗️❗️❗️NOTICE: For optimal performance, we refrain from fine-tuning the model's identity. Thus, inquiries such as "Who are you" or "Who developed you" may yield random responses that are not necessarily accurate.

Updates

🚀🚀🚀 [May 6, 2024] We now introduce Llama3-8B-Chinese-Chat-v2.1! Compared to v1, the training dataset of v2.1 is 5x larger (~100K preference pairs), and it exhibits significant enhancements, especially in roleplay, function calling, and math capabilities! Compared to v2, v2.1 surpasses v2 in math and is less prone to including English words in Chinese responses. The training dataset of Llama3-8B-Chinese-Chat-v2.1 will be released soon. If you love our Llama3-8B-Chinese-Chat-v1 or v2, you won't want to miss out on Llama3-8B-Chinese-Chat-v2.1!
🔥 We provide the official Ollama model for the q4_0 GGUF version of Llama3-8B-Chinese-Chat-v2.1 at wangshenzhi/llama3-8b-chinese-chat-ollama-q4! Run the following command for quick use of this model: ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-q4.
🔥 We provide the official Ollama model for the q8_0 GGUF version of Llama3-8B-Chinese-Chat-v2.1 at wangshenzhi/llama3-8b-chinese-chat-ollama-q8! Run the following command for quick use of this model: ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-q8.
🔥 We provide the official Ollama model for the f16 GGUF version of Llama3-8B-Chinese-Chat-v2.1 at wangshenzhi/llama3-8b-chinese-chat-ollama-fp16! Run the following command for quick use of this model: ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-fp16.
🔥 We provide the official q4_0 GGUF version of Llama3-8B-Chinese-Chat-v2.1 at https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-4bit!
🔥 We provide the official q8_0 GGUF version of Llama3-8B-Chinese-Chat-v2.1 at https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-8bit!
🔥 We provide the official f16 GGUF version of Llama3-8B-Chinese-Chat-v2.1 at https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-f16!
🌟 If you are in China, you can download our models from https://hf-mirror.com/shenzhi-wang/Llama3-8B-Chinese-Chat.

<details> <summary><b>Updates for Llama3-8B-Chinese-Chat-v2 [CLICK TO EXPAND]</b></summary>

🔥 Llama3-8B-Chinese-Chat-v2's link: https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat/tree/v2
🔥 We provide the official f16 GGUF version of Llama3-8B-Chinese-Chat-v2 at https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-f16/tree/v2!
🔥 We provide the official 8bit-quantized GGUF version of Llama3-8B-Chinese-Chat-v2 at https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-8bit/tree/v2!
🔥 We provide an online interactive demo for Llama3-8B-Chinese-Chat-v2 (https://huggingface.co/spaces/llamafactory/Llama3-8B-Chinese-Chat). Have fun with our latest model!
🚀🚀🚀 [Apr. 29, 2024] We now introduce Llama3-8B-Chinese-Chat-v2! Compared to v1, the training dataset of v2 is 5x larger (~100K preference pairs), and it exhibits significant enhancements, especially in roleplay, function calling, and math capabilities! If you love our Llama3-8B-Chinese-Chat-v1, you won't want to miss out on Llama3-8B-Chinese-Chat-v2!

</details> <details> <summary><b>Updates for Llama3-8B-Chinese-Chat-v1 [CLICK TO EXPAND]</b></summary>

🔥 Llama3-8B-Chinese-Chat-v1's link: https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat/tree/v1
🔥 We provide the official Ollama model for the f16 GGUF version of Llama3-8B-Chinese-Chat-v1 at wangshenzhi/llama3-8b-chinese-chat-ollama-f16! Run the following command for quick use of this model: ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-fp16.
🔥 We provide the official Ollama model for the 8bit-quantized GGUF version of Llama3-8B-Chinese-Chat-v1 at wangshenzhi/llama3-8b-chinese-chat-ollama-q8! Run the following command for quick use of this model: ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-q8.
🔥 We provide the official f16 GGUF version of Llama3-8B-Chinese-Chat-v1 at shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-f16-v1!
🔥 We provide the official 8bit-quantized GGUF version of Llama3-8B-Chinese-Chat-v1 at shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-8bit-v1!
🌟 If you are in China, you can download our v1 model from our Gitee AI repository.

</details>

Model Summary

Llama3-8B-Chinese-Chat is an instruction-tuned language model for Chinese and English users, built upon the Meta-Llama-3-8B-Instruct model, with abilities such as roleplaying and tool use.

Developed by: Shenzhi Wang (王慎执) and Yaowei Zheng (郑耀威)

License: Llama-3 License
Base Model: Meta-Llama-3-8B-Instruct
Model Size: 8.03B
Context length: 8K

1. Introduction

This is the first model specifically fine-tuned for Chinese and English users through ORPO [1], based on the Meta-Llama-3-8B-Instruct model.

Compared to the original Meta-Llama-3-8B-Instruct model, our Llama3-8B-Chinese-Chat-v1 model significantly reduces the issues of "Chinese questions with English answers" and the mixing of Chinese and English in responses.

Compared to Llama3-8B-Chinese-Chat-v1, our Llama3-8B-Chinese-Chat-v2 model is trained on significantly more data (100K preference pairs instead of 20K), which substantially improves performance, especially in roleplay, tool use, and math.

[1] Hong, Jiwoo, Noah Lee, and James Thorne. "Reference-free Monolithic Preference Optimization with Odds Ratio." arXiv preprint arXiv:2403.07691 (2024).
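
As a rough illustration of the objective in [1], the sketch below computes the ORPO loss for a single preference pair from length-normalized log-likelihoods. The function names and inputs are hypothetical and are not taken from LLaMA-Factory; `lam` plays the role of the orpo beta ($\lambda$) listed in the training details.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp).

    avg_logp is the average per-token log-probability of a response,
    so it must be negative (p < 1).
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.05) -> float:
    """SFT loss on the chosen response plus lam times the odds-ratio penalty."""
    # Supervised term: negative log-likelihood of the chosen response.
    sft = -avg_logp_chosen
    # Log odds ratio between the chosen and the rejected response.
    log_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    odds_ratio_term = math.log1p(math.exp(-log_ratio))
    return sft + lam * odds_ratio_term


# The loss is lower when the chosen response is the more likely one.
print(orpo_loss(-0.5, -2.0), orpo_loss(-2.0, -0.5))
```

With a small `lam` such as 0.05, the supervised term dominates and the odds-ratio term acts as a mild penalty on the rejected response.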

Training framework: LLaMA-Factory.

Training details:

epochs: 2
learning rate: 3e-6
learning rate scheduler type: cosine
warmup ratio: 0.1
cutoff len (i.e., context length): 8192
orpo beta (i.e. $\lambda$ in the ORPO paper): 0.05
global batch size: 128
fine-tuning type: full parameters
optimizer: paged_adamw_32bit
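
To give a feel for what these settings imply, the sketch below works out the approximate number of optimizer steps and the learning rate at a given step. It assumes exactly 100,000 preference pairs (the card only says "~100K") and the common linear-warmup-then-cosine-decay shape; the real step counts depend on the released dataset and on the training framework's own rounding.

```python
import math

EPOCHS = 2
PEAK_LR = 3e-6
WARMUP_RATIO = 0.1
GLOBAL_BATCH_SIZE = 128
NUM_PAIRS = 100_000  # assumption: the "~100K" figure taken literally

steps_per_epoch = math.ceil(NUM_PAIRS / GLOBAL_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
print(lr_at(warmup_steps), lr_at(total_steps // 2), lr_at(total_steps))
```

Under these assumptions, training runs for roughly 1.5K optimizer steps, of which about the first tenth is warmup.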

2. Model Download

We provide various versions of our Llama3-8B-Chinese-Chat model, including:

Llama3-8B-Chinese-Chat (BF16): https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat
Ollama model for Llama3-8B-Chinese-Chat (4bit-quantized GGUF): wangshenzhi/llama3-8b-chinese-chat-ollama-q4
Ollama model for Llama3-8B-Chinese-Chat (8bit-quantized GGUF): wangshenzhi/llama3-8b-chinese-chat-ollama-q8
Ollama model for Llama3-8B-Chinese-Chat (f16 GGUF): wangshenzhi/llama3-8b-chinese-chat-ollama-fp16
Llama3-8B-Chinese-Chat (4bit-quantized GGUF): https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-4bit
Llama3-8B-Chinese-Chat (8bit-quantized GGUF): https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-8bit
Llama3-8B-Chinese-Chat (f16 GGUF): https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-f16
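
If you prefer to fetch a Hugging Face repository from a script, something along these lines should work. It is a minimal sketch that assumes the `huggingface_hub` package is installed; the short variant keys (`"bf16"`, `"gguf-q8"`, and so on) are hypothetical labels chosen for this example, while the repository IDs are the ones listed in the Updates section.

```python
# Map illustrative variant labels to the Hugging Face repositories named in this README.
REPOS = {
    "bf16": "shenzhi-wang/Llama3-8B-Chinese-Chat",
    "gguf-q4": "shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-4bit",
    "gguf-q8": "shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-8bit",
    "gguf-f16": "shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-f16",
}


def download(variant: str, local_dir: str) -> str:
    """Download every file of the chosen repository and return the local path."""
    # Imported lazily so the mapping above is usable without the package.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(repo_id=REPOS[variant], local_dir=local_dir)


# Example (downloads several GB):
# path = download("gguf-q8", "./models/llama3-8b-chinese-chat-q8")
```

The download call is left commented out because the repositories are large.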

3. Usage

Quick use via Ollama

The fastest way to try our Llama3-8B-Chinese-Chat-v2.1 model is through Ollama. Install Ollama first, and then run one of the following commands:

ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-q4  # to use the Ollama model for our 4bit-quantized GGUF Llama3-8B-Chinese-Chat-v2.1
# or
ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-q8  # to use the Ollama model for our 8bit-quantized GGUF Llama3-8B-Chinese-Chat-v2.1
# or
ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-fp16  # to use the Ollama model for our FP16 GGUF Llama3-8B-Chinese-Chat-v2.1
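
Once a model has been pulled, it can also be queried programmatically through Ollama's local REST API. The following is a minimal sketch using only the Python standard library; it assumes the Ollama server is running on its default address (`http://localhost:11434`), and the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: default local Ollama address
DEFAULT_MODEL = "wangshenzhi/llama3-8b-chinese-chat-ollama-q4"


def build_chat_request(prompt: str, model: str = DEFAULT_MODEL) -> dict:
    """Build a non-streaming chat payload with a single user message."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(prompt: str, model: str = DEFAULT_MODEL) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]


# Example (requires a running Ollama server):
# print(chat("写一首诗吧"))
```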

To use the BF16 version of our Llama3-8B-Chinese-Chat model

You can run the following Python script:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "shenzhi-wang/Llama3-8B-Chinese-Chat"

# Load the tokenizer and the model. torch_dtype="auto" uses the dtype stored in
# the checkpoint; device_map="auto" requires the accelerate package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "写一首诗吧"},
]

# Render the conversation with the model's chat template and tokenize it.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Drop the prompt tokens and decode only the newly generated ones.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
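
For intuition about what `apply_chat_template` produces, the sketch below renders a conversation in the Llama 3 Instruct prompt format by hand. It assumes this model keeps the same special tokens as its base model (the GGUF example's stop tokens suggest so); the tokenizer's own template remains authoritative, and `render_llama3_prompt` is a hypothetical helper for illustration only.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages as a Llama 3 Instruct style prompt string."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


print(render_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "写一首诗吧"},
]))
```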

To use the GGUF version of our Llama3-8B-Chinese-Chat model

First, download the 8bit-quantized GGUF model or f16 GGUF model to your local machine.

Then, run the following Python script:

from llama_cpp import Llama

model = Llama(
    "/Your/Path/To/Llama3-8B-Chinese-Chat/GGUF/Model",
    verbose=False,
    n_gpu_layers=-1,
)

system_prompt = "You are a helpful assistant."

def generate_response(_model, _messages, _max_tokens=8192):
    # Run one chat completion and return only the assistant's reply text.
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=_max_tokens,
    )["choices"][0]["message"]["content"]
    return _output

# The following are some examples

messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {"role": "user", "content": "写一首诗吧"},
]

print(generate_response(_model=model, _messages=messages))
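
For multi-turn conversations, the message list has to carry the earlier turns. One possible way to manage that is sketched below; `chat_turn` is a hypothetical helper that works with any object exposing `create_chat_completion`, such as the `Llama` instance created above.

```python
def chat_turn(model, history, user_text, max_tokens=1024):
    """Append a user turn, query the model, then append and return the reply."""
    history.append({"role": "user", "content": user_text})
    reply = model.create_chat_completion(
        history,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=max_tokens,
    )["choices"][0]["message"]["content"]
    # Store the reply so the next turn sees the full conversation.
    history.append({"role": "assistant", "content": reply})
    return reply


# Example (requires a loaded llama_cpp.Llama instance named `model`):
# history = [{"role": "system", "content": "You are a helpful assistant."}]
# print(chat_turn(model, history, "写一首诗吧"))
# print(chat_turn(model, history, "再写一首关于春天的"))
```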

4. Reproduce

To reproduce Llama3-8B-Chinese-Chat-v2.1 (to reproduce Llama3-8B-Chinese-Chat-v1, please refer to the v1 branch):

git clone https://github.com/hiyouga/LLaMA-Factory.git
git -C LLaMA-Factory reset --hard 25aeaae51b6d08a747e222bbcb27e75c4d56a856    # For Llama3-8B-Chinese-Chat-v1: 836ca0558698206bbf4e3b92533ad9f67c9f9864

cd