ParroT

<div align="center"> <img width="25%" alt="ParroT" src="https://github.com/wxjiao/ParroT/assets/31032829/9893aba1-7ea3-4c76-a995-9b12aff44950"> <h2> ParroT: Translating During Chat Using Large Language Models tuned with Human Translation and Feedback <br><br> <a href="https://arxiv.org/abs/2304.02426"> <img alt="paper link" src="https://img.shields.io/badge/Paper-arXiv-red"> </a> <a href="https://github.com/wxjiao/InstructMT"> <img alt="data link" src="https://img.shields.io/badge/Data-InstructMT-blue"> </a> </h2> </div>

:fire: Update

[2023/10/12] ParroT accepted to EMNLP 2023 (Findings)!
[2023/09/23] Fixed the streaming mode for large local datasets; it originally supported only datasets hosted on Hugging Face Datasets. In streaming mode, use --max_steps instead of --num_train_epochs, because the data is loaded as an IterableDataset.
[2023/07/14] Incorporated flash-attention into BLOOM for long-context training; observed about 20-30% speedup with other settings fixed.

[2023/06/14] Releasing detailed instruction data and scripts on @InstructMT.
The WMT22 test sets are made available.
For medium-to-small models (e.g., 7B), we recommend ZeRO2+offload rather than ZeRO3; use gradient accumulation to maximize GPU usage.
Important optimizations: preprocess_function is now 4-5x faster; DataCollatorForSeq2Seq is used for batch-wise padding, which saves 5-10% GPU usage.
Introducing ParroT-LoRA which supports saving and restarting from the checkpoints (base model and lora weights) during finetuning.
Setting the default Transformers to >= 4.28.0.dev0 directly as it merged the PR of LLaMA. With this version on Torch 1.13.1 + CUDA 11.7, we find the finetuning process could be a bit faster (~18%) than our v1.0.0 implementation.

</details>
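
Regarding the streaming-mode note above: a streamed IterableDataset has no known length, so the number of optimizer steps has to be given explicitly through --max_steps. The sketch below shows one rough way to estimate that value from an approximate example count. The function name and the numbers are illustrative assumptions and are not part of the ParroT scripts.

```python
import math


def estimate_max_steps(num_examples, per_device_batch_size, num_devices,
                       gradient_accumulation_steps, num_epochs):
    """Approximate number of optimizer steps that covers `num_epochs` passes."""
    # Examples consumed by one optimizer step across all devices.
    effective_batch = (per_device_batch_size * num_devices
                       * gradient_accumulation_steps)
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


# Hypothetical setup: 200,000 examples, 8 GPUs, batch size 4 per GPU,
# 8 gradient accumulation steps, 3 epochs.
max_steps = estimate_max_steps(200_000, 4, 8, 8, 3)
print(max_steps)
```

If preprocessing filters or merges examples, the number of training sequences differs from the raw example count, so treat the result as an estimate.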

:star: Highlight :star:

:hugs: Try the pretrained models at HuggingFace model hub:
- [Alpaca-7b], [ParroT-7b], [ParroT-Hint-7b]
- [ParroT-Hint-7b-lora] based on [LLaMA-7b]

Parrots are smart birds that can respond to simple commands or questions. The question is whether they're just mimicking, or really intelligent enough to communicate with humans. This is similar to what we currently speculate about LLMs.

Promoting the good is essential, but punishing the evil is also necessary to ensure that goodness prevails. Similarly, aligning LLMs with human feedback means both learning from correct examples and learning to distinguish erroneous ones.

Large language models (LLMs) like ChatGPT and GPT-4 have exhibited remarkable abilities on a wide range of natural language processing (NLP) tasks, including various machine translation abilities accomplished during chat. However, these models are only accessible through restricted APIs, which creates barriers to new research and advancements in the field. Therefore, we propose the ParroT framework to enhance and regulate the translation abilities during chat based on open-sourced LLMs (e.g., LLaMA, Bloomz) and human-written translation and evaluation data. Specifically, ParroT reformulates translation data into the instruction-following style and introduces a “Hint” field for incorporating extra requirements to regulate the translation process.

<div align="center"> <img width="60%" alt="LLMs-MT" src="https://github.com/wxjiao/ParroT/assets/31032829/bc791aa5-1c79-4ad7-bbee-f361a3b3009a"> <p class="image-caption">Figure 1: Framework of ParroT. Hints are (optional) extra requirements to regulate the translation process.</p> </div>

Configurations

Datasets

Train Data: data/data_alpaca_hf.json, data_parrot_hf.json
- You can also use Alpaca data by GPT-4: data/data_alpaca_gpt4_hf_en.json, data/data_alpaca_gpt4_hf_zh.json
- Find more data details and resources in [InstructMT]
Test Data: Flores subsets, WMT22 test sets
Instruction-following format:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
We are translating the following sentences from Chinese to English.
    
### Input:
检查情况显示，市场销售的粮油、肉类、水果、蔬菜、蛋奶等生活必需品供应充足，商品价格基本稳定，未发现严重违法违规行为，市场经营秩序总体平稳。

### Hint: A translation with major accuracy/mistranslation errors could be

### Response:The results of the inspection indicate the sufficient supply of living necessities <v>on marketing</v> 
including cereals and oils, meat, fruits, vegetables, eggs and milk, and the basically stabilized commodity price. 
The inspection hasn’t found serious violation of laws and regulations. The market order is stable on an overall basis.
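
The layout of the example above can be assembled programmatically. The following is a minimal sketch that reproduces the section order shown (Instruction, Input, optional Hint, Response). The helper name `build_prompt` is hypothetical, and omitting the Hint section entirely when no hint is given is an assumption; the training scripts may format prompts differently.

```python
PREAMBLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_prompt(instruction, source_text, hint=None):
    """Assemble an instruction-following prompt with an optional Hint field."""
    parts = [
        PREAMBLE,
        f"### Instruction:\n{instruction}",
        f"### Input:\n{source_text}",
    ]
    # Assumption: the Hint section is simply left out when no hint is given.
    if hint:
        parts.append(f"### Hint: {hint}")
    # The model is expected to continue the text after this marker.
    parts.append("### Response:")
    return "\n\n".join(parts)


prompt = build_prompt(
    "We are translating the following sentences from Chinese to English.",
    "市场经营秩序总体平稳。",
    hint="A translation with major accuracy/mistranslation errors could be",
)
print(prompt)
```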

Environment

We develop ParroT based on open-sourced LLMs (e.g., LLaMA, Bloomz) with HuggingFace's transformers library.

Framework Versions:

Python 3.8.12
PyTorch 1.13.1+cu117
Transformers (git+https://github.com/huggingface/transformers.git)
Peft (git+https://github.com/huggingface/peft.git)
Flash-attn
Triton 2.0.0.dev20221202
Other requirements

pip install -r requirements.txt

Data Format Conversion

Convert the regular bilingual sentence pairs into Alpaca data format:

python3 scripts/convert_pair_to_alpaca.py \
    -s zh -t en \
    -if scripts/instruct_follow.txt \
    -sf data/train.zh-en.zh.txt \
    -tf data/train.zh-en.en.txt \
    -of data/train_alp.json
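
For orientation, the Alpaca data format stores each example as a record with `instruction`, `input`, and `output` fields. The sketch below shows roughly what such a conversion does for a list of sentence pairs. It is not the code of scripts/convert_pair_to_alpaca.py: the function name, the `{src}`/`{tgt}` placeholder syntax, and the language-name table are assumptions made for illustration.

```python
import json

# Hypothetical mapping from language codes to names used in instructions.
LANG_NAMES = {"zh": "Chinese", "en": "English"}


def pairs_to_alpaca(src_lines, tgt_lines, src_lang, tgt_lang, template):
    """Turn parallel sentences into Alpaca-style records."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must have the same number of lines")
    instruction = template.format(src=LANG_NAMES[src_lang], tgt=LANG_NAMES[tgt_lang])
    return [
        {"instruction": instruction, "input": src.strip(), "output": tgt.strip()}
        for src, tgt in zip(src_lines, tgt_lines)
    ]


records = pairs_to_alpaca(
    ["你好，世界。\n"],
    ["Hello, world.\n"],
    "zh",
    "en",
    "We are translating the following sentences from {src} to {tgt}.",
)
# ensure_ascii=False keeps the Chinese text readable in the JSON output.
print(json.dumps(records, ensure_ascii=False, indent=2))
```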

Convert the Alpaca data format to the training data format here:

python3 scripts/convert_alpaca_to_hf.py \
    -i data/train_alp.json \
    -o data/train_alp_hf.json

Finetune

We modify the language modeling example script in transformers for finetuning, i.e., run_clm.py with the built-in HuggingFace Trainer, so it should be easy to get started if you are familiar with run_clm.py. This script also supports data streaming, which might be helpful for handling larger