---
license: apache-2.0
datasets:
- Helsinki-NLP/opus_paracrawl
- turuta/Multi30k-uk
language:
- uk
- en
metrics:
- bleu
library_name: peft
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
tags:
- translation
model-index:
- name: Dragoman
  results:
  - task:
      type: translation
      name: English-Ukrainian Translation
    dataset:
      type: facebook/flores
      name: FLORES-101
      config: eng_Latn-ukr_Cyrl
      split: devtest
    metrics:
    - type: bleu
      value: 32.34
      name: Test BLEU
widget:
- text: "[INST] who holds this neighborhood? [/INST]"
---
# Dragoman: English-Ukrainian Machine Translation Model
## Model Description
Dragoman is a sentence-level, state-of-the-art (SOTA) English-Ukrainian translation model. It is trained with a two-phase pipeline: pretraining on a cleaned [Paracrawl](https://huggingface.co/datasets/Helsinki-NLP/opus_paracrawl) dataset, followed by an unsupervised data selection phase on [turuta/Multi30k-uk](https://huggingface.co/datasets/turuta/Multi30k-uk).
This two-phase data cleaning and data selection approach achieves SOTA performance on the FLORES-101 English-Ukrainian devtest subset, with a **BLEU** score of `32.34`.
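The data selection phase scores candidate training pairs by the perplexity the model assigns to them and keeps only the most predictable examples. The snippet below is a minimal illustrative sketch of that idea, not the actual training pipeline (which uses k-fold perplexity filtering; see the training code linked below). The checkpoint name, prompt format, and selection ratio here are assumptions.
```python
# Illustrative sketch of perplexity-based data selection; the actual
# pipeline (k-fold perplexity filtering) lives in the training repository.
# The checkpoint, prompt format, and selection ratio are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "mistralai/Mistral-7B-v0.1"  # in practice: the phase-one finetuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)
model.eval()


def perplexity(text: str) -> float:
    """Perplexity of one formatted training example under the model."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()


pairs = [("a dog runs in the park", "собака біжить у парку")]  # toy data
scored = sorted((perplexity(f"[INST] {en} [/INST] {uk}"), en, uk) for en, uk in pairs)
# Keep the examples the model already finds most predictable.
selected = scored[: max(1, int(0.25 * len(scored)))]
```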
## Model Details
- **Developed by:** Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov
- **Model type:** Translation model
- **Language(s):**
  - Source Language: English
  - Target Language: Ukrainian
- **License:** Apache 2.0
## Model Use Cases
We designed this model for sentence-level English -> Ukrainian translation.
Please be aware that performance on multi-sentence texts is not guaranteed; for longer inputs, see the sentence-splitting sketch after the example below.
### Running the model
```python
# pip install bitsandbytes transformers peft torch
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Adapter config; it records the base model the adapter was trained on.
config = PeftConfig.from_pretrained("lang-uk/dragoman")

# 4-bit NF4 quantization so the 7B base model fits on a single consumer GPU.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=quant_config
)
# Attach the Dragoman LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(model, "lang-uk/dragoman").to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1", use_fast=False, add_bos_token=False
)

# Model input must follow the [INST] ... [/INST] format.
input_text = "[INST] who holds this neighborhood? [/INST]"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, num_beams=10)
print(tokenizer.decode(outputs[0]))
```
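Because the model is trained on single sentences, longer passages are best split into sentences and translated one at a time. Below is a minimal sketch reusing `model` and `tokenizer` from the example above; NLTK is only an illustrative choice of sentence splitter, not a dependency of this model.
```python
# Sketch: translate a multi-sentence text one sentence at a time,
# reusing `model` and `tokenizer` from the example above.
# NLTK is an illustrative sentence splitter, not a project dependency.
import nltk

nltk.download("punkt")  # newer NLTK releases may also need "punkt_tab"
from nltk.tokenize import sent_tokenize


def translate_sentence(sentence: str) -> str:
    prompt = f"[INST] {sentence} [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, num_beams=10, max_new_tokens=256)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # generate() echoes the prompt, so keep only the text after [/INST].
    return decoded.split("[/INST]", 1)[-1].strip()


text = "The weather is nice today. We are going for a walk."
print(" ".join(translate_sentence(s) for s in sent_tokenize(text)))
```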
### Running the model with mlx-lm on an Apple computer
We merged the Dragoman PT adapter into the base model and uploaded a quantized version to [lang-uk/dragoman-4bit](https://huggingface.co/lang-uk/dragoman-4bit).
You can run the model using [mlx-lm](https://pypi.org/project/mlx-lm/).
```
python -m mlx_lm.generate --model lang-uk/dragoman-4bit --prompt '[INST] who holds this neighborhood? [/INST]' --temp 0 --max-tokens 100
```
MLX is the recommended way of running the model on Apple computers with an M1 chip or newer.
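mlx-lm also exposes a Python API. A minimal sketch follows; the exact `generate()` keyword arguments may differ between mlx-lm releases.
```python
# Sketch: the same generation via the mlx-lm Python API on Apple Silicon.
# generate() keyword arguments may vary slightly across mlx-lm versions.
from mlx_lm import load, generate

model, tokenizer = load("lang-uk/dragoman-4bit")
prompt = "[INST] who holds this neighborhood? [/INST]"
print(generate(model, tokenizer, prompt=prompt, max_tokens=100))
```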
### Running the model with llama.cpp
We converted the Dragoman PT adapter into the [GGLA format](https://huggingface.co/lang-uk/dragoman/blob/main/ggml-adapter-model.bin).
You can download the [Mistral-7B-v0.1 base model in GGUF format](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF) (e.g., `mistral-7b-v0.1.Q4_K_M.gguf`)
and use `ggml-adapter-model.bin` from this repository like this:
```
./main -ngl 32 -m mistral-7b-v0.1.Q4_K_M.gguf --color -c 4096 --temp 0 --repeat_penalty 1.1 -n -1 -p "[INST] who holds this neighborhood? [/INST]" --lora ./ggml-adapter-model.bin
```
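Alternatively, the same base model and adapter can be driven from Python via [llama-cpp-python](https://pypi.org/project/llama-cpp-python/). A minimal sketch follows; parameter names follow llama-cpp-python and may change between releases, and the file paths are the ones downloaded above.
```python
# Sketch: the same setup via llama-cpp-python; parameter names may change
# between releases, and applying a LoRA over a quantized base model may
# behave differently from the merged checkpoint.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-v0.1.Q4_K_M.gguf",  # GGUF base model downloaded above
    lora_path="ggml-adapter-model.bin",        # Dragoman adapter from this repository
    n_ctx=4096,
    n_gpu_layers=32,
)
out = llm("[INST] who holds this neighborhood? [/INST]", temperature=0.0, max_tokens=100)
print(out["choices"][0]["text"])
```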
### Training Dataset and Resources
- Training code: [lang-uk/dragoman](https://github.com/lang-uk/dragoman)
- Cleaned Paracrawl: [lang-uk/paracrawl_3m](https://huggingface.co/datasets/lang-uk/paracrawl_3m)
- Cleaned Multi30K: [lang-uk/multi30k-extended-17k](https://huggingface.co/datasets/lang-uk/multi30k-extended-17k)
### Benchmark Results against other models on the FLORES-101 devtest set
| **Model**                  | **BLEU** $\uparrow$ | **spBLEU** | **chrF**  | **chrF++** |
|----------------------------|---------------------|------------|-----------|------------|
| **Finetuned**              |                     |            |           |            |
| Dragoman P, 10 beams       | 30.38               | 37.93      | 59.49     | 56.41      |
| Dragoman PT, 10 beams      | **32.34**           | **39.93**  | **60.72** | **57.82**  |
| **Zero shot and few shot** |                     |            |           |            |
| LLaMa-2-7B 2-shot          | 20.1                | 26.78      | 49.22     | 46.29      |
| RWKV-5-World-7B 0-shot     | 21.06               | 26.20      | 49.46     | 46.46      |
| gpt-4 10-shot              | 29.48               | 37.94      | 58.37     | 55.38      |
| gpt-4-turbo-preview 0-shot | 30.36               | 36.75      | 59.18     | 56.19      |
| Google Translate 0-shot    | 25.85               | 32.49      | 55.88     | 52.48      |
| **Pretrained**             |                     |            |           |            |
| NLLB 3B, 10 beams          | 30.46               | 37.22      | 58.11     | 55.32      |
| OPUS-MT, 10 beams          | 32.2                | 39.76      | 60.23     | 57.38      |
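The scores above can be reproduced with [sacrebleu](https://github.com/mjpost/sacrebleu) on the FLORES-101 devtest split. A minimal sketch, assuming `hypotheses` holds the model's translations and `references` the corresponding Ukrainian reference sentences:
```python
# Sketch: scoring translations with sacrebleu; `hypotheses` and `references`
# are assumed to be parallel lists of plain strings for the devtest split.
import sacrebleu

hypotheses = ["..."]  # model outputs for the FLORES-101 devtest sentences
references = ["..."]  # corresponding Ukrainian reference translations

print(sacrebleu.corpus_bleu(hypotheses, [references]).score)  # BLEU
print(sacrebleu.corpus_bleu(hypotheses, [references], tokenize="flores101").score)  # spBLEU
print(sacrebleu.corpus_chrf(hypotheses, [references]).score)  # chrF
print(sacrebleu.corpus_chrf(hypotheses, [references], word_order=2).score)  # chrF++
```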
## Citation
```
@inproceedings{paniv-etal-2024-dragoman,
title = "Setting up the Data Printer with Improved {E}nglish to {U}krainian Machine Translation",
author = "Paniv, Yurii and
Chaplynskyi, Dmytro and
Trynus, Nikita and
Kyrylov, Volodymyr",
editor = "Romanyshyn, Mariana and
Romanyshyn, Nataliia and
Hlybovets, Andrii and
Ignatenko, Oleksii",
booktitle = "Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.unlp-1.6",
pages = "41--50",
abstract = "To build large language models for Ukrainian we need to expand our corpora with large amounts of new algorithmic tasks expressed in natural language. Examples of task performance expressed in English are abundant, so with a high-quality translation system our community will be enabled to curate datasets faster. To aid this goal, we introduce a recipe to build a translation system using supervised finetuning of a large pretrained language model with a noisy parallel dataset of 3M pairs of Ukrainian and English sentences followed by a second phase of training using 17K examples selected by k-fold perplexity filtering on another dataset of higher quality. Our decoder-only model named Dragoman beats performance of previous state of the art encoder-decoder models on the FLORES devtest set.",
}
```