# IndicTrans2 HF Compatible Models

[![colab link](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/IndicTrans2/blob/main/huggingface_interface/colab_inference.ipynb)

In this section, we describe how to convert our [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2) models, which were originally trained with [fairseq](https://github.com/facebookresearch/fairseq), to [HuggingFace transformers](https://huggingface.co/docs/transformers/index) and use them for inference. Our scripts for HuggingFace compatible models are adapted from the [M2M100 repository](https://github.com/huggingface/transformers/tree/main/src/transformers/models/m2m_100).

> Note: The IndicTrans2 tokenizer for HF compatible IndicTrans2 models has been migrated to [IndicTransToolkit](https://github.com/VarunGumma/IndicTransToolkit) and will be maintained separately there from now on. It is installed automatically when you run the `install.sh` script in `huggingface_interface`.

### Setup

To get started, follow these steps to set up the environment:

```bash
# Clone the github repository and navigate to the project directory.
git clone https://github.com/AI4Bharat/IndicTrans2
cd IndicTrans2/huggingface_interface

# Install all the dependencies and requirements associated with the project for running HF compatible models.
source install.sh
```

> Note: The `install.sh` script in this directory is specifically for running HF compatible models for inference.

### Converting

To convert a fairseq checkpoint to a PyTorch checkpoint compatible with HuggingFace Transformers, use the following command:

```bash
python3 convert_indictrans_checkpoint_to_pytorch.py --fairseq_path <fairseq_checkpoint_best.pt> --pytorch_dump_folder_path <hf_output_dir>
```

- `<fairseq_checkpoint_best.pt>`: path to the fairseq `checkpoint_best.pt` that needs to be converted to the HF compatible format
- `<hf_output_dir>`: path to the output directory where the HF compatible model will be saved

### Models

| Model                            | πŸ€— HuggingFace Checkpoints                                                                                        |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| En-Indic                         | [ai4bharat/indictrans2-en-indic-1B](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B)                     |
| Indic-En                         | [ai4bharat/indictrans2-indic-en-1B](https://huggingface.co/ai4bharat/indictrans2-indic-en-1B)                     |
| Distilled En-Indic               | [ai4bharat/indictrans2-en-indic-dist-200M](https://huggingface.co/ai4bharat/indictrans2-en-indic-dist-200M)       |
| Distilled Indic-En               | [ai4bharat/indictrans2-indic-en-dist-200M](https://huggingface.co/ai4bharat/indictrans2-indic-en-dist-200M)       |
| Indic-Indic (Stitched)           | [ai4bharat/indictrans2-indic-indic-1B](https://huggingface.co/ai4bharat/indictrans2-indic-indic-1B)               |
| Distilled Indic-Indic (Stitched) | [ai4bharat/indictrans2-indic-indic-dist-320M](https://huggingface.co/ai4bharat/indictrans2-indic-indic-dist-320M) |
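
These checkpoints are already in the HF format, so no conversion step is needed for them; they can be loaded directly with `transformers`. A minimal loading sketch, using the distilled En-Indic checkpoint from the table as an example:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Any checkpoint from the table above works here; the distilled En-Indic model is just an example.
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)
```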

### Inference

With the conversion complete, you can now perform inference using HuggingFace Transformers.

You can start with the provided `example.py` script and customize it for your specific translation use case:

```bash
python3 example.py
```

Feel free to modify the `example.py` script to suit your translation needs.
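
For reference, here is a rough sketch of the kind of pipeline `example.py` implements. The preprocessing and postprocessing utilities live in [IndicTransToolkit](https://github.com/VarunGumma/IndicTransToolkit), and the exact import path and method names (`IndicProcessor`, `preprocess_batch`, `postprocess_batch`) may vary across toolkit versions, so treat the snippet below as a sketch and defer to `example.py` for the authoritative usage:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # import path may differ across toolkit versions

model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)

ip = IndicProcessor(inference=True)
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
sentences = [
    "This is a test sentence.",
    "IndicTrans2 supports all 22 scheduled Indian languages.",
]

# Attach language tags and normalize the inputs before tokenization.
batch = ip.preprocess_batch(sentences, src_lang=src_lang, tgt_lang=tgt_lang)
inputs = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, num_beams=5, num_return_sequences=1, max_length=256)

decoded = tokenizer.batch_decode(generated, skip_special_tokens=True, clean_up_tokenization_spaces=True)
# Undo the script normalization applied during preprocessing.
translations = ip.postprocess_batch(decoded, lang=tgt_lang)
print(translations)
```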

### Fine-tuning with LoRA

Before fine-tuning the IndicTrans2 models, you will need to restructure your training data into the following format.

```
en-indic-exp
β”œβ”€β”€ train
β”‚   β”œβ”€β”€ eng_Latn-asm_Beng
β”‚   β”‚   β”œβ”€β”€ train.eng_Latn
β”‚   β”‚   └── train.asm_Beng
β”‚   β”œβ”€β”€ eng_Latn-ben_Beng
β”‚   β”‚   └── ...
β”‚   └── {src_lang}-{tgt_lang}
β”‚       β”œβ”€β”€ train.{src_lang}
β”‚       └── train.{tgt_lang}
└── dev
    β”œβ”€β”€ eng_Latn-asm_Beng
    β”‚   β”œβ”€β”€ dev.eng_Latn
    β”‚   └── dev.asm_Beng
    β”œβ”€β”€ eng_Latn-ben_Beng
    β”‚   └── ...
    └── {src_lang}-{tgt_lang}
        β”œβ”€β”€ dev.{src_lang}
        └── dev.{tgt_lang}
```
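
For illustration, the following sketch materializes this layout for a toy `eng_Latn-hin_Deva` pair; the language pair and the sentences are placeholders, and each `train.*`/`dev.*` file is assumed to hold one sentence per line, with the source and target files aligned line by line:

```python
from pathlib import Path

# Toy parallel data; replace with your own corpora.
# The eng_Latn-hin_Deva pair is only an illustration of the layout above.
pairs = {
    ("eng_Latn", "hin_Deva"): {
        "train": [("Hello.", "नमसΰ₯ΰ€€ΰ₯‡ΰ₯€"), ("Thank you.", "ΰ€§ΰ€¨ΰ₯ΰ€―ΰ€΅ΰ€Ύΰ€¦ΰ₯€")],
        "dev": [("Good morning.", "ΰ€Έΰ₯ΰ€ͺΰ₯ΰ€°ΰ€­ΰ€Ύΰ€€ΰ₯€")],
    }
}

root = Path("en-indic-exp")
for (src_lang, tgt_lang), splits in pairs.items():
    for split, examples in splits.items():
        pair_dir = root / split / f"{src_lang}-{tgt_lang}"
        pair_dir.mkdir(parents=True, exist_ok=True)
        # Line i of the source file is aligned with line i of the target file.
        (pair_dir / f"{split}.{src_lang}").write_text(
            "\n".join(src for src, _ in examples) + "\n", encoding="utf-8"
        )
        (pair_dir / f"{split}.{tgt_lang}").write_text(
            "\n".join(tgt for _, tgt in examples) + "\n", encoding="utf-8"
        )
```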

Once your data is ready in the format specified above, use the following command:

```bash
bash train_lora.sh <data_dir> <model_name> <output_dir> <direction> <src_lang_list> <tgt_lang_list>
```

We recommend referring to `train_lora.sh` for the default fine-tuning arguments. Please note that the specified hyperparameters may not be optimal and might require tuning for your use case.

### Inference with LoRA

You can load the LoRA adapters with the base model for inference by modifying the model initialization in the `example.py` script:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PeftConfig, PeftModel

base_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"  # change this as per your use case
lora_ckpt_dir = "<lora_adapter_dir>"  # directory with the LoRA adapters produced by train_lora.sh

base_model = AutoModelForSeq2SeqLM.from_pretrained(base_ckpt_dir, trust_remote_code=True)
lora_model = PeftModel.from_pretrained(base_model, lora_ckpt_dir)
```
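
Continuing from the snippet above, `lora_model` can be passed through the same tokenization and generation steps as the base model. If your `peft` version supports it, you can also merge the adapters into the base weights so that inference runs without the adapter indirection; this is a sketch rather than part of the original script, and the output path is hypothetical:

```python
# Fold the LoRA weights into the base model; the merged model behaves like a
# plain AutoModelForSeq2SeqLM and can be saved or used for generation directly.
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("indictrans2-en-indic-1B-lora-merged")
```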

> Note: Please feel free to open issues on the GitHub repo for any queries or issues.

### Citation

```bibtex
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}
```