IndicTrans2 HF Compatible Models
In this section, we provide details on how to convert our IndicTrans2 models, which were originally trained with fairseq, to HuggingFace transformers and use them for inference. Our scripts for HuggingFace compatible models are adapted from the M2M100 repository.
Note: We have migrated the IndicTrans2 tokenizer for HF compatible IndicTrans2 models to IndicTransToolkit, and it will be maintained separately there from now on. It is automatically installed when you run the install.sh script in huggingface_interface.
Setup
To get started, follow these steps to set up the environment:
# Clone the github repository and navigate to the project directory.
git clone https://github.com/AI4Bharat/IndicTrans2
cd IndicTrans2/huggingface_interface
# Install all the dependencies and requirements associated with the project for running HF compatible models.
source install.sh
Note: The install.sh script in this directory is specifically for running HF compatible models for inference.
Converting
In order to convert the fairseq checkpoint to a PyTorch checkpoint that is compatible with HuggingFace Transformers, use the following command:
python3 convert_indictrans_checkpoint_to_pytorch.py --fairseq_path <fairseq_checkpoint_best.pt> --pytorch_dump_folder_path <hf_output_dir>
- <fairseq_checkpoint_best.pt>: path to the fairseq checkpoint_best.pt that needs to be converted to HF compatible models
- <hf_output_dir>: path to the output directory where the HF compatible models will be saved
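Once the conversion finishes, you can optionally sanity-check the dumped checkpoint by loading it with transformers. This is only a sketch: the directory name is hypothetical, and it assumes the dump folder contains (or points to) the custom modeling code required by trust_remote_code.

```python
from transformers import AutoModelForSeq2SeqLM

hf_output_dir = "indictrans2-en-indic-hf"  # the --pytorch_dump_folder_path used above (hypothetical name)

# IndicTrans2 ships custom modeling code, hence trust_remote_code=True
model = AutoModelForSeq2SeqLM.from_pretrained(hf_output_dir, trust_remote_code=True)
print(f"Loaded model with {model.num_parameters():,} parameters")
```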
Models
| Model | 🤗 HuggingFace Checkpoints |
|---|---|
| En-Indic | ai4bharat/indictrans2-en-indic-1B |
| Indic-En | ai4bharat/indictrans2-indic-en-1B |
| Distilled En-Indic | ai4bharat/indictrans2-en-indic-dist-200M |
| Distilled Indic-En | ai4bharat/indictrans2-indic-en-dist-200M |
| Indic-Indic (Stitched) | ai4bharat/indictrans2-indic-indic-1B |
| Distilled Indic-Indic (Stitched) | ai4bharat/indictrans2-indic-indic-dist-320M |
Inference
With the conversion complete, you can now perform inference using HuggingFace Transformers.
You can start with the provided example.py script and customize it for your specific translation use case:
python3 example.py
Feel free to modify the example.py script to suit your translation needs.
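For reference, below is a minimal sketch of the typical workflow, not a drop-in copy of example.py. It assumes the IndicProcessor API (preprocess_batch / postprocess_batch) from IndicTransToolkit, whose exact import path may differ across versions, and uses a hypothetical English-to-Hindi example with the distilled En-Indic checkpoint; refer to example.py for the exact, up-to-date usage.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor  # installed by install.sh; import path may vary by version

ckpt_dir = "ai4bharat/indictrans2-en-indic-dist-200M"  # any En-Indic checkpoint from the table above
src_lang, tgt_lang = "eng_Latn", "hin_Deva"

tokenizer = AutoTokenizer.from_pretrained(ckpt_dir, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt_dir, trust_remote_code=True)
ip = IndicProcessor(inference=True)

sentences = ["When I was young, I used to go to the park every day."]

# Add language tags and normalize the input the way the model expects
batch = ip.preprocess_batch(sentences, src_lang=src_lang, tgt_lang=tgt_lang)
inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")

with torch.inference_mode():
    generated = model.generate(**inputs, num_beams=5, max_length=256)

# Decode and undo the normalization applied during preprocessing
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
translations = ip.postprocess_batch(decoded, lang=tgt_lang)
print(translations)
```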
Fine-tuning with LoRA
Before starting with fine-tuning the IndicTrans2 models, you will need to restructure your training data in the following format:
```
en-indic-exp
├── train
│   ├── eng_Latn-asm_Beng
│   │   ├── train.eng_Latn
│   │   └── train.asm_Beng
│   ├── eng_Latn-ben_Beng
│   │   └── ...
│   └── {src_lang}-{tgt_lang}
│       ├── train.{src_lang}
│       └── train.{tgt_lang}
└── dev
    ├── eng_Latn-asm_Beng
    │   ├── dev.eng_Latn
    │   └── dev.asm_Beng
    ├── eng_Latn-ben_Beng
    │   └── ...
    └── {src_lang}-{tgt_lang}
        ├── dev.{src_lang}
        └── dev.{tgt_lang}
```
Once you have the data ready in the above specified format, use the following command:
bash train_lora.sh <data_dir> <model_name> <output_dir> <direction> <src_lang_list> <tgt_lang_list>
We recommend referring to train_lora.sh for the default arguments used for fine-tuning. Please note that the specified hyperparameters may not be optimal and might require tuning for your use case.
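For example, a hypothetical invocation for fine-tuning the distilled En-Indic model on the en-indic-exp data layout shown above might look as follows. The direction value and the comma-separated language lists here are assumptions, so check train_lora.sh for the exact expected formats:
bash train_lora.sh ./en-indic-exp ai4bharat/indictrans2-en-indic-dist-200M ./lora_output en-indic eng_Latn asm_Beng,ben_Beng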
Inference with LoRA
You can load the LoRA adapters with the base model for inference by modifying the model initialization in the example.py script.
```python
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

base_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"  # change as per your use case
lora_ckpt_dir = "<path_to_lora_adapters>"            # directory with your fine-tuned LoRA adapters

# Load the base model and attach the LoRA adapters on top of it
base_model = AutoModelForSeq2SeqLM.from_pretrained(base_ckpt_dir, trust_remote_code=True)
lora_model = PeftModel.from_pretrained(base_model, lora_ckpt_dir)
```
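If you prefer a standalone checkpoint that does not require peft at serving time, the adapters can be merged back into the base weights. A small sketch, assuming peft's merge_and_unload (the output directory name is hypothetical):

```python
# Fold the LoRA weights into the base model and save a plain HF checkpoint
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("indictrans2-en-indic-lora-merged")
```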
Note: Please feel free to open issues on the GitHub repo in case of any queries/issues.
Citation
@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}