Metadata-Version: 2.2
Name: IndicTransToolkit
Version: 1.0.3
Summary: A simple, consistent, and extendable module for IndicTrans2 tokenizer compatible with HuggingFace models
Home-page: https://github.com/VarunGumma/IndicTransToolkit
Author: Varun Gumma
Author-email: [email protected]
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: setuptools>=68.2.2
Requires-Dist: torch
Requires-Dist: cython
Requires-Dist: sacremoses
Requires-Dist: sentencepiece
Requires-Dist: transformers
Requires-Dist: sacrebleu
Requires-Dist: indic-nlp-library-IT2@ git+https://github.com/VarunGumma/indic_nlp_library.git
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# IndicTransToolkit

## About
The goal of this repository is to provide a simple, modular, and extendable toolkit for [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2) that is compatible with the released HuggingFace models. Please refer to `CHANGELOG.md` for the latest developments.

## Pre-requisites
 - `Python 3.8+`
 - [Indic NLP Library](https://github.com/VarunGumma/indic_nlp_library)
 - Other requirements as listed in `requirements.txt`

## Configuration
 - Editable installation (note: this may take a while):
```bash
git clone https://github.com/VarunGumma/IndicTransToolkit
cd IndicTransToolkit
pip install --editable . --use-pep517 # required for pip >= 25.0

# in case it fails, try:
# pip install --editable . --use-pep517 --config-settings editable_mode=compat
```

## Examples
For the training use case, please refer [here](https://github.com/AI4Bharat/IndicTrans2/tree/main/huggingface_interface).

### PreTrainedTokenizer
```python
import torch
from IndicTransToolkit.processor import IndicProcessor  # NOW IMPLEMENTED IN CYTHON !!
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ip = IndicProcessor(inference=True)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",
]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva", visualize=False)  # set visualize=True to print a progress bar
batch = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

with tokenizer.as_target_tokenizer():
    # This scoping is absolutely necessary, as it instructs the tokenizer to decode using the target vocabulary.
    # Without it, the output will be de-tokenized with the source vocabulary instead, yielding gibberish/unexpected predictions.
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)

>>> ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']
```
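For larger corpora, the pipeline above is typically run over fixed-size batches rather than the whole sentence list at once. A minimal chunking helper (plain Python; this function is a hypothetical illustration, not part of the toolkit) might look like:

```python
def chunks(items, batch_size):
    """Yield successive fixed-size slices of a list; the last slice may be shorter."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

# Each chunk would then flow through:
# preprocess_batch -> tokenizer -> model.generate -> batch_decode -> postprocess_batch
batches = list(chunks(["s1", "s2", "s3", "s4", "s5"], 2))
```

Batch size is a memory/throughput trade-off and depends on sentence length, beam width, and available GPU memory.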

### Evaluation
- `IndicEvaluator` is a Python implementation of [compute_metrics.sh](https://github.com/AI4Bharat/IndicTrans2/blob/main/compute_metrics.sh).
- We have found that this Python implementation gives slightly lower scores than the original `compute_metrics.sh`, so please use it cautiously, and feel free to raise a PR if you find a bug/fix.
```python
from IndicTransToolkit import IndicEvaluator

# this method returns a dictionary with BLEU and ChrF2++ scores with appropriate signatures
evaluator = IndicEvaluator()
scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=pred_file, refs=ref_file)

# alternatively, you can pass lists of predictions and references instead of files
# scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=preds, refs=refs)
```



## Authors
 - Varun Gumma ([email protected])
 - Jay Gala ([email protected])
 - Pranjal Agadh Chitale ([email protected])
 - Raj Dabre ([email protected])





## Bugs and Contribution

Since this is a bleeding-edge module, you may occasionally encounter broken functionality and import issues. If you run into any bugs or want additional functionality, please feel free to raise `Issues`/`Pull Requests` or contact the authors.





## Citation

If you use our codebase or models, please cite the following paper:

```bibtex
@article{gala2023indictrans,
    title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
    author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2023},
    url={https://openreview.net/forum?id=vfT4YuzAYA},
    note={}
}
```