Spaces:

nitikdias
/

stt-ml

Running

File size: 5,913 Bytes

74ee63f

Metadata-Version: 2.2
Name: IndicTransToolkit
Version: 1.0.3
Summary: A simple, consistent, and extendable module for IndicTrans2 tokenizer compatible with HuggingFace models
Home-page: https://github.com/VarunGumma/IndicTransToolkit
Author: Varun Gumma
Author-email: [email protected]
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: setuptools>=68.2.2
Requires-Dist: torch
Requires-Dist: cython
Requires-Dist: sacremoses
Requires-Dist: sentencepiece
Requires-Dist: transformers
Requires-Dist: sacrebleu
Requires-Dist: indic-nlp-library-IT2@ git+https://github.com/VarunGumma/indic_nlp_library.git
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# IndicTransToolkit

## About
The goal of this repository is to provide a simple, modular, and extendable toolkit for [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2) and be compatible with the HuggingFace models released. Please refer to the `CHANGELOG.md` for latest developments.

## Pre-requisites
 - `Python 3.8+`
 - [Indic NLP Library](https://github.com/VarunGumma/indic_nlp_library)
 - Other requirements as listed in `requirements.txt`

## Configuration
 - Editable installation (Note, this may take a while):
```bash 
git clone https://github.com/VarunGumma/IndicTransToolkit
cd IndicTransToolkit



pip install --editable . --use-pep517 # required for pip >= 25.0



# in case it fails, try:

# pip install --editable . --use-pep517 --config-settings editable_mode=compat
```

## Examples
For the training usecase, please refer [here](https://github.com/AI4Bharat/IndicTrans2/tree/main/huggingface_interface).

### PreTainedTokenizer 
```python

import torch

from IndicTransToolkit.processor import IndicProcessor # NOW IMPLEMENTED IN CYTHON !!

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer



ip = IndicProcessor(inference=True)

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)



sentences = [

    "This is a test sentence.",

    "This is another longer different test sentence.",

    "Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",

]



batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva", visualize=False) # set it to visualize=True to print a progress bar

batch = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")



with torch.inference_mode():

    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)



with tokenizer.as_target_tokenizer():

    # This scoping is absolutely necessary, as it will instruct the tokenizer to tokenize using the target vocabulary.
    # Failure to use this scoping will result in gibberish/unexpected predictions as the output will be de-tokenized with the source vocabulary instead.
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)

>>> ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']
```

### Evaluation
- `IndicEvaluator` is a python implementation of [compute_metrics.sh](https://github.com/AI4Bharat/IndicTrans2/blob/main/compute_metrics.sh). 
- We have found that this python implementation gives slightly lower scores than the original `compute_metrics.sh`. So, please use this function cautiously, and feel free to raise a PR if you have found the bug/fix. 
```python

from IndicTransToolkit import IndicEvaluator



# this method returns a dictionary with BLEU and ChrF2++ scores with appropriate signatures

evaluator = IndicEvaluator()

scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=pred_file, refs=ref_file) 



# alternatively, you can pass the list of predictions and references instead of files 

# scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=preds, refs=refs)

```



## Authors

 - Varun Gumma ([email protected])

 - Jay Gala ([email protected])

 - Pranjal Agadh Chitale ([email protected])

 - Raj Dabre ([email protected])





## Bugs and Contribution

Since this a bleeding-edge module, you may encounter broken stuff and import issues once in a while. In case you encounter any bugs or want additional functionalities, please feel free to raise `Issues`/`Pull Requests` or contact the authors. 





## Citation

If you use our codebase, or models, please do cite the following paper:

```bibtex

@article{

    gala2023indictrans,

    title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
    author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2023},
    url={https://openreview.net/forum?id=vfT4YuzAYA},
    note={}
}
```