Metadata-Version: 2.2
Name: IndicTransToolkit
Version: 1.0.3
Summary: A simple, consistent, and extendable module for IndicTrans2 tokenizer compatible with HuggingFace models
Home-page: https://github.com/VarunGumma/IndicTransToolkit
Author: Varun Gumma
Author-email: [email protected]
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: setuptools>=68.2.2
Requires-Dist: torch
Requires-Dist: cython
Requires-Dist: sacremoses
Requires-Dist: sentencepiece
Requires-Dist: transformers
Requires-Dist: sacrebleu
Requires-Dist: indic-nlp-library-IT2@ git+https://github.com/VarunGumma/indic_nlp_library.git
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary
# IndicTransToolkit
## About
The goal of this repository is to provide a simple, modular, and extendable toolkit for [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2) that is compatible with the released HuggingFace models. Please refer to `CHANGELOG.md` for the latest developments.
## Pre-requisites
- `Python 3.8+`
- [Indic NLP Library](https://github.com/VarunGumma/indic_nlp_library)
- Other requirements as listed in `requirements.txt`
## Configuration
- Editable installation (Note, this may take a while):
```bash
git clone https://github.com/VarunGumma/IndicTransToolkit
cd IndicTransToolkit
pip install --editable . --use-pep517 # required for pip >= 25.0
# in case it fails, try:
# pip install --editable . --use-pep517 --config-settings editable_mode=compat
```
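After installation, a quick import check can confirm that the package and its heavier dependencies are visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of IndicTransToolkit.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be located by the import system."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names taken from the metadata and examples in this document.
    for name in ("IndicTransToolkit", "torch", "transformers"):
        status = "found" if is_installed(name) else "MISSING"
        print(f"{name}: {status}")
```

This only checks that the modules can be located; it does not import them, so it will not catch a broken Cython build.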
## Examples
For the training use case, please refer to the [IndicTrans2 HuggingFace interface](https://github.com/AI4Bharat/IndicTrans2/tree/main/huggingface_interface).
### PreTrainedTokenizer
```python
import torch
from IndicTransToolkit.processor import IndicProcessor  # now implemented in Cython
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ip = IndicProcessor(inference=True)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",
]

# Set visualize=True to print a progress bar.
batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva", visualize=False)
batch = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

with tokenizer.as_target_tokenizer():
    # This scoping is necessary: it instructs the tokenizer to use the target vocabulary.
    # Without it, the output is de-tokenized with the source vocabulary instead,
    # which results in gibberish/unexpected predictions.
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)
# Output:
# ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']
```
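For inputs larger than a handful of sentences, the steps in the example above are typically applied one batch at a time to keep memory use bounded. The following is a minimal sketch of the chunking step only; `chunked` and `batch_size` are hypothetical names introduced for illustration and are not part of IndicTransToolkit.

```python
from typing import Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    sentences = ["one", "two", "three", "four", "five"]
    for batch in chunked(sentences, batch_size=2):
        # Each `batch` would go through preprocess -> tokenize -> generate
        # -> decode -> postprocess, and the results would be concatenated.
        print(batch)
```

A suitable `batch_size` depends on the model size and available memory, so it is best chosen empirically.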
### Evaluation
- `IndicEvaluator` is a Python implementation of [compute_metrics.sh](https://github.com/AI4Bharat/IndicTrans2/blob/main/compute_metrics.sh).
- We have found that this Python implementation gives slightly lower scores than the original `compute_metrics.sh`. Please use it with caution, and feel free to raise a PR if you find the bug or a fix.
```python
from IndicTransToolkit import IndicEvaluator
# this method returns a dictionary with BLEU and ChrF2++ scores with appropriate signatures
evaluator = IndicEvaluator()
scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=pred_file, refs=ref_file)
# alternatively, you can pass the list of predictions and references instead of files
# scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=preds, refs=refs)
```
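When passing lists rather than files, predictions and references need to be aligned one-to-one. The following is a minimal sketch of loading two files into lists and checking that alignment. It assumes one sentence per line in UTF-8, which this document does not state; `read_lines` and `load_parallel` are hypothetical helpers, not part of IndicTransToolkit.

```python
from typing import List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_parallel(pred_path: str, ref_path: str) -> Tuple[List[str], List[str]]:
    """Load predictions and references, ensuring they have the same number of lines."""
    preds = read_lines(pred_path)
    refs = read_lines(ref_path)
    if len(preds) != len(refs):
        raise ValueError(
            f"line count mismatch: {len(preds)} predictions vs {len(refs)} references"
        )
    return preds, refs
```

The resulting `preds` and `refs` lists can then be passed to `evaluator.evaluate` in place of the file paths.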
## Authors
- Varun Gumma ([email protected])
- Jay Gala ([email protected])
- Pranjal Agadh Chitale ([email protected])
- Raj Dabre ([email protected])
## Bugs and Contribution
Since this is a bleeding-edge module, you may occasionally encounter breakage and import issues. If you encounter any bugs or want additional functionality, please feel free to raise `Issues`/`Pull Requests` or contact the authors.
## Citation
If you use our codebase or models, please cite the following paper:
```bibtex
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}
```