nitikdias's picture
Upload 114 files
74ee63f verified
# Changelog
# πŸ“’ Release v1.0.3
- 🚨 The `IndicProcessor` class has been re-written in [Cython](https://github.com/cython/cython) for faster implementation. This gives us atleast `+10 lines/s`.
- A new `visualize` argument as been added to `preprocess_batch` to track the processing with a `tqdm` bar.
# πŸ“’ Release v1.0.2
- The repository has been renamed to `IndicTransToolkit`.
- 🚨 The custom tokenizer is now **removed** from the repository. Please revert to a previous commit ([v1.0.1](https://github.com/VarunGumma/IndicTransToolkit/tree/0e68fb5872f4d821578a5252f90ad43c9649370f)) to use it **(strongly discouraged)**. The official _(and only tokenizer)_ is available on HF along with the models.
# πŸ“’ Release v1.0.0
- The [PreTrainedTokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) for IndicTrans2 is now available on HF πŸŽ‰πŸŽ‰ Note that, you still need the `IndicProcessor` to pre-process the sentences before tokenization.
- 🚨 **In favor of the standard PreTrainedTokenizer, we deprecated the custom tokenizer. However, this custom tokenizer will still be available here for backward compatibility, but no further updates/bug-fixes will be provided.**
- The `indic_evaluate` function is now consolidated into a concrete `IndicEvaluator` class.
- The data collation function for training is consolidated into a concrete `IndicDataCollator` class.
- A simple batching method is now available in the `IndicProcessor`.