Changelog

📢 Release v1.0.3

🚨 The IndicProcessor class has been rewritten in Cython for faster processing. This gives a speedup of at least 10 lines/s.
A new visualize argument has been added to preprocess_batch to track processing progress with a tqdm bar.

The repository has been renamed to IndicTransToolkit.
🚨 The custom tokenizer has been removed from the repository. To keep using it, revert to an earlier commit (v1.0.1); this is strongly discouraged. The official (and only) tokenizer is available on HF along with the models.

The PreTrainedTokenizer for IndicTrans2 is now available on HF 🎉🎉 Note that you still need the IndicProcessor to pre-process the sentences before tokenization.
🚨 The custom tokenizer is deprecated in favor of the standard PreTrainedTokenizer. It remains available here for backward compatibility, but it will receive no further updates or bug fixes.
The indic_evaluate function is now consolidated into a concrete IndicEvaluator class.
The data collation function for training is consolidated into a concrete IndicDataCollator class.
A simple batching method is now available in the IndicProcessor.
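
The batching entry above does not spell out the method's name or signature, so the following is only a rough illustration of what simple batching of input sentences usually means. It is a stdlib-only sketch: the helper name iter_batches and its batch_size parameter are hypothetical and are not the IndicProcessor API.

```python
from typing import Iterator, List, Sequence


def iter_batches(sentences: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of `sentences`, each at most `batch_size` long.

    Hypothetical helper for illustration only; not part of IndicTransToolkit.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    # Step through the input in strides of batch_size; the last chunk may be shorter.
    for start in range(0, len(sentences), batch_size):
        yield list(sentences[start:start + batch_size])


sentences = [
    "sentence one",
    "sentence two",
    "sentence three",
    "sentence four",
    "sentence five",
]

# Five sentences with batch_size=2 give chunks of sizes 2, 2 and 1.
batches = list(iter_batches(sentences, batch_size=2))
for index, batch in enumerate(batches):
    print(index, batch)
```

Each batch would then be pre-processed and tokenized as a unit. Slicing in input order keeps the outputs aligned with the original sentences, which matters when translations are later matched back to their sources.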