The IndicProcessor class has been rewritten in Cython for faster processing. This gives a speedup of at least 10 lines/s.
A new visualize argument has been added to preprocess_batch to track processing progress with a tqdm bar.
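As a rough illustration of how a flag like visualize can gate a progress bar, here is a minimal sketch. The function preprocess_batch_sketch and its whitespace normalization are hypothetical stand-ins, not the toolkit's actual Cython implementation; only the idea of wrapping the input in tqdm when the flag is set is taken from the note above.

```python
from typing import Iterable, List


def preprocess_batch_sketch(sentences: List[str], visualize: bool = False) -> List[str]:
    """Hypothetical stand-in for a batch pre-processing step.

    The real IndicProcessor does far more; this only collapses whitespace
    so that the progress-bar wiring is easy to see.
    """
    items: Iterable[str] = sentences
    if visualize:
        # Import lazily so tqdm is only required when a bar is requested.
        from tqdm import tqdm

        items = tqdm(sentences, desc="Preprocessing", unit="line")
    # Collapse runs of whitespace and trim both ends of each sentence.
    return [" ".join(s.split()) for s in items]


cleaned = preprocess_batch_sketch(["  hello   world ", "a\tb"])
```

Passing visualize=True to the sketch would show a tqdm bar while producing the same output.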
Release v1.0.2
The repository has been renamed to IndicTransToolkit.
The custom tokenizer has been removed from the repository. To keep using it (strongly discouraged), revert to a previous commit (v1.0.1). The official (and only) tokenizer is available on Hugging Face (HF) along with the models.
Release v1.0.0
The PreTrainedTokenizer for IndicTrans2 is now available on HF. Note that you still need the IndicProcessor to pre-process the sentences before tokenization.
The custom tokenizer is deprecated in favor of the standard PreTrainedTokenizer. It remains available here for backward compatibility, but will receive no further updates or bug fixes.
The indic_evaluate function is now consolidated into a concrete IndicEvaluator class.
The data collation function for training is consolidated into a concrete IndicDataCollator class.
A simple batching method is now available in the IndicProcessor.
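For readers unfamiliar with the idea, simple batching usually means slicing a list of sentences into fixed-size chunks. The sketch below shows that general technique only; iter_batches is a hypothetical helper and does not reflect the name or signature of the method in IndicProcessor.

```python
from typing import Iterator, List


def iter_batches(sentences: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size sentences (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sentences), batch_size):
        # The final slice may be shorter than batch_size.
        yield sentences[start:start + batch_size]


batches = list(iter_batches(["s1", "s2", "s3", "s4", "s5"], batch_size=2))
```

Each batch can then be pre-processed and tokenized on its own, which keeps memory use bounded for large inputs.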