|
# Changelog
|
|
|
|
# π’ Release v1.0.3
|
|
- π¨ The `IndicProcessor` class has been re-written in [Cython](https://github.com/cython/cython) for faster implementation. This gives us atleast `+10 lines/s`.
|
|
- A new `visualize` argument as been added to `preprocess_batch` to track the processing with a `tqdm` bar.
|
|
|
|
# π’ Release v1.0.2
|
|
- The repository has been renamed to `IndicTransToolkit`.
|
|
- π¨ The custom tokenizer is now **removed** from the repository. Please revert to a previous commit ([v1.0.1](https://github.com/VarunGumma/IndicTransToolkit/tree/0e68fb5872f4d821578a5252f90ad43c9649370f)) to use it **(strongly discouraged)**. The official _(and only tokenizer)_ is available on HF along with the models.
|
|
|
|
# π’ Release v1.0.0
|
|
- The [PreTrainedTokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) for IndicTrans2 is now available on HF ππ Note that, you still need the `IndicProcessor` to pre-process the sentences before tokenization.
|
|
- π¨ **In favor of the standard PreTrainedTokenizer, we deprecated the custom tokenizer. However, this custom tokenizer will still be available here for backward compatibility, but no further updates/bug-fixes will be provided.**
|
|
- The `indic_evaluate` function is now consolidated into a concrete `IndicEvaluator` class.
|
|
- The data collation function for training is consolidated into a concrete `IndicDataCollator` class.
|
|
- A simple batching method is now available in the `IndicProcessor`. |