---
library_name: transformers
license: apache-2.0
language:
- hi
pipeline_tag: token-classification
---

## Model Details

### BertWordPieceTokenizer

- A BERT-style WordPiece tokenizer for the Hindi language.

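As background, BERT-style WordPiece tokenizers split each word by greedy longest-match-first lookup against the vocabulary, marking continuation pieces with a `##` prefix. The following is a minimal sketch of that lookup in plain Python. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical and are not taken from this tokenizer's actual vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into sub-word pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"कित", "##ना", "शीतल", "है"}
print(wordpiece_tokenize("कितना", toy_vocab))  # ['कित', '##ना']
```

The real tokenizer applies the same idea with its trained vocabulary, after normalization and splitting on whitespace and punctuation.
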
#### Usage

```py
from transformers import AutoTokenizer

# Load the Hindi WordPiece tokenizer from the Hugging Face Hub
hi_tokenizer = AutoTokenizer.from_pretrained('krinal/BertWordPieceTokenizer-hi')

hi_str = "आज का सूर्य देखो, कितना प्यारा, कितना शीतल है"

# Encode text into a list of token IDs
encoded_str = hi_tokenizer.encode(hi_str)

# Decode the token IDs back into text
decoded_str = hi_tokenizer.decode(encoded_str)
```

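To see which sub-word pieces the IDs correspond to, a small helper along these lines may be useful. It is a sketch that relies only on the standard `encode`, `convert_ids_to_tokens`, and `decode` methods of `transformers` tokenizers. The helper name `show_tokens` is hypothetical.

```python
def show_tokens(tokenizer, text):
    """Return the sub-word tokens and the round-tripped text for `text`."""
    # encode() adds special tokens such as [CLS]/[SEP] if the tokenizer defines them
    ids = tokenizer.encode(text)
    # Map each ID back to its vocabulary entry (continuation pieces start with "##")
    tokens = tokenizer.convert_ids_to_tokens(ids)
    # Drop special tokens so the result can be compared with the input text
    restored = tokenizer.decode(ids, skip_special_tokens=True)
    return tokens, restored
```

With the objects from the snippet above, `tokens, restored = show_tokens(hi_tokenizer, hi_str)` returns the token strings alongside the decoded text. The decoded text may differ from the input in spacing around punctuation.
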
#### Language

- Hindi (`hi`)

#### Training

- For the training script, see [Train BertWordPieceTokenizer](https://gist.github.com/kjdeveloper8/57d9e16848cd77df778804c9e2214a78).

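For orientation, training a tokenizer of this kind with the `tokenizers` library could look roughly like the sketch below. This is an assumption about the general approach, not a copy of the linked gist: the function name `train_hindi_wordpiece`, the sample sentences, and all parameter values (`vocab_size`, `min_frequency`, the special-token list) are illustrative.

```python
from tokenizers import BertWordPieceTokenizer


def train_hindi_wordpiece(sentences, vocab_size=30000, min_frequency=2):
    """Train a WordPiece tokenizer on an iterable of Hindi sentences."""
    tokenizer = BertWordPieceTokenizer(
        clean_text=True,
        handle_chinese_chars=False,
        strip_accents=False,  # accent stripping would drop some Devanagari combining marks
        lowercase=False,      # Devanagari has no letter case
    )
    tokenizer.train_from_iterator(
        sentences,
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
        show_progress=False,
    )
    return tokenizer


# Tiny illustrative corpus; a real run would iterate over the full dataset
sample = ["आज का सूर्य देखो", "कितना प्यारा", "कितना शीतल है"]
tokenizer = train_hindi_wordpiece(sample, vocab_size=100, min_frequency=1)
print(tokenizer.encode("कितना शीतल है").tokens)
# tokenizer.save_model("output_dir")  # writes vocab.txt; the directory name is a placeholder
```

Setting `strip_accents=False` matters for Hindi because the BERT normalizer's accent stripping removes combining marks, which in Devanagari include several vowel signs.
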
#### Dataset

- Trained on BHAAV, a Hindi text corpus for emotion analysis
- Dataset source: [Bhaav](https://github.com/midas-research/bhaav)
- Size: 20,304 sentences

#### Citation

```bibtex
@article{kumar2019bhaav,
  title={BHAAV-A Text Corpus for Emotion Analysis from Hindi Stories},
  author={Kumar, Yaman and Mahata, Debanjan and Aggarwal, Sagar and Chugh, Anmol and Maheshwari, Rajat and Shah, Rajiv Ratn},
  journal={arXiv preprint arXiv:1910.04073},
  year={2019}
}
```