---
library_name: transformers
license: apache-2.0
language:
- hi
pipeline_tag: token-classification
---
## Model Details
### BertWordPieceTokenizer
- A WordPiece tokenizer for the Hindi language
#### Usage
```py
from transformers import AutoTokenizer
hi_tokenizer = AutoTokenizer.from_pretrained('krinal/BertWordPieceTokenizer-hi')
hi_str = "आज का सूर्य देखो, कितना प्यारा, कितना शीतल है"
# Encode the text into a list of token IDs
encoded_str = hi_tokenizer.encode(hi_str)
# Decode the token IDs back into a string
decoded_str = hi_tokenizer.decode(encoded_str)
```
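For intuition about what encoding does with each word, here is a minimal sketch of the greedy longest-match-first segmentation that WordPiece tokenizers typically apply. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical and chosen only for illustration; they are not this model's actual vocabulary, and the real tokenizer also handles normalization, pre-tokenization, and the mapping to token IDs.

```py
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split a single word into WordPiece units by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not the trained vocabulary).
toy_vocab = {"सूर्य", "शीत", "##ल", "कित", "##ना", "[UNK]"}

for w in ["सूर्य", "शीतल", "कितना", "प्यारा"]:
    print(w, wordpiece_tokenize(w, toy_vocab))
```

With this toy vocabulary, a word that is present as a whole stays intact, a word that is only partly covered is split into a leading piece plus `##` continuation pieces, and a word with no matching pieces falls back to `[UNK]`.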
#### Language
- Hindi (`hi`)
#### Training
- For the training procedure, see [Train BertWordPieceTokenizer](https://gist.github.com/kjdeveloper8/57d9e16848cd77df778804c9e2214a78)
#### Dataset
- Trained on BHAAV, a Hindi text corpus for emotion analysis
- Dataset source: [Bhaav](https://github.com/midas-research/bhaav)
- Corpus size: 20,304 Hindi sentences
#### Citation
```bibtex
@article{kumar2019bhaav,
title={BHAAV-A Text Corpus for Emotion Analysis from Hindi Stories},
author={Kumar, Yaman and Mahata, Debanjan and Aggarwal, Sagar and Chugh, Anmol and Maheshwari, Rajat and Shah, Rajiv Ratn},
journal={arXiv preprint arXiv:1910.04073},
year={2019}
}
```