---
library_name: transformers
tags: []
---
# k3mbed-wordpiece: Indonesian OSH WordPiece Tokenizer Model
This model adds 4,067 new Occupational Safety and Health (OSH)-specific words to the [indobenchmark/indobert-base-p1](https://huggingface.co/indobenchmark/indobert-base-p1) tokenizer.
This is a WordPiece tokenizer for use with BERT models. Its vocabulary does not yet include highly technical terms, so use it with caution in technical settings.
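To illustrate why adding domain words matters, here is a minimal sketch of greedy longest-match-first WordPiece segmentation in plain Python. It is not the tokenizer's actual implementation; the function name `wordpiece_tokenize`, the toy vocabularies, and the example word `perancah` (scaffolding) are assumptions chosen for illustration and are not taken from the real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece units using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies (hypothetical, for illustration only).
base_vocab = {"per", "##an", "##cah", "kerja"}
extended_vocab = base_vocab | {"perancah"}

before = wordpiece_tokenize("perancah", base_vocab)
after = wordpiece_tokenize("perancah", extended_vocab)

print(before)  # ['per', '##an', '##cah']
print(after)   # ['perancah']
```

With the word in the vocabulary, it maps to a single token instead of several subword fragments, which is the effect the added OSH words are meant to have.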
## Model Details
### Model Description
The tokenizer was extended with 4,067 new words collected from Indonesian OSH news sites. We used TF-IDF weighting to select the words most likely to be important.
- **Dictionary Size:** 30521
- **Language(s) (NLP):** Indonesian
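
The TF-IDF filtering step could look roughly like the sketch below. This is an assumption-laden illustration, not the procedure actually used: the exact TF-IDF variant, preprocessing, and cutoff are not stated in this card, and the helper names `tfidf_scores` and `select_new_words`, the sample sentences, and the `top_k` parameter are all hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return each word's highest TF-IDF score across tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count documents containing each word

    best = {}
    for doc in docs:
        counts = Counter(doc)
        for word, count in counts.items():
            tf = count / len(doc)                     # relative term frequency
            idf = math.log(n_docs / doc_freq[word])   # plain, unsmoothed IDF
            best[word] = max(best.get(word, 0.0), tf * idf)
    return best


def select_new_words(docs, existing_vocab, top_k):
    """Pick the top_k highest-scoring words that are not already in the vocabulary."""
    scores = tfidf_scores(docs)
    candidates = [w for w in scores if w not in existing_vocab]
    # Sort by descending score; break ties alphabetically for determinism.
    candidates.sort(key=lambda w: (-scores[w], w))
    return candidates[:top_k]


# Toy corpus and vocabulary (hypothetical, for illustration only).
docs = [
    "pekerja wajib memakai helm dan perancah harus diperiksa".split(),
    "perancah roboh di proyek dan pekerja selamat".split(),
    "pekerja memakai helm di proyek".split(),
]
existing_vocab = {"pekerja", "dan", "di", "wajib", "harus",
                  "memakai", "proyek", "helm", "selamat"}

new_words = select_new_words(docs, existing_vocab, top_k=2)
print(new_words)  # ['roboh', 'diperiksa']
```

With the `transformers` library, words selected this way can be appended to an existing tokenizer via `tokenizer.add_tokens(new_words)`, followed by `model.resize_token_embeddings(len(tokenizer))` on the model that will use it. Whether this tokenizer was built exactly that way is not stated here.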
## Model Card Authors
This model and model card were created and are maintained by the following contributors:
- **[Adi Wira Pratama](https://huggingface.co/wira-pratama)**: *Primary author, responsible for data curation, model training, evaluation, and documentation.*