---
library_name: transformers
tags: []
---

# k3mbed-wordpiece: Indonesian OSH WordPiece Tokenizer Model

This model adds 4,067 new Occupational Safety and Health (OSH) specific words to the [indobenchmark/indobert-base-p1](https://huggingface.co/indobenchmark/indobert-base-p1) tokenizer.

This is a WordPiece tokenizer for use with BERT models. Its vocabulary does not yet include highly technical terms, so use it with caution in technical settings.
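
To illustrate why adding whole domain words matters, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece splitting. The vocabularies are toy sets invented for this example (not the real IndoBERT or k3mbed vocabulary), and `wordpiece_tokenize` is a hypothetical helper rather than part of any library.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy base vocabulary (assumption: illustrative only).
base_vocab = {"[UNK]", "per", "##anc", "##ah", "kerja", "alat"}
# Same vocabulary with one hypothetical OSH word ("perancah", scaffolding) added.
extended_vocab = base_vocab | {"perancah"}

before = wordpiece_tokenize("perancah", base_vocab)
after = wordpiece_tokenize("perancah", extended_vocab)
print(before)
print(after)
```

With the toy base vocabulary the word is split into several subword pieces, while the extended vocabulary keeps it as a single token.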

## Model Details

### Model Description

The tokenizer was extended with 4,067 new words collected from OSH news sites in Indonesia. We used TF-IDF weighting to select the words most likely to be important.
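
The TF-IDF selection step could look roughly like the sketch below. The model card does not state the exact formula, tokenization, or cutoff, so everything here is an assumption: the toy documents, the plain `tf * log(N / df)` weighting, the `top_k` cutoff, and the helper names `tfidf_scores` and `select_new_words` are all hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return each word's highest TF-IDF score across tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    best = {}
    for doc in docs:
        counts = Counter(doc)
        for word, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[word])
            best[word] = max(best.get(word, 0.0), tf * idf)
    return best


def select_new_words(docs, base_vocab, top_k):
    """Pick the top_k highest-scoring words that the base vocab lacks."""
    scores = tfidf_scores(docs)
    candidates = [w for w in scores if w not in base_vocab]
    # Sort by descending score, breaking ties alphabetically.
    candidates.sort(key=lambda w: (-scores[w], w))
    return candidates[:top_k]


# Toy corpus standing in for OSH news articles (assumption).
docs = [
    "pekerja wajib memakai helm dan harness saat bekerja di perancah".split(),
    "inspeksi perancah dan harness dilakukan sebelum pekerja naik".split(),
    "pemerintah dan pekerja membahas upah".split(),
]
# Toy stand-in for words the base tokenizer already knows (assumption).
base_vocab = {
    "dan", "di", "saat", "pekerja", "wajib", "memakai", "sebelum",
    "naik", "pemerintah", "upah", "bekerja", "dilakukan", "membahas",
}

new_words = select_new_words(docs, base_vocab, top_k=3)
print(new_words)
```

Words that occur in every document get an IDF of zero, so common function words drop to the bottom of the ranking even before the base-vocabulary filter is applied.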

- **Dictionary Size:** 30521
- **Language(s) (NLP):** Indonesian

## Model Card Authors

This model and model card were created and are maintained by the following contributors:

- **[Adi Wira Pratama](https://huggingface.co/wira-pratama)** – *Primary author, responsible for data curation, model training, evaluation, and documentation.*