k3mbed-wordpiece: Indonesian OSH WordPiece Tokenizer Model
This model adds 4,067 new Occupational Safety and Health (OSH) specific words to the indobenchmark/indobert-base-p1 tokenizer. It is a WordPiece tokenizer for use with BERT models. The vocabulary does not yet include highly technical terms, so use it with caution in technical settings.
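To illustrate why adding domain words matters, here is a minimal sketch of WordPiece-style greedy longest-match-first segmentation in plain Python. The toy vocabulary, the function name `wordpiece_tokenize`, and the example word are assumptions made for illustration; they are not taken from this model's actual vocabulary or code.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Segment one word with greedy longest-match-first WordPiece matching."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched at this position, so the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary (not the real IndoBERT vocabulary).
base_vocab = {"[UNK]", "ke", "##sel", "##amat", "##an", "kerja"}
# The same vocabulary after adding one whole domain word.
extended_vocab = base_vocab | {"keselamatan"}

before = wordpiece_tokenize("keselamatan", base_vocab)
after = wordpiece_tokenize("keselamatan", extended_vocab)
print(before)
print(after)
```

With the toy base vocabulary the word is split into several subword pieces, while the extended vocabulary keeps it as a single token. The real tokenizer may differ in details such as normalization and how added words are stored.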
Model Details
Model Description
The tokenizer was extended with 4,067 new words collected from OSH news sites in Indonesia. We used TF-IDF weighting to select the words most likely to be important.
- Dictionary Size: 30521
- Language(s) (NLP): Indonesian
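The TF-IDF filtering mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: whitespace tokenization, a plain `tf * log(N / df)` score, a top-k cutoff, and the helper name `top_tfidf_terms` are all hypothetical, and the sample sentences are made up. The exact procedure used to build this vocabulary is not documented here.

```python
import math
from collections import Counter


def top_tfidf_terms(documents, top_k=3):
    """Rank terms by their best TF-IDF score across documents (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Keep the highest TF-IDF score each term reaches in any single document.
    best = {}
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            score = (count / len(tokens)) * math.log(n_docs / df[term])
            best[term] = max(best.get(term, 0.0), score)

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Made-up sample sentences standing in for OSH news text.
docs = [
    "pekerja wajib memakai helm dan sepatu",
    "pekerja wajib memakai rompi",
    "pekerja wajib mengikuti pelatihan",
]
candidates = top_tfidf_terms(docs, top_k=3)
print(candidates)
```

Terms that occur in every document (here "pekerja" and "wajib") receive a score of zero and drop out, which is the property that makes TF-IDF useful for picking out distinctive candidate words.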
Model Card Authors
This model and model card were created and are maintained by the following contributor:
- Adi Wira Pratama – Primary author, responsible for data curation, model training, evaluation, and documentation.