k3mbed-wordpiece: Indonesian OSH WordPiece Tokenizer Model

This tokenizer adds 4,067 new Occupational Safety and Health (OSH) specific words to the indobenchmark/indobert-base-p1 tokenizer. It is a WordPiece tokenizer intended for use with BERT models. Its vocabulary does not yet include highly technical terms, so use it with caution in technical settings.
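To illustrate why adding domain words matters for a WordPiece tokenizer, here is a minimal sketch of greedy longest-match-first segmentation in plain Python. The toy vocabularies and the example word are hypothetical and are not taken from the actual k3mbed-wordpiece vocabulary. The function is a simplified illustration, not the implementation used by this model.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary standing in for a general-purpose base tokenizer.
base_vocab = {"[UNK]", "per", "##anc", "##ah", "kerja"}

# The same vocabulary after adding one OSH term ("perancah", scaffold) as a whole word.
extended_vocab = base_vocab | {"perancah"}

split_before = wordpiece_tokenize("perancah", base_vocab)
split_after = wordpiece_tokenize("perancah", extended_vocab)
print(split_before)
print(split_after)
```

With the base vocabulary the word is broken into several subword pieces, while the extended vocabulary keeps it as a single token. When an extended tokenizer is paired with a pretrained BERT checkpoint in Hugging Face transformers, the model's embedding matrix usually has to be resized with `model.resize_token_embeddings(len(tokenizer))`, and the embeddings for the new tokens remain untrained until the model is fine-tuned.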

Model Details

Model Description

The tokenizer was extended with 4,067 new words collected from OSH news sites in Indonesia. TF-IDF weighting was used to select the words most likely to be important.

  • Dictionary Size: 30521
  • Language(s) (NLP): Indonesian
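The TF-IDF filtering step mentioned above can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' actual pipeline: the scoring formula (term frequency times a smoothed log IDF, keeping each term's maximum score across documents), the whitespace tokenization, the `top_k` cutoff, and the toy documents are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def top_tfidf_terms(documents, known_vocab, top_k=3):
    """Rank out-of-vocabulary terms by their best TF-IDF score across documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = {}
    for tokens in tokenized:
        term_counts = Counter(tokens)
        for term, count in term_counts.items():
            if term in known_vocab:
                # Skip words the base tokenizer already knows.
                continue
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            score = (count / len(tokens)) * idf
            scores[term] = max(scores.get(term, 0.0), score)

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Hypothetical mini-corpus standing in for OSH news articles.
docs = [
    "pekerja wajib memakai helm dan perancah harus diperiksa",
    "perancah roboh karena perancah tidak diperiksa",
    "pekerja memakai helm di lokasi",
]
# Hypothetical set of words already present in the base vocabulary.
known_vocab = {"pekerja", "wajib", "memakai", "helm", "dan",
               "harus", "karena", "tidak", "di", "lokasi"}

candidates = top_tfidf_terms(docs, known_vocab, top_k=2)
print(candidates)
```

In a real workflow, the selected candidates would then be appended to the base tokenizer's vocabulary. The cutoff that yields exactly 4,067 words is not documented here, so `top_k` above is purely illustrative.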

Model Card Authors

This model and model card were created and are maintained by the following contributors:

  • Adi Wira Pratama – Primary author, responsible for data curation, model training, evaluation, and documentation.