metadata
library_name: transformers
tags: []
k3mbed-wordpiece: Indonesian OSH WordPiece Tokenizer Model
This model adds 4,067 new Occupational Safety and Health (OSH) specific words to the indobenchmark/indobert-base-p1 tokenizer. It is a WordPiece tokenizer for use with BERT models. Its vocabulary does not yet include highly technical terms, so use it with caution in technical settings.
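To illustrate why adding domain words matters, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece tokenizers apply to a single word. The two vocabularies are toy, hypothetical examples and are not taken from this model; the card also does not say whether the new words are stored in the WordPiece vocabulary itself or as added tokens, so treat this only as a picture of the general effect.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece units using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies, NOT the real model vocabulary.
base_vocab = {"[UNK]", "kes", "##ela", "##mat", "##an"}
extended_vocab = base_vocab | {"keselamatan"}  # "keselamatan" means "safety"

base_pieces = wordpiece_tokenize("keselamatan", base_vocab)
extended_pieces = wordpiece_tokenize("keselamatan", extended_vocab)
print(base_pieces)
print(extended_pieces)
```

With the toy base vocabulary the word is broken into several sub-word pieces, while the extended vocabulary keeps it as a single token, which is the kind of effect domain-specific additions are meant to have.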
Model Details
Model Description
The tokenizer was extended with 4,067 new words collected from OSH news sites in Indonesia. We used TF-IDF weighting to select the words most likely to be important.
- Dictionary Size: 30521
- Language(s) (NLP): Indonesian
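The TF-IDF filtering mentioned in the description could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' pipeline: the function name `tfidf_top_words`, the whitespace tokenization, the max-over-documents scoring, the toy sentences, and the `top_k` cutoff are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tfidf_top_words(documents, known_vocab, top_k):
    """Rank words missing from known_vocab by their highest TF-IDF score."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Keep the best TF-IDF score each word reaches in any single document.
    scores = Counter()
    for tokens in tokenized:
        counts = Counter(tokens)
        for word, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] = max(scores[word], tf * idf)

    # Only words that are new to the tokenizer and carry some weight.
    candidates = [(w, s) for w, s in scores.items()
                  if w not in known_vocab and s > 0]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in candidates[:top_k]]


# Hypothetical toy corpus standing in for OSH news articles.
docs = [
    "pekerja wajib memakai helm dan rompi",
    "pekerja wajib memakai sarung tangan",
    "pekerja memeriksa perancah sebelum naik perancah",
]
# Hypothetical stand-in for words the base tokenizer already knows.
known_vocab = {"pekerja", "wajib", "memakai", "dan", "sebelum", "naik", "tangan"}

new_words = tfidf_top_words(docs, known_vocab, top_k=2)
print(new_words)
```

One possible next step, again as an assumption rather than the documented procedure, is to pass the selected words to `tokenizer.add_tokens(new_words)` from the transformers library and then call `model.resize_token_embeddings(len(tokenizer))` so the model's embedding matrix matches the enlarged vocabulary.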
Model Card Authors
This model and model card were created and are maintained by the following contributors:
- Adi Wira Pratama – Primary author, responsible for data curation, model training, evaluation, and documentation.