k3mbed-wordpiece-v1 / README.md
metadata
library_name: transformers
tags: []

k3mbed-wordpiece: Indonesian OSH WordPiece Tokenizer Model

This model adds 4,067 new Occupational Safety and Health (OSH) specific words to the indobenchmark/indobert-base-p1 tokenizer. It is a WordPiece tokenizer for use with BERT models. The vocabulary does not yet include highly technical terms, so use it with caution in technical settings.
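To illustrate why adding domain words matters for a WordPiece tokenizer, here is a minimal sketch of greedy longest-match-first splitting in plain Python. The toy vocabulary and the example word `perancah` (scaffolding) are assumptions chosen for illustration; they are not taken from the actual k3mbed vocabulary, and the real tokenizer also handles normalization and punctuation that this sketch skips.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word with greedy longest-match-first WordPiece."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real tokenizer vocabulary.
base_vocab = {"per", "##anc", "##ah", "kerja"}
extended_vocab = base_vocab | {"perancah"}

split_before = wordpiece_tokenize("perancah", base_vocab)
split_after = wordpiece_tokenize("perancah", extended_vocab)
print(split_before)  # ['per', '##anc', '##ah']
print(split_after)   # ['perancah']
```

With the word in the vocabulary, the term is kept as a single token instead of being fragmented into sub-word pieces.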

Model Details

Model Description

We extended the tokenizer with 4,067 new words collected from OSH news sites in Indonesia. We used TF-IDF weighting to select the words most likely to be important.
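The selection step can be sketched roughly as follows. This is only an illustration of TF-IDF based filtering under assumptions: the whitespace tokenization, the scoring formula (term frequency times `log(N / df) + 1`, keeping each term's best score across documents), the `top_k` cutoff, and the sample sentences are all hypothetical and may differ from the procedure actually used for this model.

```python
import math
from collections import Counter


def top_tfidf_terms(documents, known_vocab, top_k):
    """Return up to top_k terms missing from known_vocab, ranked by TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Keep the best TF-IDF score each term reaches in any document.
    scores = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            scores[term] = max(scores[term], tf * idf)

    # Only words the base tokenizer does not already know are candidates.
    candidates = [(t, s) for t, s in scores.items() if t not in known_vocab]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in candidates[:top_k]]


# Hypothetical mini-corpus standing in for scraped OSH news text.
docs = [
    "pekerja wajib memakai apd di area proyek",
    "perancah harus diperiksa sebelum pekerja naik",
    "apd dan perancah diperiksa setiap hari",
]
# Hypothetical subset of words already present in the base vocabulary.
known = {"pekerja", "wajib", "di", "dan", "harus", "sebelum",
         "setiap", "hari", "area", "naik"}

new_words = top_tfidf_terms(docs, known, top_k=3)
print(new_words)
```

With the Hugging Face `transformers` library, a list like `new_words` could then be passed to the tokenizer's `add_tokens` method; whether this model was built that way is not stated here.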

  • Dictionary Size: 30521
  • Language(s) (NLP): Indonesian
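A minimal loading sketch, assuming the tokenizer is published on the Hugging Face Hub under the repository ID `wira-pratama/k3mbed-wordpiece-v1` (inferred from the page header, not confirmed by this card). The helper names `load_osh_tokenizer` and `summarize` are hypothetical. `len(tokenizer)` counts added tokens as well, so whether it equals the dictionary size listed above depends on how the new words were stored.

```python
# Assumed repository ID; adjust it if the tokenizer lives elsewhere.
REPO_ID = "wira-pratama/k3mbed-wordpiece-v1"


def load_osh_tokenizer(repo_id=REPO_ID):
    """Download and return the tokenizer (requires network access)."""
    # Imported lazily so the module can be inspected without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def summarize(tokenizer, text):
    """Report the vocabulary size and the WordPiece split of a sample text."""
    return {"size": len(tokenizer), "tokens": tokenizer.tokenize(text)}


# Usage (downloads files from the Hub on first call):
#   tokenizer = load_osh_tokenizer()
#   print(summarize(tokenizer, "Pekerja wajib memakai alat pelindung diri."))
```

If the tokenizer is paired with the original indobert-base-p1 weights, the model's embedding matrix would likely need to be resized to the new vocabulary, for example with `model.resize_token_embeddings(len(tokenizer))`, and the new embeddings trained before use.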

Model Card Authors

This model and model card were created and are maintained by the following contributor:

  • Adi Wira Pratama: Primary author, responsible for data curation, model training, evaluation, and documentation.