A BertTokenizer-based tokenizer that segments Chinese/Cantonese sentences into words and phrases.

In addition to the 51,271 tokens inherited from the base tokenizer, 194,020 Chinese vocabulary entries have been added, bringing the total vocabulary to 245,291 tokens.
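
A quick way to confirm the vocabulary size after loading (a minimal sketch; the 245,291 total assumes the added entries do not overlap with the base vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('raptorkwok/wordseg-tokenizer')
# Expect 51,271 base tokens + 194,020 added entries = 245,291 in total.
print(len(tokenizer))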

Usage:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('raptorkwok/wordseg-tokenizer')
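
Because segmented words are themselves vocabulary entries, each phrase maps to a single token ID (a minimal sketch; the actual ID values depend on the released vocabulary file):

tokens = tokenizer.tokenize("我哋今日去睇陳奕迅演唱會")
ids = tokenizer.convert_tokens_to_ids(tokens)
# A multi-character phrase such as '演唱會' corresponds to one ID, not one ID per character.
print(ids)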

Examples:

Cantonese Example 1

tokenizer.tokenize("我哋今日去睇陳奕迅演唱會")
# Output: ['我哋', '今日', '去', '睇', '陳奕迅', '演唱會']

Cantonese Example 2

tokenizer.tokenize("再嘈我打爆你個嘴!")
# Output: ['再', '嘈', '我', '打爆', '你', '個', '嘴', '!']

Chinese Example 1

tokenizer.tokenize("你很肥胖呢,要開始減肥了。")
# Output: ['你', '很', '肥胖', '呢', ',', '要', '開始', '減肥', '了', '。']

Chinese Example 2

tokenizer.tokenize("案件現由大嶼山警區重案組接手調查。")
# Output: ['案件', '現', '由', '大嶼山', '警區', '重案組', '接手', '調查', '。']
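
The tokenizer can also be called directly to build model-ready inputs; the standard transformers call adds the [CLS] and [SEP] special tokens by default (a minimal sketch; the returned ID values depend on the vocabulary file):

encoding = tokenizer("案件現由大嶼山警區重案組接手調查。")
# input_ids covers [CLS], one ID per segmented phrase, then [SEP].
print(encoding['input_ids'])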

Questions?

Please feel free to leave a message in the Community tab.
