g2pW-canto-20241206-bert-base

This is a G2P (Grapheme-to-Phoneme) model trained on the Naozumi0512/g2p-Cantonese-aggregate-pos-retag dataset and evaluated on the yue-g2p-benchmark.

Model Overview

The model is built on hon9kon9ize/bert-base-cantonese. For more details, see https://github.com/Naozumi520/g2pW-Cantonese.

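For reference, the underlying encoder can be loaded on its own with the Hugging Face transformers library. This is a minimal sketch that loads only the pretrained Cantonese BERT backbone named above, not the fine-tuned G2P model.

```python
# Sketch: load the pretrained backbone (hon9kon9ize/bert-base-cantonese) with
# Hugging Face transformers. This is NOT the fine-tuned g2pW G2P model itself.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hon9kon9ize/bert-base-cantonese")
model = AutoModel.from_pretrained("hon9kon9ize/bert-base-cantonese")

inputs = tokenizer("廣東話", return_tensors="pt")
outputs = model(**inputs)  # contextual hidden states from the backbone
```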

Dataset

The model was trained on the Naozumi0512/g2p-Cantonese-aggregate-pos-retag dataset, which includes:

  • 68,500 Cantonese words/phrases with corresponding phonetic transcriptions.
  • Data is formatted to align with the CPP (Chinese Polyphones with Pinyin) structure.
  • Sources include:
    • Rime Cantonese Input Schema (jyut6ping3.words.dict.yaml)
    • 粵典 Words.hk
    • CantoDict
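
As a minimal sketch, the dataset can be pulled directly with the Hugging Face datasets library. The split and field names shown in the comments are assumptions, not confirmed; inspect the first record to see the actual CPP-style columns.

```python
# Sketch: load the training dataset with the datasets library.
# Assumes the dataset loads with its default configuration and has a "train"
# split; column names are illustrative only.
from datasets import load_dataset

ds = load_dataset("Naozumi0512/g2p-Cantonese-aggregate-pos-retag")
print(ds)               # available splits and column names
print(ds["train"][0])   # one CPP-style record (character, context, Jyutping label, ...)
```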

Evaluation

The model was evaluated on the yue-g2p-benchmark:

| Metric             | Score  |
|--------------------|--------|
| Accuracy           | 0.9117 |
| Phoneme Error Rate | 0.0274 |
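
To make the second metric concrete, the sketch below computes a standard phoneme error rate: the Levenshtein edit distance between predicted and reference phoneme sequences, divided by the reference length. This is a generic definition and may differ in detail from the benchmark's own scoring script.

```python
# Generic phoneme error rate (PER): edit distance between predicted and
# reference phoneme sequences, normalised by reference length. Not guaranteed
# to match the exact scoring used by the yue-g2p-benchmark.
from typing import Sequence


def phoneme_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    # Standard edit-distance dynamic programming over phoneme tokens.
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution / match
            )
    return dp[m][n] / max(m, 1)


# Hypothetical example: Jyutping syllables treated as phoneme tokens.
print(phoneme_error_rate(["gwong2", "dung1", "waa2"], ["gwong2", "dung1", "waa6"]))  # 0.333...
```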

Inference

Inference code and usage instructions are available on the 20241206-bert-base branch of the g2pW-Cantonese repository: https://github.com/Naozumi520/g2pW-Cantonese/tree/20241206-bert-base
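
As a rough sketch only: the upstream g2pW project exposes a G2PWConverter class, and the Cantonese fork linked above is assumed here to offer a similar interface. Check the 20241206-bert-base branch README for the actual entry point and model paths.

```python
# Hedged sketch, assuming the fork mirrors the upstream g2pW interface.
from g2pw import G2PWConverter  # assumed import path, following upstream g2pW

converter = G2PWConverter()    # loads the model weights (path/options may differ)
print(converter("廣東話"))      # expected: Jyutping transcriptions for each character
```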
