hexgrad/Kokoro-82M

Text-to-Speech

English

hexgrad committed 17 days ago

Commit aa89b69 (verified) · 1 Parent(s): 9cb51e9

Upload 2 files

Files changed (2)

README.md +16 -4
VOICES.md +17 -5

README.md CHANGED

@@ -20,10 +20,22 @@ pipeline_tag: text-to-speech
 ### Kokoro is getting an upgrade!
-| Model | Date | Training Data | A100 80GB vRAM | GPU Cost | Released Voices | Released Langs |
-| ----- | ---- | ------------- | -------------- | -------- | --------------- | -------------- |
-| v0.19 | 2024 Dec 25 | <100h | 500 hrs | $400 | 10 | 1 |
-| v1.0 | 2025 Jan 27 | Few hundred hrs | 1000 hrs | $1000 | [26+](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) | ? |
 ### Usage

 ### Kokoro is getting an upgrade!
+| Model | Published | Training Data | Compute (A100 80GB) | Released Voices | Released Langs |
+| ----- | --------- | ------------- | ------------------- | --------------- | -------------- |
+| v0.19 | 2024 Dec 25 | <100 hrs | 500 hrs @ $400 | 10 | 1 |
+| **v1.0** | 2025 Jan 27 | Few hundred hrs | 1000 hrs @ $1000 | [27+](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) | [2+](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) |
+Training is continuous. The v0.19 model was produced "on the way" to the v1.0 model, so the Compute footprints overlap.
+### Voices and Languages
+Voices are listed in [VOICES.md](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md). Not all voices are created equal:
+- Subjectively, voices will sound better or worse to different people.
+- Objectively, having less training data for a given voice (minutes instead of hours) lowers inference quality.
+- Objectively, poor audio quality in training data (compression, sample rate, artifacts) lowers inference quality.
+- Objectively, text-audio misalignment (too much text, i.e. hallucinations, or not enough text, i.e. failed transcriptions) lowers inference quality.
+Support for non-English languages may be absent or thin due to weak G2P and/or lack of training data. Some languages are only represented by a small handful or even just one voice (French).
 ### Usage
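
Reviewer's aside on the new Compute column: because the README now says the v0.19 and v1.0 footprints overlap, the v1.0 figures can be read as cumulative totals that already include v0.19. The sketch below works through that arithmetic. It assumes the table values are exact cumulative totals (they are probably rounded), and the names `runs` and `rate` are hypothetical, introduced only for illustration.

```python
# Figures copied from the README table. Treating them as cumulative totals is
# an assumption based on the note that the compute footprints overlap.
runs = {
    "v0.19": {"gpu_hours": 500, "cost_usd": 400},
    "v1.0": {"gpu_hours": 1000, "cost_usd": 1000},
}


def rate(gpu_hours: float, cost_usd: float) -> float:
    """Average cost in USD per A100 80GB hour."""
    return cost_usd / gpu_hours


# Average rate for each published model.
rates = {name: rate(**figures) for name, figures in runs.items()}

# Compute spent after v0.19 was published, under the cumulative assumption.
extra_hours = runs["v1.0"]["gpu_hours"] - runs["v0.19"]["gpu_hours"]
extra_cost = runs["v1.0"]["cost_usd"] - runs["v0.19"]["cost_usd"]
extra_rate = rate(extra_hours, extra_cost)

for name, value in rates.items():
    print(f"{name}: ${value:.2f} per GPU hour on average")
print(f"v0.19 -> v1.0: {extra_hours} hrs, ${extra_cost}, ${extra_rate:.2f} per GPU hour")
```

Under that reading, the step from v0.19 to v1.0 accounts for half of the total hours but more than half of the total cost.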

VOICES.md CHANGED

@@ -2,7 +2,7 @@
 For each voice, the given grades are intended to be estimates of the **quality and quantity** of its associated training data, both of which impact overall inference quality.
-Voices may also subjectively sound better or worse to different people.
 **Target Quality**
 - How high quality is the reference voice? This grade may be impacted by audio quality, artifacts, compression, & sample rate.
@@ -13,14 +13,16 @@ Voices may also subjectively sound better or worse to different people.
 ### American 🇺🇸
 | Name | Traits | Target Quality | Training Duration | Overall Grade |
 | ---- | ------ | -------------- | ----------------- | ------------- |
 | af_alloy | 🚺 | B | MM minutes | C |
-| af_aoede | 🚺 | A | H hours | B+ |
-| af_bella | 🚺🔥 | A | HH hours | A- |
 | af_jessica | 🚺 | C | MM minutes | D |
 | af_kore | 🚺 | B | H hours | C+ |
-| af_nicole | 🚺🎧 | B | HH hours | B- |
 | af_nova | 🚺 | B | MM minutes | C |
 | af_river | 🚺 | C | MM minutes | D |
 | af_sarah | 🚺 | B | H hours | C+ |
@@ -36,13 +38,23 @@ Voices may also subjectively sound better or worse to different people.
 ### British 🇬🇧
 | Name | Traits | Target Quality | Training Duration | Overall Grade |
 | ---- | ------ | -------------- | ----------------- | ------------- |
 | bf_alice | 🚺 | C | MM minutes | D |
-| bf_emma | 🚺 | B | HH hours | B- |
 | bf_isabella | 🚺 | B | MM minutes | C |
 | bf_lily | 🚺 | C | MM minutes | D |
 | bm_daniel | 🚹 | C | MM minutes | D |
 | bm_fable | 🚹 | B | MM minutes | C |
 | bm_george | 🚹 | B | MM minutes | C |
 | bm_lewis | 🚹 | C | H hours | D+ |

 For each voice, the given grades are intended to be estimates of the **quality and quantity** of its associated training data, both of which impact overall inference quality.
+Subjectively, voices will sound better or worse to different people.
 **Target Quality**
 - How high quality is the reference voice? This grade may be impacted by audio quality, artifacts, compression, & sample rate.
 ### American 🇺🇸
+American G2P: [`misaki[en]`](https://github.com/hexgrad/misaki) with `en-us` espeak-ng fallback
 | Name | Traits | Target Quality | Training Duration | Overall Grade |
 | ---- | ------ | -------------- | ----------------- | ------------- |
 | af_alloy | 🚺 | B | MM minutes | C |
+| af_aoede | 🚺 | B | H hours | C+ |
+| af_bella | 🚺🔥 | **A** | **HH hours** | **A-** |
 | af_jessica | 🚺 | C | MM minutes | D |
 | af_kore | 🚺 | B | H hours | C+ |
+| af_nicole | 🚺🎧 | B | **HH hours** | B- |
 | af_nova | 🚺 | B | MM minutes | C |
 | af_river | 🚺 | C | MM minutes | D |
 | af_sarah | 🚺 | B | H hours | C+ |
 ### British 🇬🇧
+British G2P: [`misaki[en]`](https://github.com/hexgrad/misaki) with `en-gb` espeak-ng fallback
 | Name | Traits | Target Quality | Training Duration | Overall Grade |
 | ---- | ------ | -------------- | ----------------- | ------------- |
 | bf_alice | 🚺 | C | MM minutes | D |
+| bf_emma | 🚺 | B | **HH hours** | B- |
 | bf_isabella | 🚺 | B | MM minutes | C |
 | bf_lily | 🚺 | C | MM minutes | D |
 | bm_daniel | 🚹 | C | MM minutes | D |
 | bm_fable | 🚹 | B | MM minutes | C |
 | bm_george | 🚹 | B | MM minutes | C |
 | bm_lewis | 🚹 | C | H hours | D+ |
+### French 🇫🇷
+French G2P: espeak-ng `fr-fr`
+| Name | Traits | Target Quality | Training Duration | Overall Grade |
+| ---- | ------ | -------------- | ----------------- | ------------- |
+| [ff_siwis](https://datashare.ed.ac.uk/handle/10283/2353) | 🚺 | B | <11 hours | B- |
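
Reviewer's aside on voice names: the tables in this diff suggest a two-letter prefix convention, where the first letter tracks the section heading (`a` American, `b` British, `f` French) and the second tracks the 🚺/🚹 trait (`f` female, `m` male). This convention is inferred from the rows shown here and is not stated in the diff itself. A minimal sketch of a parser under that assumption follows; `parse_voice_id`, `Voice`, `LANGS`, and `GENDERS` are hypothetical names and not part of any Kokoro API.

```python
from typing import NamedTuple

# Prefix meanings inferred from the tables in this diff (an assumption).
LANGS = {"a": "American English", "b": "British English", "f": "French"}
GENDERS = {"f": "female", "m": "male"}


class Voice(NamedTuple):
    language: str
    gender: str
    name: str


def parse_voice_id(voice_id: str) -> Voice:
    """Split an ID such as 'af_bella' into language, gender, and name."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    lang_code, gender_code = prefix
    if lang_code not in LANGS or gender_code not in GENDERS:
        raise ValueError(f"unknown prefix in voice ID: {voice_id!r}")
    return Voice(LANGS[lang_code], GENDERS[gender_code], name)


if __name__ == "__main__":
    for vid in ("af_bella", "bm_george", "ff_siwis"):
        print(vid, "->", parse_voice_id(vid))
```

Rejecting unknown prefixes, rather than guessing, keeps the sketch honest about covering only the three languages that appear in this commit.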