Text-to-Speech
English
hexgrad committed on
Commit
aa89b69
·
verified ·
1 Parent(s): 9cb51e9

Upload 2 files

Files changed (2)
  1. README.md +16 -4
  2. VOICES.md +17 -5
README.md CHANGED
@@ -20,10 +20,22 @@ pipeline_tag: text-to-speech

  ### Kokoro is getting an upgrade!

- | Model | Date | Training Data | A100 80GB vRAM | GPU Cost | Released Voices | Released Langs |
- | ----- | ---- | ------------- | -------------- | -------- | --------------- | -------------- |
- | v0.19 | 2024 Dec 25 | <100h | 500 hrs | $400 | 10 | 1 |
- | v1.0 | 2025 Jan 27 | Few hundred hrs | 1000 hrs | $1000 | [26+](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) | ? |
+ | Model | Published | Training Data | Compute (A100 80GB) | Released Voices | Released Langs |
+ | ----- | --------- | ------------- | ------------------- | --------------- | -------------- |
+ | v0.19 | 2024 Dec 25 | <100 hrs | 500 hrs @ $400 | 10 | 1 |
+ | **v1.0** | 2025 Jan 27 | Few hundred hrs | 1000 hrs @ $1000 | [27+](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) | [2+](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) |
+
+ Training is continuous. The v0.19 model was produced "on the way" to the v1.0 model, so the Compute footprints overlap.
+
+ ### Voices and Languages
+
+ Voices are listed in [VOICES.md](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md). Not all voices are created equal:
+ - Subjectively, voices will sound better or worse to different people.
+ - Objectively, having less training data for a given voice (minutes instead of hours) lowers inference quality.
+ - Objectively, poor audio quality in training data (compression, sample rate, artifacts) lowers inference quality.
+ - Objectively, text-audio misalignment (too much text i.e. hallucinations, or not enough text i.e. failed transcriptions) lowers inference quality.
+
+ Support for non-English languages may be absent or thin due to weak G2P and/or lack of training data. Some languages are only represented by a small handful or even just one voice (French).

  ### Usage

VOICES.md CHANGED
@@ -2,7 +2,7 @@

  For each voice, the given grades are intended to be estimates of the **quality and quantity** of its associated training data, both of which impact overall inference quality.

- Voices may also subjectively sound better or worse to different people.
+ Subjectively, voices will sound better or worse to different people.

  **Target Quality**
  - How high quality is the reference voice? This grade may be impacted by audio quality, artifacts, compression, & sample rate.
@@ -13,14 +13,16 @@ Voices may also subjectively sound better or worse to different people.

  ### American 🇺🇸

+ American G2P: [`misaki[en]`](https://github.com/hexgrad/misaki) with `en-us` espeak-ng fallback
+
  | Name | Traits | Target Quality | Training Duration | Overall Grade |
  | ---- | ------ | -------------- | ----------------- | ------------- |
  | af_alloy | 🚺 | B | MM minutes | C |
- | af_aoede | 🚺 | A | H hours | B+ |
- | af_bella | 🚺🔥 | A | HH hours | A- |
+ | af_aoede | 🚺 | B | H hours | C+ |
+ | af_bella | 🚺🔥 | **A** | **HH hours** | **A-** |
  | af_jessica | 🚺 | C | MM minutes | D |
  | af_kore | 🚺 | B | H hours | C+ |
- | af_nicole | 🚺🎧 | B | HH hours | B- |
+ | af_nicole | 🚺🎧 | B | **HH hours** | B- |
  | af_nova | 🚺 | B | MM minutes | C |
  | af_river | 🚺 | C | MM minutes | D |
  | af_sarah | 🚺 | B | H hours | C+ |
@@ -36,13 +38,23 @@ Voices may also subjectively sound better or worse to different people.

  ### British 🇬🇧

+ British G2P: [`misaki[en]`](https://github.com/hexgrad/misaki) with `en-gb` espeak-ng fallback
+
  | Name | Traits | Target Quality | Training Duration | Overall Grade |
  | ---- | ------ | -------------- | ----------------- | ------------- |
  | bf_alice | 🚺 | C | MM minutes | D |
- | bf_emma | 🚺 | B | HH hours | B- |
+ | bf_emma | 🚺 | B | **HH hours** | B- |
  | bf_isabella | 🚺 | B | MM minutes | C |
  | bf_lily | 🚺 | C | MM minutes | D |
  | bm_daniel | 🚹 | C | MM minutes | D |
  | bm_fable | 🚹 | B | MM minutes | C |
  | bm_george | 🚹 | B | MM minutes | C |
  | bm_lewis | 🚹 | C | H hours | D+ |
+
+ ### French 🇫🇷
+
+ French G2P: espeak-ng `fr-fr`
+
+ | Name | Traits | Target Quality | Training Duration | Overall Grade |
+ | ---- | ------ | -------------- | ----------------- | ------------- |
+ | [ff_siwis](https://datashare.ed.ac.uk/handle/10283/2353) | 🚺 | B | <11 hours | B- |
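
The voice names in these tables appear to follow a `<language><gender>_<name>` pattern: `af_` and `bf_` rows carry the 🚺 trait under the American and British headings, `bm_` rows carry 🚹, and `ff_siwis` is the single French voice. The Python sketch below decodes a name under that inferred convention. The lookup tables and the `parse_voice` helper are hypothetical and are not part of Kokoro or misaki; the G2P strings are copied from the notes added in this diff.

```python
# Hypothetical helper: decodes voice names under a convention inferred from
# VOICES.md (first letter = language, second letter = gender). Not a Kokoro API.
LANGUAGES = {
    "a": ("American", "misaki[en] with en-us espeak-ng fallback"),
    "b": ("British", "misaki[en] with en-gb espeak-ng fallback"),
    "f": ("French", "espeak-ng fr-fr"),
}
GENDERS = {"f": "female", "m": "male"}


def parse_voice(voice_id):
    """Split a voice id such as 'af_bella' into language, gender, and name."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice id format: {voice_id!r}")
    lang_code, gender_code = prefix
    if lang_code not in LANGUAGES or gender_code not in GENDERS:
        raise ValueError(f"unknown prefix in voice id: {voice_id!r}")
    language, g2p = LANGUAGES[lang_code]
    return {
        "name": name,
        "language": language,
        "gender": GENDERS[gender_code],
        "g2p": g2p,
    }


for voice in ("af_bella", "bm_george", "ff_siwis"):
    print(voice, parse_voice(voice))
```

Only the three language prefixes visible in this diff are mapped; any other prefix raises `ValueError` rather than guessing.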