SamratBarai committed on
Commit 4dc3d0d · verified · 1 Parent(s): 83a55ae

Update README.md

Files changed (1): README.md (+170 −160)
---
license: apache-2.0
language:
- en
base_model:
- yl4579/StyleTTS2-LJSpeech
pipeline_tag: text-to-speech
title: Kokoro-82M Interface
sdk: gradio
emoji: 🐨
colorFrom: blue
colorTo: yellow
pinned: true
short_description: An interface made with Gradio for the Kokoro-82M model
---

<h1>This Gradio interface was created by <a href="https://github.com/SamratBarai/">@SamratBarai</a></h1>

📣 Jan 12 Status: Intent to improve the base model: https://hf.co/hexgrad/Kokoro-82M/discussions/36

❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy

<audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/HEARME.wav" type="audio/wav"></audio>

**Kokoro** is a frontier TTS model for its size of **82 million parameters** (text in/audio out).

On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 2 Jan 2025, 10 unique voicepacks have been released, and a `.onnx` version of v0.19 is available.

In the weeks leading up to its release, Kokoro v0.19 was the #1 🥇 ranked model in [TTS Spaces Arena](https://huggingface.co/hexgrad/Kokoro-82M#evaluation). Kokoro achieved a higher Elo in this single-voice Arena setting than other models, while using fewer parameters and less data:
1. **Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio**
2. XTTS v2: 467M, CPML, >10k hours
3. Edge TTS: Microsoft, proprietary
4. MetaVoice: 1.2B, Apache, 100k hours
5. Parler Mini: 880M, Apache, 45k hours
6. Fish Speech: ~500M, CC-BY-NC-SA, 1M hours

Kokoro's ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.

You can find a hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).

### Usage

The following can be run in a single cell on [Google Colab](https://colab.research.google.com/).
```py
# 1️⃣ Install dependencies silently
!git lfs install
!git clone https://huggingface.co/hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch

# 2️⃣ Build the model and load the default voicepack
from models import build_model
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL = build_model('kokoro-v0_19.pth', device)
VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
][0]
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')

# 3️⃣ Call generate, which returns 24 kHz audio and the phonemes used
from kokoro import generate
text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
# Language is determined by the first letter of the VOICE_NAME:
# 🇺🇸 'a' => American English => en-us
# 🇬🇧 'b' => British English => en-gb

# 4️⃣ Display the 24 kHz audio and print the output phonemes
from IPython.display import display, Audio
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)
```
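Outside a notebook you may want the generated audio on disk rather than played inline. A minimal stdlib sketch, assuming `audio` is a sequence of mono float samples in [-1, 1] as produced above (the `save_wav` helper and the 440 Hz demo signal are illustrative, not part of this repository):

```py
import math
import struct
import wave

def save_wav(path, samples, rate=24000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    with wave.open(path, 'wb') as wf:
        wf.setnchannels(1)     # mono
        wf.setsampwidth(2)     # 16-bit PCM
        wf.setframerate(rate)  # Kokoro outputs 24 kHz audio
        # Clamp and scale floats to int16 before packing
        ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
        wf.writeframes(struct.pack(f'<{len(ints)}h', *ints))

# Stand-in for real model output: 0.1 s of a 440 Hz sine at 24 kHz
demo = [math.sin(2 * math.pi * 440 * n / 24000) for n in range(2400)]
save_wav('out.wav', demo)
```

Since `scipy` is already installed above, `scipy.io.wavfile.write('out.wav', 24000, audio)` is an equivalent one-liner.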
If you have trouble with `espeak-ng`, see this [GitHub issue](https://github.com/bootphon/phonemizer/issues/44#issuecomment-1540885186). [Mac users also see this](https://huggingface.co/hexgrad/Kokoro-82M/discussions/12#677435d3d8ace1de46071489), and [Windows users see this](https://huggingface.co/hexgrad/Kokoro-82M/discussions/12#67742594fdeebf74f001ecfc).

For ONNX usage, see [#14](https://huggingface.co/hexgrad/Kokoro-82M/discussions/14).

### Model Facts

No affiliation can be assumed between parties on different lines.

**Architecture:**
- StyleTTS 2: https://arxiv.org/abs/2306.07691
- ISTFTNet: https://arxiv.org/abs/2203.02395
- Decoder only: no diffusion, no encoder release

**Architected by:** Li et al @ https://github.com/yl4579/StyleTTS2

**Trained by:** `@rzvzn` on Discord

**Supported Languages:** American English, British English

**Model SHA256 Hash:** `3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a`
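You can check a downloaded copy of the weights against this hash with a short stdlib sketch (the `sha256_of` helper is illustrative; the filename matches the usage snippet above):

```py
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large weights never load fully into memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

EXPECTED = '3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a'
# sha256_of('kokoro-v0_19.pth') == EXPECTED  # True for an intact download
```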

### Releases
- 25 Dec 2024: Model v0.19, `af_bella`, `af_sarah`
- 26 Dec 2024: `am_adam`, `am_michael`
- 28 Dec 2024: `bf_emma`, `bf_isabella`, `bm_george`, `bm_lewis`
- 30 Dec 2024: `af_nicole`
- 31 Dec 2024: `af_sky`
- 2 Jan 2025: ONNX v0.19 `ebef4245`

### Licenses
- Apache 2.0 weights in this repository
- MIT inference code in [spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS) adapted from [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- GPLv3 dependency in [espeak-ng](https://github.com/espeak-ng/espeak-ng)

The inference code was originally MIT licensed by the paper author. Note that this card applies only to this model, Kokoro. Original models published by the paper author can be found at [hf.co/yl4579](https://huggingface.co/yl4579).

### Evaluation

**Metric:** Elo rating

**Leaderboard:** [hf.co/spaces/Pendrokar/TTS-Spaces-Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena)

![TTS-Spaces-Arena-25-Dec-2024](demo/TTS-Spaces-Arena-25-Dec-2024.png)

The voice ranked in the Arena is a 50-50 mix of Bella and Sarah. For your convenience, this mix is included in this repository as `af.pt`, but you can trivially reproduce it like this:

```py
import torch
bella = torch.load('voices/af_bella.pt', weights_only=True)
sarah = torch.load('voices/af_sarah.pt', weights_only=True)
af = torch.mean(torch.stack([bella, sarah]), dim=0)
assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))
```
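The same averaging idea extends to uneven blends. A hypothetical sketch (the `mix_voices` helper, the 70/30 weighting, and the `af_custom.pt` filename are all illustrative, not shipped in this repository):

```py
import torch

def mix_voices(packs, weights):
    """Blend voicepack tensors as a weighted average; weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-6, 'weights must sum to 1'
    stacked = torch.stack(packs)  # shape: (n_packs, ...pack dims...)
    # Reshape weights to (n_packs, 1, 1, ...) so they broadcast over pack dims
    w = torch.tensor(weights).view(-1, *([1] * (stacked.dim() - 1)))
    return (stacked * w).sum(dim=0)

# e.g. a 70/30 Bella/Sarah blend (illustrative):
# bella = torch.load('voices/af_bella.pt', weights_only=True)
# sarah = torch.load('voices/af_sarah.pt', weights_only=True)
# torch.save(mix_voices([bella, sarah], [0.7, 0.3]), 'voices/af_custom.pt')
```

With equal weights `[0.5, 0.5]` this reduces to the `torch.mean` reproduction of `af.pt` shown above.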

### Training Details

**Compute:** Kokoro v0.19 was trained on A100 80 GB VRAM instances for approximately 500 total GPU hours. At an average cost of around $0.80 per GPU hour, the total cost was around $400.
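The cost figure is just the product of the two estimates above; as a quick sanity check:

```py
gpu_hours = 500       # approximate total A100 GPU hours
usd_per_hour = 0.80   # approximate average cost per GPU hour
total_cost = gpu_hours * usd_per_hour  # ≈ $400
```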

**Data:** Kokoro was trained exclusively on **permissive/non-copyrighted audio data** and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc.
- Synthetic audio<sup>[1]</sup> generated by closed<sup>[2]</sup> TTS models from large providers<br/>
[1] https://copyright.gov/ai/ai_policy_guidance.pdf<br/>
[2] No synthetic audio from open TTS models or "custom voice clones"

**Epochs:** Less than **20 epochs**

**Total Dataset Size:** Less than **100 hours** of audio

### Limitations

Kokoro v0.19 is limited in some specific ways, due to its training set and/or architecture:
- [Data] Lacks voice cloning capability, likely due to the small <100h training set
- [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes
- [Data] Training dataset is mostly long-form reading and narration, not conversation
- [Arch] At 82M params, Kokoro almost certainly loses to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash
- [Data] Multilingual capability is architecturally feasible, but training data is mostly English

Refer to the [Philosophy discussion](https://huggingface.co/hexgrad/Kokoro-82M/discussions/5) to better understand these limitations.

**Will the other voicepacks be released?** There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).

### Acknowledgements
- [@yl4579](https://huggingface.co/yl4579) for architecting StyleTTS 2
- [@Pendrokar](https://huggingface.co/Pendrokar) for adding Kokoro as a contender in the TTS Spaces Arena

### Model Card Contact

`@rzvzn` on Discord. Server invite: https://discord.gg/QuGxSWBfQy

<img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />

https://terminator.fandom.com/wiki/Kokoro