Delete demo


Files changed (6)

demo/HEARME.txt +0 -47
demo/HEARME.wav +0 -3
demo/TTS-Spaces-Arena-25-Dec-2024.png +0 -3
demo/af_sky.txt +0 -11
demo/af_sky.wav +0 -3
demo/restoring-sky.md +0 -42

demo/HEARME.txt DELETED

@@ -1,47 +0,0 @@
-Kokoro is a frontier TTS model for its size of 82 million parameters.
-On the 25th of December, 2024, Kokoro v0 point 19 weights were permissively released in full fp32 precision along with 2 voicepacks (Bella and Sarah), all under an Apache 2 license.
-At the time of release, Kokoro v0 point 19 was the number 1 ranked model in TTS Spaces Arena. With 82 million parameters trained for under 20 epics on under 100 total hours of audio, Kokoro achieved higher Eelo in this single-voice Arena setting, over larger models. Kokoro's ability to top this Eelo ladder using relatively low compute and data, suggests that the scaling law for traditional TTS models might have a steeper slope than previously expected.
-Licenses. Apache 2 weights in this repository. MIT inference code. GPLv3 dependency in espeak NG.
-The inference code was originally MIT licensed by the paper author. Note that this card applies only to this model, Kokoro.
-Evaluation. Metric: Eelo rating. Leaderboard: TTS Spaces Arena.
-The voice ranked in the Arena is a 50 50 mix of Bella and Sarah. For your convenience, this mix is included in this repository as A-F dot PT, but you can trivially re-produce it.
-Training Details.
-Compute: Kokoro was trained on "A100 80GB v-ram instances" rented from Vast.ai. Vast was chosen over other compute providers due to its competitive on-demand hourly rates. The average hourly cost for the A100 80GB v-ram instances used for training was below $1 per hour per GPU, which was around half the quoted rates from other providers at the time.
-Data: Kokoro was trained exclusively on permissive non-copyrighted audio data and IPA phoneme labels. Examples of permissive non-copyrighted audio include:
-Public domain audio. Audio licensed under Apache, MIT, etc.
-Synthetic audio[1] generated by closed[2] TTS models from large providers.
-Epics: Less than 20 Epics. Total Dataset Size: Less than 100 hours of audio.
-Limitations. Kokoro v0 point 19 is limited in some ways, in its training set and architecture:
-Lacks voice cloning capability, likely due to small, under 100 hour training set.
-Relies on external g2p, which introduces a class of g2p failure modes.
-Training dataset is mostly long-form reading and narration, not conversation.
-At 82 million parameters, Kokoro almost certainly falls to a well-trained 1B+ parameter diffusion transformer, or a many-billion-parameter M LLM like GPT 4o or Gemini 2 Flash.
-Multilingual capability is architecturally feasible, but training data is almost entirely English.
-Will the other voicepacks be released?
-There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo.
-Acknowledgements. yL4 5 7 9 for architecting StyleTTS 2.
-Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.
-Model Card Contact. @rzvzn on Discord.
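
The deleted script above describes the Arena voice as a 50/50 mix of Bella and Sarah that can be trivially reproduced. A minimal sketch of that idea, assuming a voicepack can be treated as a flat sequence of floats: the function name `mix_voices` and the sample values are hypothetical, and the real voicepacks are `.pt` files whose layout this diff does not show. With tensors, the same operation would be an element-wise mean.

```python
def mix_voices(a, b, weight=0.5):
    """Blend two equally sized voice vectors; weight=0.5 gives a 50/50 mix."""
    if len(a) != len(b):
        raise ValueError("voice vectors must have the same length")
    # Element-wise weighted average of the two vectors.
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


# Hypothetical stand-ins for the Bella and Sarah voicepacks.
bella = [0.25, -0.5, 1.0]
sarah = [0.75, 0.0, -1.0]

af = mix_voices(bella, sarah)
print(af)
```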

demo/HEARME.wav DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:98b884082db74c250b3cecda78341d1724c66727c0391b29a0160af918eccdb3
-size 11198508
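
The three deleted lines above are a Git LFS pointer, not the audio itself: a spec version, a content hash, and the size in bytes of the real file. A minimal sketch of reading those fields, assuming the simple `key value` line layout shown in this diff (the helper name `parse_lfs_pointer` is illustrative, not part of any Git LFS tooling):

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its version, hash and size fields."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each pointer line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:98b884082db74c250b3cecda78341d1724c66727c0391b29a0160af918eccdb3
size 11198508
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"])
```

The pointer for `demo/HEARME.wav` therefore refers to a file of 11,198,508 bytes stored outside the Git history.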

demo/TTS-Spaces-Arena-25-Dec-2024.png DELETED

Git LFS Details

SHA256: e78b5ec1557323fa0e62681c83f6b81777f9834b91bbf26bf7567b036f011d52
Pointer size: 132 Bytes
Size of remote file: 1.07 MB

demo/af_sky.txt DELETED

@@ -1,11 +0,0 @@
-Last September, I received an offer from Sam Altman, who wanted to hire me to voice the current ChatGPT 4 system. He told me that he felt that by my voicing the system, I could bridge the gap between tech companies and creatives and help consumers to feel comfortable with the seismic shift concerning humans and AI. He said he felt that my voice would be comforting to people.
-After much consideration and for personal reasons, I declined the offer. Nine months later, my friends, family and the general public all noted how much the newest system named Sky sounded like me.
-When I heard the released demo, I was shocked, angered and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news ou'tlits could not tell the difference. Mr. Altman even insinuated that the similarity was intentional, tweeting a single word — hur — a reference to the film in which I voiced a chat system, Samantha, who forms an intimate relationship with a human.
-Two days before the ChatGPT 4 demo was released, Mr. Altman contacted my agent, asking me to reconsider. Before we could connect, the system was out there.
-As a result of their actions, I was forced to hire legal counsel, who wrote two letters to Mr. Altman and OpenAI, setting out what they had done and asking them to detail the exact process by which they created the Sky voice. Consequently, OpenAI reluctantly agreed to take down the Sky voice.
-In a time when we are all grappling with deepfakes and the protection of our own likeness, our own work, our own identities, I believe these are questions that deserve absolute clarity. I look forward to resolution in the form of transparency and the passage of appropriate legislation to help ensure that individual rights are protected.

demo/af_sky.wav DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:ce36292bf868aa5f15931f3d81a9f46cc35ea76372e618a5e4453c9542e5ad7e
-size 5486636

demo/restoring-sky.md DELETED

@@ -1,42 +0,0 @@
-# Restoring Sky & reflecting on Kokoro
-<img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />
-For those who don't know, [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) is an Apache TTS model that uses a skinny version of the open [StyleTTS 2](https://github.com/yl4579/StyleTTS2/tree/main) architecture.
-Based on leaderboard [Elo rating](https://huggingface.co/hexgrad/Kokoro-82M#evaluation) (prior to getting [review bombed](https://huggingface.co/datasets/Pendrokar/TTS_Arena/discussions/2)), Kokoro appears to do more with less, a theme that is surely [top-of-mind](https://huggingface.co/deepseek-ai/DeepSeek-V3) for many. Its peak performance on specific voices is comparable to or better than that of much larger models, but it has not yet been trained on enough data to effectively zero-shot out of distribution (aka voice cloning).
-Tonight on NYE, `af_sky` joins Kokoro's roster of downloadable voices. This follows last night's quiet release of `af_nicole`, and an additional 8 voices are currently available: 2F 2M voices each for American & British English.
-Nicole in particular was trained on ~10 hours of synthetic data, and demonstrates that you _can_ include unique speaking styles in a general-purpose TTS model without affecting the stock voices (even in a low data small model): a good sign for scalability.
-Sky is interesting because it is the voice that ScarJo [got OpenAI to take down](https://x.com/OpenAI/status/1792443575839678909), so new training data cannot be generated. However, OpenAI did not remove 2023 samples of Sky from their [blog post](https://openai.com/index/chatgpt-can-now-see-hear-and-speak/), and along with a few seconds lying around various other parts of the internet, we can cobble together about 3 minutes of 2023 Sky.
-```sh
-wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/story-sky.mp3
-wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/recipe-sky.mp3
-wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/speech-sky.mp3
-wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/poem-sky.mp3
-wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/info-sky.mp3
-```
-To be clear, this is not the first attempt to reconstruct Sky. On X, Benjamin De Kraker posted:
-> Here's the official statement released by Scarlett Johansson, detailing OpenAI's alleged illegal usage of her voice...
-> ...read by the Sky AI voice, because irony.
-> https://x.com/BenjaminDEKR/status/1792693868497871086
-and in the replies, he [stated](https://x.com/BenjaminDEKR/status/1792714347275501595):
-> It's an ElevenLabs clone I made based on Sky audio before they removed it. Not perfect.
-Here is `Kokoro/af_sky`'s rendition of the same:
-<audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/af_sky.wav" type="audio/wav"></audio>
-A crude reconstruction, but the model that produced that voice is Apache FOSS that can be downloaded from HF and run locally. You can reproduce the above by dragging the [text script](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/demo/af_sky.txt) (note a handful of modified chars for better delivery) into the "Long Form" tab of this [hosted demo](https://huggingface.co/spaces/hexgrad/Kokoro-TTS), or you can download the [model weights](https://huggingface.co/hexgrad/Kokoro-82M), install dependencies and DIY.
-Sky shows that it is possible to reconstruct a voice—maybe a shadow of its former self, but a reconstruction nonetheless—from fairly little training data.
-### What's next
-Kokoro is a good start, but I can think of some tricks that might make it better, beginning with better data. More on this in another article.
-Feel free to check out [Kokoro's weights](https://huggingface.co/hexgrad/Kokoro-82M), try out a no-install [hosted demo](https://huggingface.co/spaces/hexgrad/Kokoro-TTS), and/or [join the Discord](https://discord.gg/QuGxSWBfQy).