anuragsingh922 committed
Commit d7dfeff · verified · 1 Parent(s): 21a8ffb

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See the raw diff for the full set.

Files changed (50)
  1. .DS_Store +0 -0
  2. .env +1 -0
  3. .gitattributes +4 -0
  4. .gitignore +0 -0
  5. README.md +196 -0
  6. __pycache__/chat_database.cpython-310.pyc +0 -0
  7. __pycache__/chat_database.cpython-313.pyc +0 -0
  8. __pycache__/grpc.cpython-310.pyc +0 -0
  9. __pycache__/grpc.cpython-313.pyc +0 -0
  10. __pycache__/grpc_code.cpython-310.pyc +0 -0
  11. __pycache__/istftnet.cpython-310.pyc +0 -0
  12. __pycache__/istftnet.cpython-312.pyc +0 -0
  13. __pycache__/istftnet.cpython-313.pyc +0 -0
  14. __pycache__/kokoro.cpython-310.pyc +0 -0
  15. __pycache__/kokoro.cpython-312.pyc +0 -0
  16. __pycache__/kokoro.cpython-313.pyc +0 -0
  17. __pycache__/models.cpython-310.pyc +0 -0
  18. __pycache__/models.cpython-312.pyc +0 -0
  19. __pycache__/models.cpython-313.pyc +0 -0
  20. __pycache__/plbert.cpython-310.pyc +0 -0
  21. __pycache__/plbert.cpython-312.pyc +0 -0
  22. __pycache__/plbert.cpython-313.pyc +0 -0
  23. __pycache__/queue.cpython-310.pyc +0 -0
  24. __pycache__/text_to_speech_pb2.cpython-310.pyc +0 -0
  25. __pycache__/text_to_speech_pb2.cpython-313.pyc +0 -0
  26. __pycache__/text_to_speech_pb2_grpc.cpython-310.pyc +0 -0
  27. __pycache__/text_to_speech_pb2_grpc.cpython-313.pyc +0 -0
  28. app.py +206 -0
  29. backend/.DS_Store +0 -0
  30. backend/.gitignore +2 -0
  31. backend/app.js +22 -0
  32. backend/config.env +1 -0
  33. backend/config.js +7 -0
  34. backend/handle-realtime-tts/cleangRPCconnections.js +47 -0
  35. backend/handle-realtime-tts/makegRPCconnection.js +40 -0
  36. backend/handle-realtime-tts/sttModelSocket.js +289 -0
  37. backend/handle-realtime-tts/text_to_speech.proto +32 -0
  38. backend/package-lock.json +0 -0
  39. backend/package.json +27 -0
  40. backend/providers/updateChathistory.js +46 -0
  41. backend/utils/session.js +15 -0
  42. chat_database.py +64 -0
  43. chat_history.pkl +3 -0
  44. config.json +26 -0
  45. demo/HEARME.txt +47 -0
  46. demo/HEARME.wav +3 -0
  47. demo/TTS-Spaces-Arena-25-Dec-2024.png +3 -0
  48. demo/af_sky.txt +11 -0
  49. demo/af_sky.wav +3 -0
  50. demo/restoring-sky.md +42 -0
.DS_Store ADDED
Binary file (8.2 kB).
 
.env ADDED
@@ -0,0 +1 @@
+ OPENAI_API_KEY = <openai_api_key>
.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+ TTS-Spaces-Arena-25-Dec-2024.png filter=lfs diff=lfs merge=lfs -text
+ HEARME.wav filter=lfs diff=lfs merge=lfs -text
+ demo/af_sky.wav filter=lfs diff=lfs merge=lfs -text
+ output.wav filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
File without changes
README.md ADDED
@@ -0,0 +1,196 @@
+ # **Realtime TTS System**
+ This repository contains the complete codebase for building your own realtime Text-to-Speech (TTS) system. It integrates the Kokoro TTS model, gRPC communication, an Express server, and a React-based client. Follow this guide to set up and explore the system.
+
+ ---
+
+ ## **Repository Structure**
+ ```
+ ├── backend/    # Express server for handling API requests
+ ├── frontend/   # React client for user interaction
+ ├── .env        # Environment variables (OpenAI API key, etc.)
+ ├── voices/     # All available voice packs
+ ├── demo/       # Demo files for the model
+ ├── other...
+ ```
+
+ ---
+
+ ## **Setup Guide**
+
+ ### **Step 1: Clone the Repository**
+ Clone this repository to your local machine:
+ ```bash
+ git clone https://huggingface.co/anuragsingh922/realtime-tts
+ cd realtime-tts
+ ```
+
+ ---
+
+ ### **Step 2: Python Virtual Environment Setup**
+ Create a virtual environment to manage dependencies:
+
+ #### macOS/Linux:
+ ```bash
+ python3 -m venv venv
+ source venv/bin/activate
+ ```
+
+ #### Windows:
+ ```bash
+ python -m venv venv
+ venv\Scripts\activate
+ ```
+
+ ---
+
+ ### **Step 3: Install Python Dependencies**
+ With the virtual environment activated, install the required dependencies:
+ ```bash
+ pip install --upgrade pip setuptools wheel
+ pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
+ pip install -r requirements.txt
+ ```
+
+ ### **Installing eSpeak**
+ `eSpeak` is a necessary dependency for the TTS system. Follow the instructions below to install it on your platform:
+
+ #### **Ubuntu/Linux**
+ Use the `apt-get` package manager to install `eSpeak`:
+ ```bash
+ sudo apt-get update
+ sudo apt-get install espeak
+ ```
+
+ #### **macOS**
+ Install `eSpeak` using [Homebrew](https://brew.sh/):
+ 1. Ensure Homebrew is installed on your system:
+    ```bash
+    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
+    ```
+ 2. Install `espeak`:
+    ```bash
+    brew install espeak
+    ```
+
+ #### **Windows**
+ For Windows, follow these steps to install `eSpeak`:
+ 1. Download the eSpeak installer from the official website: [eSpeak Downloads](http://espeak.sourceforge.net/download.html).
+ 2. Run the installer and follow the on-screen instructions to complete the installation.
+ 3. Add the `eSpeak` installation path to your system's `PATH` environment variable:
+    - Open **System Properties** → **Advanced** → **Environment Variables**.
+    - In the "System Variables" section, find the `Path` variable and edit it.
+    - Add the path to the `espeak.exe` file (e.g., `C:\Program Files (x86)\eSpeak`).
+ 4. Verify the installation by opening Command Prompt and running:
+    ```cmd
+    espeak --version
+    ```
+
+ ---
+
+ ### **Verification**
+ After installing `eSpeak`, verify it is correctly set up by running:
+ ```bash
+ espeak "Hello, world!"
+ ```
+
+ This should speak "Hello, world!" through your system audio.
+
+ ---
+
+ ### **Step 4: Backend Setup (Express Server)**
+ 1. Navigate to the `backend` directory:
+    ```bash
+    cd backend
+    ```
+ 2. Install Node.js dependencies:
+    ```bash
+    npm install
+    ```
+ 3. Update the `config.env` file with your Deepgram API key:
+    - Open `config.env` in a text editor.
+    - Replace `<deepgram_api_key>` with your actual Deepgram API key.
+ 4. Start the Express server:
+    ```bash
+    node app.js
+    ```
+
+ ---
+
+ ### **Step 5: Frontend Setup (React Client)**
+ 1. Open a new terminal and navigate to the `frontend` directory:
+    ```bash
+    cd frontend
+    ```
+ 2. Install client dependencies:
+    ```bash
+    npm install
+    ```
+ 3. Start the client:
+    ```bash
+    npm start
+    ```
+
+ ---
+
+ ### **Step 6: Start the TTS Server**
+ 1. Add your OpenAI API key to the `.env` file:
+    - Open `.env` in a text editor.
+    - Replace `<openai_api_key>` with your actual OpenAI API key.
+ 2. Start the TTS server:
+    ```bash
+    python3 app.py
+    ```
+
+ ---
+
+ ### **Step 7: Test the Full System**
+ Once all servers are running:
+ 1. Access the React client at [http://localhost:3000](http://localhost:3000).
+ 2. Interact with the TTS system via the web interface.
+
+ ---
+
+ ## **Model Used**
+ This project utilizes the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) TTS model hosted on Hugging Face. The model generates high-quality, realtime text-to-speech output.
+
+ ---
+
+ ## **Key Features**
+ 1. **Realtime TTS Generation**: Convert text input into speech with minimal latency.
+ 2. **React Client**: A user-friendly frontend for interaction.
+ 3. **Express Backend**: Handles API requests and integrates the TTS system with external services.
+ 4. **gRPC Communication**: Streaming communication between the TTS server and the backend.
+ 5. **Configurable APIs**: Supports OpenAI and Deepgram API integrations.
+
+ ---
+
+ ## **Dependencies**
+
+ ### Python:
+ - `torch`, `torchvision`, `torchaudio`
+ - `phonemizer`
+ - `transformers`
+ - `scipy`
+ - `munch`
+ - `python-dotenv`
+ - `openai`
+ - `grpcio`, `grpcio-tools`
+ - `espeak` (system package; see **Installing eSpeak** above)
+
+ ### Node.js:
+ - Express server dependencies (`npm install` in `backend`).
+ - React client dependencies (`npm install` in `frontend`).
+
+ ---
+
+ ## **Contributing**
+ Contributions are welcome! Feel free to fork this repository and create a pull request with your improvements.
+
+ ---
+
+ ## **Acknowledgments**
+ - [Hugging Face](https://huggingface.co/) for hosting the Kokoro-82M model.
+ - The communities behind PyTorch, OpenAI, and Deepgram.
__pycache__/chat_database.cpython-310.pyc ADDED
Binary file (1.58 kB).

__pycache__/chat_database.cpython-313.pyc ADDED
Binary file (2.87 kB).

__pycache__/grpc.cpython-310.pyc ADDED
Binary file (4.17 kB).

__pycache__/grpc.cpython-313.pyc ADDED
Binary file (7.31 kB).

__pycache__/grpc_code.cpython-310.pyc ADDED
Binary file (4.18 kB).

__pycache__/istftnet.cpython-310.pyc ADDED
Binary file (16.5 kB).

__pycache__/istftnet.cpython-312.pyc ADDED
Binary file (30.6 kB).

__pycache__/istftnet.cpython-313.pyc ADDED
Binary file (30.5 kB).

__pycache__/kokoro.cpython-310.pyc ADDED
Binary file (7.49 kB).

__pycache__/kokoro.cpython-312.pyc ADDED
Binary file (13.7 kB).

__pycache__/kokoro.cpython-313.pyc ADDED
Binary file (13.8 kB).

__pycache__/models.cpython-310.pyc ADDED
Binary file (12.7 kB).

__pycache__/models.cpython-312.pyc ADDED
Binary file (25.8 kB).

__pycache__/models.cpython-313.pyc ADDED
Binary file (25.9 kB).

__pycache__/plbert.cpython-310.pyc ADDED
Binary file (957 Bytes).

__pycache__/plbert.cpython-312.pyc ADDED
Binary file (1.15 kB).

__pycache__/plbert.cpython-313.pyc ADDED
Binary file (1.22 kB).

__pycache__/queue.cpython-310.pyc ADDED
Binary file (134 Bytes).

__pycache__/text_to_speech_pb2.cpython-310.pyc ADDED
Binary file (1.67 kB).

__pycache__/text_to_speech_pb2.cpython-313.pyc ADDED
Binary file (2.27 kB).

__pycache__/text_to_speech_pb2_grpc.cpython-310.pyc ADDED
Binary file (3.17 kB).

__pycache__/text_to_speech_pb2_grpc.cpython-313.pyc ADDED
Binary file (4.43 kB).
app.py ADDED
@@ -0,0 +1,206 @@
+ from concurrent import futures
+ import torch
+ from models import build_model
+ import numpy as np
+ import re
+ import wave
+ from kokoro import generate
+ from openai import OpenAI
+ from collections import deque
+ import grpc
+ import text_to_speech_pb2
+ import text_to_speech_pb2_grpc
+ import io
+ from dotenv import load_dotenv
+ import os
+ from chat_database import save_chat_entry, get_chat_history
+
+ load_dotenv()
+
+ # Device configuration
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+ # Load the Kokoro model
+ MODEL = build_model('kokoro-v0_19.pth', device)
+
+ # Specify the voice name and load the voice pack
+ VOICE_NAME = [
+     'af',
+     'af_bella', 'af_sarah', 'am_adam', 'am_michael',
+     'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
+     'af_nicole', 'af_sky',
+ ][0]
+ VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
+
+ client = OpenAI(
+     api_key=os.getenv("OPENAI_API_KEY")
+ )
+
+ def chunk_text(text, max_chars=2040):
+     # Split text into sentence-aligned chunks no longer than max_chars.
+     sentences = re.split(r'(?<=[.!?])\s+', text)
+     chunks = []
+     current_chunk = []
+     current_length = 0
+     for sentence in sentences:
+         sentence_length = len(sentence)
+         if current_length + sentence_length <= max_chars:
+             current_chunk.append(sentence)
+             current_length += sentence_length
+         else:
+             if current_chunk:
+                 chunks.append(' '.join(current_chunk))
+             current_chunk = [sentence]
+             current_length = sentence_length
+     if current_chunk:
+         chunks.append(' '.join(current_chunk))
+     return chunks
+
+ def generate_audio_from_chunks(text, model, voicepack, voice_name):
+     # Synthesize each chunk and concatenate the audio; chunks that fail are skipped.
+     chunks = chunk_text(text)
+     combined_audio = np.array([])
+     for chunk in chunks:
+         try:
+             audio, _ = generate(model, chunk, voicepack, lang=voice_name[0])
+             combined_audio = np.concatenate([combined_audio, audio]) if combined_audio.size > 0 else audio
+         except Exception:
+             pass
+     return combined_audio
+
+ def save_audio_to_file(audio_data, file_number, sample_rate=24000):
+     filename = f"output-{file_number}.wav"
+     with wave.open(filename, 'wb') as wav_file:
+         wav_file.setnchannels(1)
+         wav_file.setsampwidth(2)
+         wav_file.setframerate(sample_rate)
+         audio_int16 = (audio_data * 32767).astype(np.int16)
+         wav_file.writeframes(audio_int16.tobytes())
+     return filename
+
+ def getResponse(text, session_id):
+     try:
+         chat_history = get_chat_history(session_id)
+         response = client.chat.completions.create(
+             model='gpt-3.5-turbo',
+             messages=chat_history,
+             stream=True
+         )
+         return response
+     except Exception as e:
+         print("Error in getResponse:", e)
+
+ def get_audio_bytes(audio_data, sample_rate=24000):
+     # Encode float audio as 16-bit PCM WAV and return the raw bytes.
+     wav_bytes = io.BytesIO()
+     with wave.open(wav_bytes, 'wb') as wav_file:
+         wav_file.setnchannels(1)
+         wav_file.setsampwidth(2)
+         wav_file.setframerate(sample_rate)
+         audio_int16 = (audio_data * 32767).astype(np.int16)
+         wav_file.writeframes(audio_int16.tobytes())
+     wav_bytes.seek(0)
+     return wav_bytes.read()
+
+ def dummy_bytes():
+     buffer = io.BytesIO()
+     dummy_data = b"This is a test of dummy byte data."
+     buffer.write(dummy_data)
+     buffer.seek(0)
+     byte_value = buffer.getvalue()
+     return byte_value
+
+
+ class TextToSpeechServicer(text_to_speech_pb2_grpc.TextToSpeechServiceServicer):
+     def ProcessText(self, request_iterator, context):
+         try:
+             print("Received new request")
+             parameters = {
+                 "processing_active": False,
+                 "queue": deque(),
+                 "file_number": 0,
+                 "session_id": "",
+                 "interrupt_seq": 0
+             }
+             for request in request_iterator:
+                 field = request.WhichOneof('request_data')
+                 if field == 'metadata':
+                     parameters["session_id"] = request.metadata.session_id
+                     continue
+                 elif field == 'text':
+                     text = request.text
+                     if not text:
+                         continue
+                     save_chat_entry(parameters["session_id"], "user", text)
+                     parameters["queue"].clear()
+                     yield text_to_speech_pb2.ProcessTextResponse(
+                         buffer=dummy_bytes(),
+                         session_id=parameters["session_id"],
+                         sequence_id="-2",
+                         transcript=text,
+                     )
+                     final_response = ""
+                     response = getResponse(text, parameters["session_id"])
+                     for chunk in response:
+                         msg = chunk.choices[0].delta.content
+                         if msg:
+                             final_response += msg
+                             if final_response.endswith(('.', '!', '?')):
+                                 parameters["file_number"] += 1
+                                 parameters["queue"].append((final_response, parameters["file_number"]))
+                                 final_response = ""
+                                 if not parameters["processing_active"]:
+                                     yield from self.process_queue(parameters)
+
+                     if final_response:
+                         parameters["file_number"] += 1
+                         parameters["queue"].append((final_response, parameters["file_number"]))
+                         if not parameters["processing_active"]:
+                             yield from self.process_queue(parameters)
+
+                 elif field == 'status':
+                     transcript = request.status.transcript
+                     played_seq = request.status.played_seq
+                     interrupt_seq = request.status.interrupt_seq
+                     parameters["interrupt_seq"] = interrupt_seq
+                     save_chat_entry(parameters["session_id"], "assistant", transcript)
+                     continue
+                 else:
+                     continue
+         except Exception as e:
+             print("Error in ProcessText:", e)
+
+     def process_queue(self, parameters):
+         try:
+             while True:
+                 if not parameters["queue"]:
+                     parameters["processing_active"] = False
+                     break
+                 parameters["processing_active"] = True
+                 sentence, file_number = parameters["queue"].popleft()
+                 # Drop sentences the client has already interrupted past.
+                 if file_number <= int(parameters["interrupt_seq"]):
+                     continue
+                 combined_audio = generate_audio_from_chunks(sentence, MODEL, VOICEPACK, VOICE_NAME)
+                 audio_bytes = get_audio_bytes(combined_audio)
+                 # filename = save_audio_to_file(combined_audio, file_number)
+                 yield text_to_speech_pb2.ProcessTextResponse(
+                     buffer=audio_bytes,
+                     session_id=parameters["session_id"],
+                     sequence_id=str(file_number),
+                     transcript=sentence,
+                 )
+         except Exception as e:
+             parameters["processing_active"] = False
+             print("Error in process_queue:", e)
+
+
+ def serve():
+     print("Starting gRPC server...")
+     server = grpc.server(futures.ThreadPoolExecutor(max_workers=1))
+     text_to_speech_pb2_grpc.add_TextToSpeechServiceServicer_to_server(TextToSpeechServicer(), server)
+     server.add_insecure_port('[::]:8081')
+     server.start()
+     print("gRPC server is running on port 8081")
+     server.wait_for_termination()
+
+
+ if __name__ == "__main__":
+     serve()
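For reference, the gRPC stream above can be exercised without the Node backend. Below is a minimal client sketch (hypothetical; no such file ships in this commit) that assumes the `text_to_speech_pb2`/`text_to_speech_pb2_grpc` stubs generated from `backend/handle-realtime-tts/text_to_speech.proto` and the server running on `localhost:8081`:

```python
# Hypothetical smoke-test client for the ProcessText bidirectional stream in app.py.
import grpc
import text_to_speech_pb2
import text_to_speech_pb2_grpc

def request_stream():
    # The server expects a metadata message first, then one or more text messages.
    yield text_to_speech_pb2.ProcessTextRequest(
        metadata=text_to_speech_pb2.Meta(session_id="demo-session"))
    yield text_to_speech_pb2.ProcessTextRequest(text="Hello there!")

def main():
    with grpc.insecure_channel("localhost:8081") as channel:
        stub = text_to_speech_pb2_grpc.TextToSpeechServiceStub(channel)
        for response in stub.ProcessText(request_stream()):
            # sequence_id "-2" echoes the recognized text; positive ids carry WAV audio.
            print(response.session_id, response.sequence_id, len(response.buffer))

if __name__ == "__main__":
    main()
```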
backend/.DS_Store ADDED
Binary file (6.15 kB).
 
backend/.gitignore ADDED
@@ -0,0 +1,2 @@
+ /node_modules
+ .DS_Store
backend/app.js ADDED
@@ -0,0 +1,22 @@
+ const express = require("express");
+ const cors = require("cors");
+ const bodyParser = require("body-parser");
+ const app = express();
+ require("express-ws")(app);
+ app.use(express.json());
+ app.use(cors());
+ app.use(express.urlencoded({ extended: true }));
+ app.use(bodyParser.json());
+ const port = 8080;
+
+ const { audio_stream } = require("./handle-realtime-tts/sttModelSocket.js");
+
+ app.get("/health", (req, res) => {
+   res.send("Green");
+ });
+
+ app.ws("/v2v", audio_stream);
+
+ app.listen(port, () => {
+   console.log(`Example app listening at http://localhost:${port}`);
+ });
backend/config.env ADDED
@@ -0,0 +1 @@
+ DEEPGRAM_KEY = <deepgram_api_key>
backend/config.js ADDED
@@ -0,0 +1,7 @@
+ require("dotenv").config({ path: "./config.env" });
+
+ const deepgram_key = process.env.DEEPGRAM_KEY;
+
+ module.exports = {
+   deepgram_key
+ };
backend/handle-realtime-tts/cleangRPCconnections.js ADDED
@@ -0,0 +1,47 @@
+ const cleanupConnection = async (session) => {
+   try {
+     if (session.channel) {
+       const state = session.channel.getConnectivityState(false);
+       console.log(`Client : ${state}`);
+       if (state !== 4) {
+         console.log("Closing call and client.");
+         session.client.close();
+         session.call.end();
+         session.client = null;
+         session.call = null;
+       }
+     } else {
+       try {
+         if (session.client) {
+           session.client.close();
+           if (session.call) {
+             session.call.end();
+           }
+           session.call = null;
+           session.client = null;
+           session.channel = null;
+         }
+       } catch (err) {
+         session.call = null;
+         session.client = null;
+         session.channel = null;
+       }
+     }
+     console.log("gRPC connection ended.");
+   } catch (err) {
+     if (session.call) {
+       session.call.end();
+     }
+     session.call = null;
+     console.log("Error ending gRPC connection: ", err);
+   } finally {
+     if (session.call) {
+       session.call.end();
+     }
+     session.call = null;
+     session.client = null;
+     session.channel = null;
+   }
+ };
+
+ module.exports = { cleanupConnection };
backend/handle-realtime-tts/makegRPCconnection.js ADDED
@@ -0,0 +1,40 @@
+ const grpc = require("@grpc/grpc-js");
+ const protoLoader = require("@grpc/proto-loader");
+ const path = require("path");
+
+ const getgRPCConnection = (session) => {
+   return new Promise((resolve, reject) => {
+     protoLoader
+       .load(path.join(__dirname, "text_to_speech.proto"), {
+         keepCase: true,
+         longs: String,
+         enums: String,
+         defaults: true,
+         oneofs: true,
+       })
+       .then((packageDefinition) => {
+         const textToSpeechProto = grpc.loadPackageDefinition(packageDefinition).texttospeech;
+         const client = new textToSpeechProto.TextToSpeechService(
+           "localhost:8081",
+           grpc.credentials.createInsecure()
+         );
+         session.client = client;
+         session.channel = client.getChannel();
+         console.log("Made connection");
+
+         const call = client.ProcessText();
+         resolve(call);
+       })
+       .catch((error) => {
+         session.client = null;
+         console.error("Error loading proto file:", error);
+         reject(new Error("Error in making gRPC Connection."));
+       });
+   });
+ };
+
+ module.exports = { getgRPCConnection };
backend/handle-realtime-tts/sttModelSocket.js ADDED
@@ -0,0 +1,289 @@
+ const isBuffer = require("is-buffer");
+ const { Buffer } = require("buffer");
+ const { deepgram_key } = require("../config");
+ const Session = require("../utils/session.js");
+ const { cleanupConnection } = require("./cleangRPCconnections.js");
+ const { getgRPCConnection } = require("./makegRPCconnection.js");
+ const { updateChathistory } = require("../providers/updateChathistory.js");
+ const { createClient, LiveTranscriptionEvents } = require("@deepgram/sdk");
+ const deepgram = createClient(deepgram_key);
+
+ const audio_stream = async (wss, req) => {
+   try {
+     const session = new Session();
+
+     wss.send(JSON.stringify({ type: "initial", msg: "connected" }));
+
+     const connection = deepgram.listen.live({
+       punctuate: true,
+       interim_results: true,
+       speech_final: true,
+       encoding: "linear16",
+       sample_rate: 16000,
+       model: "nova-2",
+       version: "latest",
+     });
+
+     const callMLServer = async (text) => {
+       try {
+         session.call.write({ text: text });
+       } catch (error) {
+         console.error("Error in calling ml server : ", error);
+       }
+     };
+
+     connection.on(LiveTranscriptionEvents.Open, () => {
+       console.log(LiveTranscriptionEvents.Open);
+       connection.on(LiveTranscriptionEvents.Close, () => {
+         console.log("Connection closed.");
+       });
+
+       connection.on(LiveTranscriptionEvents.Transcript, (data) => {
+         const text = data?.channel?.alternatives[0]?.transcript;
+         if (data.is_final && data.speech_final && text) {
+           console.log("Response : ", text);
+           callMLServer(text);
+         }
+       });
+
+       connection.on(LiveTranscriptionEvents.Metadata, (data) => {
+         console.log(data);
+       });
+
+       connection.on(LiveTranscriptionEvents.Error, (err) => {
+         console.error(err);
+       });
+     });
+
+     wss.on("message", async (message) => {
+       try {
+         if (isBuffer(message) && session.call) {
+           // Forward raw audio buffers to Deepgram for live transcription.
+           try {
+             if (connection && connection.getReadyState() == 1) {
+               connection.send(message);
+             }
+           } catch (error) {
+             console.log("Error sending buffer to deepgram : ", error);
+           }
+         }
+
+         // Handle messages that are not buffers (JSON control messages).
+         if (typeof message === "string") {
+           try {
+             const data = JSON.parse(message);
+             const { type, msg } = data;
+
+             switch (type) {
+               case "start":
+                 session.starttime = Date.now();
+                 session.chathistory = [];
+                 session.chathistorybackup = [];
+                 console.log("Making Connection with gRPC...");
+                 try {
+                   console.time("grpcconnection");
+                   session.call = await getgRPCConnection(session);
+                   console.timeEnd("grpcconnection");
+                   const state = session.channel.getConnectivityState(false);
+                   console.log(`Client : ${state}`);
+                   session.saved = false;
+                   wss.send(JSON.stringify({ type: "ready", msg: "connected" }));
+                   console.log("Connected to gRPC.");
+
+                   const { sessionId } = JSON.parse(msg);
+                   const metadata = {
+                     metadata: {
+                       session_id: sessionId,
+                     },
+                   };
+                   if (session.call) {
+                     console.log("Sending metadata.");
+                     session.call.write(metadata);
+                   }
+                 } catch (err) {
+                   await cleanupConnection(session);
+                   console.error("Error in making gRPC Connection. : ", err);
+                 }
+                 session.call.on("data", (response) => {
+                   console.log("Data : ", response);
+
+                   const { session_id, sequence_id, transcript, buffer } = response;
+
+                   const metadata = JSON.stringify({
+                     session_id: session_id,
+                     sequence_id: sequence_id,
+                     transcript: transcript,
+                   });
+
+                   if (sequence_id === "-2") {
+                     session.latency = Date.now();
+                     wss.send(JSON.stringify({ type: "clear", msg: "clear" }));
+                     session.chathistory = [...session.chathistorybackup];
+                     wss.send(
+                       JSON.stringify({
+                         type: "chathistory",
+                         msg: session.chathistorybackup,
+                       })
+                     );
+                     const wavBuffer = Buffer.concat([
+                       Buffer.from(metadata),
+                       Buffer.from([0]),
+                       buffer,
+                     ]);
+
+                     const base64buffer = wavBuffer.toString("base64");
+                     wss.send(JSON.stringify({ type: "media", msg: base64buffer }));
+                     session.chathistory.push({
+                       speaker: "USER",
+                       content: transcript,
+                     });
+                     wss.send(
+                       JSON.stringify({
+                         type: "chathistory",
+                         msg: session.chathistory,
+                       })
+                     );
+                     session.chathistorybackup.push({
+                       speaker: "USER",
+                       content: transcript,
+                     });
+                     return;
+                   }
+
+                   if (sequence_id === "0") {
+                     wss.send(JSON.stringify({ type: "pause", msg: "pause" }));
+                     session.cansend = false;
+                     return;
+                   }
+
+                   if (sequence_id === "-1") {
+                     wss.send(JSON.stringify({ type: "continue", msg: "continue" }));
+                     return;
+                   }
+
+                   if (sequence_id === "1") {
+                     const latency = Date.now() - session.latency;
+                     console.log("First Response Latency: ", latency, "ms");
+                     session.latency = 0;
+                     session.cansend = true;
+                   }
+
+                   if (!buffer) {
+                     return;
+                   }
+
+                   if (!session.cansend && sequence_id !== "0") {
+                     return;
+                   }
+
+                   // Combine JSON metadata, a null-byte separator, and the PCM data into a single buffer.
+                   const wavBuffer = Buffer.concat([
+                     Buffer.from(metadata),
+                     Buffer.from([0]),
+                     buffer,
+                   ]);
+
+                   const base64buffer = wavBuffer.toString("base64");
+                   wss.send(JSON.stringify({ type: "media", msg: base64buffer }));
+
+                   updateChathistory(transcript, false, session);
+
+                   wss.send(
+                     JSON.stringify({
+                       type: "chathistory",
+                       msg: session.chathistory,
+                     })
+                   );
+                 });
+
+                 session.call.on("end", async () => {
+                   console.log("Ended");
+                   await cleanupConnection(session);
+                   try {
+                     wss.send(JSON.stringify({ type: "end", msg: "end" }));
+                   } catch (err) { }
+                   console.log("Stream ended");
+                 });
+
+                 session.call.on("error", async (error) => {
+                   console.error(`Stream error: ${error}`);
+                   try {
+                     wss.send(JSON.stringify({ type: "end", msg: "end" }));
+                   } catch (err) { }
+                   await cleanupConnection(session);
+                 });
+                 break;
+
+               case "status":
+                 const { session_id, sequence_id, transcript } = msg;
+                 const status = {
+                   status: {
+                     transcript: transcript,
+                     played_seq: sequence_id,
+                     interrupt_seq: sequence_id,
+                   },
+                 };
+
+                 if (session.call) {
+                   session.call.write(status);
+                 }
+
+                 updateChathistory(transcript, true, session);
+                 break;
+
+               case "stop":
+                 console.log("Client stopped the stream.");
+                 await cleanupConnection(session);
+                 break;
+               default:
+                 console.log("Type not handled.");
+             }
+           } catch (err) {
+             console.log(`Not a valid json : ${err}`);
+           }
+         }
+       } catch (err) {
+         console.error(`Error in wss.onmessage : ${err}`);
+       }
+     });
+
+     wss.on("close", async () => {
+       await cleanupConnection(session);
+       console.log("WebSocket connection closed.");
+     });
+
+     wss.on("error", async (err) => {
+       console.error(`WebSocket error: ${err}`);
+       await cleanupConnection(session);
+     });
+   } catch (err) {
+     try {
+       console.log(err);
+       wss.send(JSON.stringify({ type: "end", msg: "end" }));
+     } catch (err) { }
+   }
+ };
+
+ module.exports = { audio_stream };
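The `media` frames built above concatenate a JSON metadata header, a single null byte as separator, and the WAV payload, then base64-encode the result. The React client that consumes them is not part of this commit; as a rough sketch of the decoding any consumer would have to do, assuming only the framing visible above:

```python
# Sketch: split one base64 "media" frame back into metadata and WAV bytes.
import base64
import json

def decode_media_frame(msg_b64: str):
    raw = base64.b64decode(msg_b64)
    sep = raw.index(0)                 # first null byte terminates the metadata
    metadata = json.loads(raw[:sep])   # {"session_id", "sequence_id", "transcript"}
    wav_bytes = raw[sep + 1:]          # playable WAV payload
    return metadata, wav_bytes
```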
backend/handle-realtime-tts/text_to_speech.proto ADDED
@@ -0,0 +1,32 @@
+ syntax = "proto3";
+
+ package texttospeech;
+
+ service TextToSpeechService {
+   rpc ProcessText (stream ProcessTextRequest) returns (stream ProcessTextResponse);
+ }
+
+ message ProcessTextRequest {
+   oneof request_data {
+     string text = 1;
+     Meta metadata = 2;
+     Status status = 3;
+   }
+ }
+
+ message ProcessTextResponse {
+   bytes buffer = 1;
+   string session_id = 2;
+   string sequence_id = 3;
+   string transcript = 4;
+ }
+
+ message Meta {
+   string session_id = 1;
+ }
+
+ message Status {
+   string transcript = 1;
+   string played_seq = 2;
+   string interrupt_seq = 3;
+ }
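The `text_to_speech_pb2` and `text_to_speech_pb2_grpc` modules imported by `app.py` are generated from this file. A sketch of regenerating them with `grpcio-tools` (assuming it is installed and the script is run from the repository root):

```python
# Regenerate the Python gRPC stubs used by app.py from text_to_speech.proto.
from grpc_tools import protoc

protoc.main([
    "protoc",
    "-Ibackend/handle-realtime-tts",
    "--python_out=.",
    "--grpc_python_out=.",
    "backend/handle-realtime-tts/text_to_speech.proto",
])
```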
backend/package-lock.json ADDED
The diff for this file is too large to render. See raw diff
 
backend/package.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "name": "backend",
+   "version": "1.0.0",
+   "description": "",
+   "main": "index.js",
+   "scripts": {
+     "test": "echo \"Error: no test specified\" && exit 1"
+   },
+   "keywords": [],
+   "author": "",
+   "license": "ISC",
+   "dependencies": {
+     "@deepgram/sdk": "^3.9.0",
+     "@geckos.io/server": "^3.0.0",
+     "@grpc/grpc-js": "^1.11.3",
+     "axios": "^1.7.9",
+     "bcryptjs": "^2.4.3",
+     "cors": "^2.8.5",
+     "crypto": "^1.0.1",
+     "dotenv": "^16.4.5",
+     "express": "^4.21.0",
+     "express-ws": "^5.0.2",
+     "is-buffer": "^2.0.5",
+     "jsonwebtoken": "^9.0.2",
+     "module": "^1.2.5"
+   }
+ }
backend/providers/updateChathistory.js ADDED
@@ -0,0 +1,46 @@
+ const updateChathistory = (transcript, backup, session) => {
+   try {
+     if (backup) {
+       if (
+         session.chathistorybackup.length > 0 &&
+         session.chathistorybackup[session.chathistorybackup.length - 1].speaker === "USER"
+       ) {
+         session.chathistorybackup.push({
+           speaker: "AI",
+           content: ``,
+         });
+       }
+       if (
+         session.chathistory &&
+         session.chathistory.length > 0 &&
+         session.chathistory[session.chathistory.length - 1].speaker === "AI"
+       ) {
+         session.chathistorybackup[session.chathistorybackup.length - 1].content += ` ${transcript}`;
+       }
+     } else if (!backup) {
+       if (
+         session.chathistory.length > 0 &&
+         session.chathistory[session.chathistory.length - 1].speaker === "USER"
+       ) {
+         session.chathistory.push({ speaker: "AI", content: `` });
+       }
+       if (
+         session.chathistory &&
+         session.chathistory.length > 0 &&
+         session.chathistory[session.chathistory.length - 1].speaker === "AI"
+       ) {
+         session.chathistory[session.chathistory.length - 1].content += ` ${transcript}`;
+       }
+     }
+   } catch (error) {
+     console.log("Error in updating chathistory : ", error);
+   }
+ };
+
+ module.exports = { updateChathistory };
backend/utils/session.js ADDED
@@ -0,0 +1,15 @@
+ class Session {
+   constructor() {
+     this.call = null;
+     this.client = null;
+     this.channel = null;
+     this.saved = false;
+     this.starttime = Date.now();
+     this.cansend = false;
+     this.chathistory = [];
+     this.chathistorybackup = [];
+     this.latency = 0;
+   }
+ }
+
+ module.exports = Session;
chat_database.py ADDED
@@ -0,0 +1,64 @@
+ import pickle
+ import os
+
+ FILE_PATH = "chat_history.pkl"
+
+ if not os.path.exists(FILE_PATH):
+     with open(FILE_PATH, "wb") as file:
+         pickle.dump({}, file)
+
+ def save_chat_entry(session_id, role, transcript):
+     try:
+         with open(FILE_PATH, "rb") as file:
+             data = pickle.load(file)
+
+         if session_id not in data:
+             data[session_id] = []
+
+         if role == "user":
+             data[session_id].append({
+                 "role": role,
+                 "transcript": transcript
+             })
+         else:
+             # Consecutive assistant chunks are merged into a single turn.
+             if data[session_id] and data[session_id][-1]['role'] == "assistant":
+                 data[session_id][-1]['transcript'] += " " + transcript
+             else:
+                 data[session_id].append({
+                     "role": role,
+                     "transcript": transcript
+                 })
+
+         with open(FILE_PATH, "wb") as file:
+             pickle.dump(data, file)
+
+     except Exception as e:
+         print(f"Error saving chat entry: {e}")
+
+
+ def get_chat_history(session_id):
+     try:
+         with open(FILE_PATH, "rb") as file:
+             data = pickle.load(file)
+
+         chat_history = data.get(session_id, [])
+
+         if not chat_history:
+             return []
+
+         message_history = []
+         for entry in chat_history:
+             role = entry.get('role', '')
+             transcript = entry.get('transcript', '')
+             if role and transcript:
+                 message_history.append({"role": role, "content": transcript})
+
+         return message_history
+
+     except (FileNotFoundError, pickle.UnpicklingError) as e:
+         print(f"Error reading or parsing the file: {e}")
+         return []
+     except Exception as e:
+         print(f"Unexpected error: {e}")
+         return []
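A short usage sketch of the two helpers above (the session id is arbitrary): histories are keyed by session id, consecutive assistant entries are merged into one turn, and `get_chat_history` returns OpenAI-style messages:

```python
from chat_database import save_chat_entry, get_chat_history

save_chat_entry("demo-session", "user", "What is Kokoro?")
save_chat_entry("demo-session", "assistant", "Kokoro is an 82M-parameter TTS model.")
save_chat_entry("demo-session", "assistant", "It is Apache licensed.")  # merged into the previous assistant turn

print(get_chat_history("demo-session"))
# [{'role': 'user', 'content': 'What is Kokoro?'},
#  {'role': 'assistant', 'content': 'Kokoro is an 82M-parameter TTS model. It is Apache licensed.'}]
```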
chat_history.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d229c43b598eacb4620b5bf033308ff27c1e9979af506afd1e58d7e6ba24c9da
+ size 12508
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "decoder": {
+     "type": "istftnet",
+     "upsample_kernel_sizes": [20, 12],
+     "upsample_rates": [10, 6],
+     "gen_istft_hop_size": 5,
+     "gen_istft_n_fft": 20,
+     "resblock_dilation_sizes": [
+       [1, 3, 5],
+       [1, 3, 5],
+       [1, 3, 5]
+     ],
+     "resblock_kernel_sizes": [3, 7, 11],
+     "upsample_initial_channel": 512
+   },
+   "dim_in": 64,
+   "dropout": 0.2,
+   "hidden_dim": 512,
+   "max_conv_dim": 512,
+   "max_dur": 50,
+   "multispeaker": true,
+   "n_layer": 3,
+   "n_mels": 80,
+   "n_token": 178,
+   "style_dim": 128
+ }
demo/HEARME.txt ADDED
@@ -0,0 +1,47 @@
+ Kokoro is a frontier TTS model for its size of 82 million parameters.
+
+ On the 25th of December, 2024, Kokoro v0 point 19 weights were permissively released in full fp32 precision along with 2 voicepacks (Bella and Sarah), all under an Apache 2 license.
+
+ At the time of release, Kokoro v0 point 19 was the number 1 ranked model in TTS Spaces Arena. With 82 million parameters trained for under 20 epics on under 100 total hours of audio, Kokoro achieved higher Eelo in this single-voice Arena setting, over larger models. Kokoro's ability to top this Eelo ladder using relatively low compute and data, suggests that the scaling law for traditional TTS models might have a steeper slope than previously expected.
+
+ Licenses. Apache 2 weights in this repository. MIT inference code. GPLv3 dependency in espeak NG.
+
+ The inference code was originally MIT licensed by the paper author. Note that this card applies only to this model, Kokoro.
+
+ Evaluation. Metric: Eelo rating. Leaderboard: TTS Spaces Arena.
+
+ The voice ranked in the Arena is a 50 50 mix of Bella and Sarah. For your convenience, this mix is included in this repository as A-F dot PT, but you can trivially re-produce it.
+
+ Training Details.
+
+ Compute: Kokoro was trained on "A100 80GB v-ram instances" rented from Vast.ai. Vast was chosen over other compute providers due to its competitive on-demand hourly rates. The average hourly cost for the A100 80GB v-ram instances used for training was below $1 per hour per GPU, which was around half the quoted rates from other providers at the time.
+
+ Data: Kokoro was trained exclusively on permissive non-copyrighted audio data and IPA phoneme labels. Examples of permissive non-copyrighted audio include:
+
+ Public domain audio. Audio licensed under Apache, MIT, etc.
+
+ Synthetic audio[1] generated by closed[2] TTS models from large providers.
+
+ Epics: Less than 20 Epics. Total Dataset Size: Less than 100 hours of audio.
+
+ Limitations. Kokoro v0 point 19 is limited in some ways, in its training set and architecture:
+
+ Lacks voice cloning capability, likely due to small, under 100 hour training set.
+
+ Relies on external g2p, which introduces a class of g2p failure modes.
+
+ Training dataset is mostly long-form reading and narration, not conversation.
+
+ At 82 million parameters, Kokoro almost certainly falls to a well-trained 1B+ parameter diffusion transformer, or a many-billion-parameter M LLM like GPT 4o or Gemini 2 Flash.
+
+ Multilingual capability is architecturally feasible, but training data is almost entirely English.
+
+ Will the other voicepacks be released?
+
+ There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo.
+
+ Acknowledgements. yL4 5 7 9 for architecting StyleTTS 2.
+
+ Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.
+
+ Model Card Contact. @rzvzn on Discord.
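(The spellings above, such as "Eelo" and "epics", are deliberate: this file is the script read aloud in HEARME.wav.) The card notes the Arena voice is a 50 50 mix of Bella and Sarah that you can trivially re-produce. A hypothetical sketch of one way to do that, assuming voicepacks are plain tensors of identical shape, as they are loaded in `app.py`:

```python
# Hypothetical sketch: build a 50/50 Bella/Sarah mix by averaging the voicepacks.
import torch

bella = torch.load("voices/af_bella.pt", weights_only=True)
sarah = torch.load("voices/af_sarah.pt", weights_only=True)
mix = torch.mean(torch.stack([bella, sarah]), dim=0)
torch.save(mix, "voices/af.pt")
```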
demo/HEARME.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a66b2d1022df066f39773d650ff6c523d635407e88054c02066a821d50246800
+ size 3407872
demo/TTS-Spaces-Arena-25-Dec-2024.png ADDED
Git LFS Details
  • SHA256: e78b5ec1557323fa0e62681c83f6b81777f9834b91bbf26bf7567b036f011d52
  • Pointer size: 132 Bytes
  • Size of remote file: 1.07 MB
demo/af_sky.txt ADDED
@@ -0,0 +1,11 @@
+ Last September, I received an offer from Sam Altman, who wanted to hire me to voice the current ChatGPT 4 system. He told me that he felt that by my voicing the system, I could bridge the gap between tech companies and creatives and help consumers to feel comfortable with the seismic shift concerning humans and AI. He said he felt that my voice would be comforting to people.
+
+ After much consideration and for personal reasons, I declined the offer. Nine months later, my friends, family and the general public all noted how much the newest system named Sky sounded like me.
+
+ When I heard the released demo, I was shocked, angered and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news ou'tlits could not tell the difference. Mr. Altman even insinuated that the similarity was intentional, tweeting a single word — hur — a reference to the film in which I voiced a chat system, Samantha, who forms an intimate relationship with a human.
+
+ Two days before the ChatGPT 4 demo was released, Mr. Altman contacted my agent, asking me to reconsider. Before we could connect, the system was out there.
+
+ As a result of their actions, I was forced to hire legal counsel, who wrote two letters to Mr. Altman and OpenAI, setting out what they had done and asking them to detail the exact process by which they created the Sky voice. Consequently, OpenAI reluctantly agreed to take down the Sky voice.
+
+ In a time when we are all grappling with deepfakes and the protection of our own likeness, our own work, our own identities, I believe these are questions that deserve absolute clarity. I look forward to resolution in the form of transparency and the passage of appropriate legislation to help ensure that individual rights are protected.
demo/af_sky.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:96491c978f85727b49b638beb94ad591a3830ce4a657db02d740acd61ec4322c
+ size 3407872
demo/restoring-sky.md ADDED
@@ -0,0 +1,42 @@
+ # Restoring Sky & reflecting on Kokoro
+
+ <img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />
+
+ For those who don't know, [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) is an Apache TTS model that uses a skinny version of the open [StyleTTS 2](https://github.com/yl4579/StyleTTS2/tree/main) architecture.
+
+ Based on leaderboard [Elo rating](https://huggingface.co/hexgrad/Kokoro-82M#evaluation) (prior to getting [review bombed](https://huggingface.co/datasets/Pendrokar/TTS_Arena/discussions/2)), Kokoro appears to do more with less, a theme that is surely [top-of-mind](https://huggingface.co/deepseek-ai/DeepSeek-V3) for many. Its peak performance on specific voices is comparable or better than much larger models, but it has not yet been trained on enough data to effectively zero-shot out of distribution (aka voice cloning).
+
+ Tonight on NYE, `af_sky` joins Kokoro's roster of downloadable voices. This follows last night's quiet release of `af_nicole`, and an additional 8 voices are currently available: 2F 2M voices each for American & British English.
+
+ Nicole in particular was trained on ~10 hours of synthetic data, and demonstrates that you _can_ include unique speaking styles in a general-purpose TTS model without affecting the stock voices (even in a low data small model): a good sign for scalability.
+
+ Sky is interesting because it is the voice that ScarJo [got OpenAI to take down](https://x.com/OpenAI/status/1792443575839678909), so new training data cannot be generated. However, OpenAI did not remove 2023 samples of Sky from their [blog post](https://openai.com/index/chatgpt-can-now-see-hear-and-speak/), and along with a few seconds lying around various other parts of the internet, we can cobble together about 3 minutes of 2023 Sky.
+
+ ```sh
+ wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/story-sky.mp3
+ wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/recipe-sky.mp3
+ wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/speech-sky.mp3
+ wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/poem-sky.mp3
+ wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/info-sky.mp3
+ ```
+
+ To be clear, this is not the first attempt to reconstruct Sky. On X, Benjamin De Kraker posted:
+ > Here's the official statement released by Scarlett Johansson, detailing OpenAI's alleged illegal usage of her voice...
+ > ...read by the Sky AI voice, because irony.
+ > https://x.com/BenjaminDEKR/status/1792693868497871086
+
+ and in the replies, he [stated](https://x.com/BenjaminDEKR/status/1792714347275501595):
+ > It's an ElevenLabs clone I made based on Sky audio before they removed it. Not perfect.
+
+ Here is `Kokoro/af_sky`'s rendition of the same:
+ <audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/af_sky.wav" type="audio/wav"></audio>
+
+ A crude reconstruction, but the model that produced that voice is Apache FOSS that can be downloaded from HF and run locally. You can reproduce the above by dragging the [text script](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/demo/af_sky.txt) (note a handful of modified chars for better delivery) into the "Long Form" tab of this [hosted demo](https://huggingface.co/spaces/hexgrad/Kokoro-TTS), or you can download the [model weights](https://huggingface.co/hexgrad/Kokoro-82M), install dependencies and DIY.
+
+ Sky shows that it is possible to reconstruct a voice—maybe a shadow of its former self, but a reconstruction nonetheless—from fairly little training data.
+
+ ### What's next
+
+ Kokoro is a good start, but I can think of some tricks that might make it better, beginning with better data. More on this in another article.
+
+ Feel free to check out [Kokoro's weights](https://huggingface.co/hexgrad/Kokoro-82M), try out a no-install [hosted demo](https://huggingface.co/spaces/hexgrad/Kokoro-TTS), and/or [join the Discord](https://discord.gg/QuGxSWBfQy).