anuragsingh922 committed
Commit d7dfeff · verified · 1 Parent(s): 21a8ffb

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See the raw diff for the full set.

Files changed (50)
  1. .DS_Store +0 -0
  2. .env +1 -0
  3. .gitattributes +4 -0
  4. .gitignore +0 -0
  5. README.md +196 -0
  6. __pycache__/chat_database.cpython-310.pyc +0 -0
  7. __pycache__/chat_database.cpython-313.pyc +0 -0
  8. __pycache__/grpc.cpython-310.pyc +0 -0
  9. __pycache__/grpc.cpython-313.pyc +0 -0
  10. __pycache__/grpc_code.cpython-310.pyc +0 -0
  11. __pycache__/istftnet.cpython-310.pyc +0 -0
  12. __pycache__/istftnet.cpython-312.pyc +0 -0
  13. __pycache__/istftnet.cpython-313.pyc +0 -0
  14. __pycache__/kokoro.cpython-310.pyc +0 -0
  15. __pycache__/kokoro.cpython-312.pyc +0 -0
  16. __pycache__/kokoro.cpython-313.pyc +0 -0
  17. __pycache__/models.cpython-310.pyc +0 -0
  18. __pycache__/models.cpython-312.pyc +0 -0
  19. __pycache__/models.cpython-313.pyc +0 -0
  20. __pycache__/plbert.cpython-310.pyc +0 -0
  21. __pycache__/plbert.cpython-312.pyc +0 -0
  22. __pycache__/plbert.cpython-313.pyc +0 -0
  23. __pycache__/queue.cpython-310.pyc +0 -0
  24. __pycache__/text_to_speech_pb2.cpython-310.pyc +0 -0
  25. __pycache__/text_to_speech_pb2.cpython-313.pyc +0 -0
  26. __pycache__/text_to_speech_pb2_grpc.cpython-310.pyc +0 -0
  27. __pycache__/text_to_speech_pb2_grpc.cpython-313.pyc +0 -0
  28. app.py +206 -0
  29. backend/.DS_Store +0 -0
  30. backend/.gitignore +2 -0
  31. backend/app.js +22 -0
  32. backend/config.env +1 -0
  33. backend/config.js +7 -0
  34. backend/handle-realtime-tts/cleangRPCconnections.js +47 -0
  35. backend/handle-realtime-tts/makegRPCconnection.js +40 -0
  36. backend/handle-realtime-tts/sttModelSocket.js +289 -0
  37. backend/handle-realtime-tts/text_to_speech.proto +32 -0
  38. backend/package-lock.json +0 -0
  39. backend/package.json +27 -0
  40. backend/providers/updateChathistory.js +46 -0
  41. backend/utils/session.js +15 -0
  42. chat_database.py +64 -0
  43. chat_history.pkl +3 -0
  44. config.json +26 -0
  45. demo/HEARME.txt +47 -0
  46. demo/HEARME.wav +3 -0
  47. demo/TTS-Spaces-Arena-25-Dec-2024.png +3 -0
  48. demo/af_sky.txt +11 -0
  49. demo/af_sky.wav +3 -0
  50. demo/restoring-sky.md +42 -0
.DS_Store ADDED
Binary file (8.2 kB).
 
.env ADDED
@@ -0,0 +1 @@
+ OPENAI_API_KEY = <openai_api_key>
.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+ TTS-Spaces-Arena-25-Dec-2024.png filter=lfs diff=lfs merge=lfs -text
+ HEARME.wav filter=lfs diff=lfs merge=lfs -text
+ demo/af_sky.wav filter=lfs diff=lfs merge=lfs -text
+ output.wav filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
File without changes
README.md ADDED
@@ -0,0 +1,196 @@
+ # **Realtime TTS System**
+ This repository contains the complete codebase for building your own realtime Text-to-Speech (TTS) system. It integrates the Kokoro TTS model, gRPC communication, an Express server, and a React-based client. Follow this guide to set up and explore the system.
+
+ ---
+
+ ## **Repository Structure**
+ ```
+ ├── backend/    # Express server for handling API requests
+ ├── frontend/   # React client for user interaction
+ ├── .env        # Environment variables (OpenAI API key, etc.)
+ ├── voices/     # All available voice packs
+ ├── demo/       # Demo files for the model
+ ├── other...
+ ```
+
+ ---
+
+ ## **Setup Guide**
+
+ ### **Step 1: Clone the Repository**
+ Clone this repository to your local machine:
+ ```bash
+ git clone https://huggingface.co/anuragsingh922/realtime-tts
+ cd realtime-tts
+ ```
+
+ ---
+
+ ### **Step 2: Python Virtual Environment Setup**
+ Create a virtual environment to manage dependencies:
+
+ #### macOS/Linux:
+ ```bash
+ python3 -m venv venv
+ source venv/bin/activate
+ ```
+
+ #### Windows:
+ ```bash
+ python -m venv venv
+ venv\Scripts\activate
+ ```
+
+ ---
+
+ ### **Step 3: Install Python Dependencies**
+ With the virtual environment activated, install the required dependencies:
+ ```bash
+ pip install --upgrade pip setuptools wheel
+ pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
+ pip install -r requirements.txt
+ ```
+
+ ### **Installing eSpeak**
+ `eSpeak` is a necessary dependency for the TTS system. Follow the instructions below to install it on your platform:
+
+ #### **Ubuntu/Linux**
+ Use the `apt-get` package manager to install `eSpeak`:
+ ```bash
+ sudo apt-get update
+ sudo apt-get install espeak
+ ```
+
+ #### **macOS**
+ Install `eSpeak` using [Homebrew](https://brew.sh/):
+ 1. Ensure Homebrew is installed on your system:
+    ```bash
+    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
+    ```
+ 2. Install `espeak`:
+    ```bash
+    brew install espeak
+    ```
+
+ #### **Windows**
+ For Windows, follow these steps to install `eSpeak`:
+ 1. Download the eSpeak installer from the official website: [eSpeak Downloads](http://espeak.sourceforge.net/download.html).
+ 2. Run the installer and follow the on-screen instructions to complete the installation.
+ 3. Add the `eSpeak` installation path to your system's `PATH` environment variable:
+    - Open **System Properties** → **Advanced** → **Environment Variables**.
+    - In the "System Variables" section, find the `Path` variable and edit it.
+    - Add the path to the `espeak.exe` file (e.g., `C:\Program Files (x86)\eSpeak`).
+ 4. Verify the installation by opening Command Prompt and running:
+    ```cmd
+    espeak --version
+    ```
+
+ ---
+
+ ### **Verification**
+ After installing `eSpeak`, verify it is correctly set up by running:
+ ```bash
+ espeak "Hello, world!"
+ ```
+
+ This should speak "Hello, world!" through your system audio.
+
+ ---
+
+ ### **Step 4: Backend Setup (Express Server)**
+ 1. Navigate to the `backend` directory:
+    ```bash
+    cd backend
+    ```
+ 2. Install Node.js dependencies:
+    ```bash
+    npm install
+    ```
+ 3. Update the `config.env` file with your Deepgram API key:
+    - Open `config.env` in a text editor.
+    - Replace `<deepgram_api_key>` with your actual Deepgram API key.
+ 4. Start the Express server:
+    ```bash
+    node app.js
+    ```
+
+ ---
+
+ ### **Step 5: Frontend Setup (React Client)**
+ 1. Open a new terminal and navigate to the `frontend` directory:
+    ```bash
+    cd frontend
+    ```
+ 2. Install client dependencies:
+    ```bash
+    npm install
+    ```
+ 3. Start the client:
+    ```bash
+    npm start
+    ```
+
+ ---
+
+ ### **Step 6: Start the TTS Server**
+ 1. Add your OpenAI API key to the `.env` file:
+    - Open `.env` in a text editor.
+    - Replace `<openai_api_key>` with your actual OpenAI API key.
+ 2. Start the TTS server:
+    ```bash
+    python3 app.py
+    ```
+
+ ---
+
+ ### **Step 7: Test the Full System**
+ Once all servers are running:
+ 1. Access the React client at [http://localhost:3000](http://localhost:3000).
+ 2. Interact with the TTS system via the web interface.
+
+ ---
+
+ ## **Model Used**
+ This project utilizes the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) TTS model hosted on Hugging Face. The model generates high-quality, realtime text-to-speech output.
+
+ ---
+
+ ## **Key Features**
+ 1. **Realtime TTS Generation**: Convert text input into speech with minimal latency.
+ 2. **React Client**: A user-friendly frontend for interaction.
+ 3. **Express Backend**: Handles API requests and integrates the TTS system with external services.
+ 4. **gRPC Communication**: Streaming communication between the TTS server and the backend.
+ 5. **Configurable APIs**: Supports OpenAI and Deepgram API integrations.
+
+ ---
+
+ ## **Dependencies**
+
+ ### Python:
+ - `torch`, `torchvision`, `torchaudio`
+ - `phonemizer`
+ - `transformers`
+ - `scipy`
+ - `munch`
+ - `python-dotenv`
+ - `openai`
+ - `grpcio`, `grpcio-tools`
+ - `espeak` (system package; see **Installing eSpeak** above)
+
+ ### Node.js:
+ - Express server dependencies (`npm install` in `backend`).
+ - React client dependencies (`npm install` in `frontend`).
+
+ ---
+
+ ## **Contributing**
+ Contributions are welcome! Feel free to fork this repository and create a pull request with your improvements.
+
+ ---
+
+ ## **Acknowledgments**
+ - [Hugging Face](https://huggingface.co/) for hosting the Kokoro-82M model.
+ - The communities behind PyTorch, OpenAI, and Deepgram.
__pycache__/chat_database.cpython-310.pyc ADDED
Binary file (1.58 kB).

__pycache__/chat_database.cpython-313.pyc ADDED
Binary file (2.87 kB).

__pycache__/grpc.cpython-310.pyc ADDED
Binary file (4.17 kB).

__pycache__/grpc.cpython-313.pyc ADDED
Binary file (7.31 kB).

__pycache__/grpc_code.cpython-310.pyc ADDED
Binary file (4.18 kB).

__pycache__/istftnet.cpython-310.pyc ADDED
Binary file (16.5 kB).

__pycache__/istftnet.cpython-312.pyc ADDED
Binary file (30.6 kB).

__pycache__/istftnet.cpython-313.pyc ADDED
Binary file (30.5 kB).

__pycache__/kokoro.cpython-310.pyc ADDED
Binary file (7.49 kB).

__pycache__/kokoro.cpython-312.pyc ADDED
Binary file (13.7 kB).

__pycache__/kokoro.cpython-313.pyc ADDED
Binary file (13.8 kB).

__pycache__/models.cpython-310.pyc ADDED
Binary file (12.7 kB).

__pycache__/models.cpython-312.pyc ADDED
Binary file (25.8 kB).

__pycache__/models.cpython-313.pyc ADDED
Binary file (25.9 kB).

__pycache__/plbert.cpython-310.pyc ADDED
Binary file (957 Bytes).

__pycache__/plbert.cpython-312.pyc ADDED
Binary file (1.15 kB).

__pycache__/plbert.cpython-313.pyc ADDED
Binary file (1.22 kB).

__pycache__/queue.cpython-310.pyc ADDED
Binary file (134 Bytes).

__pycache__/text_to_speech_pb2.cpython-310.pyc ADDED
Binary file (1.67 kB).

__pycache__/text_to_speech_pb2.cpython-313.pyc ADDED
Binary file (2.27 kB).

__pycache__/text_to_speech_pb2_grpc.cpython-310.pyc ADDED
Binary file (3.17 kB).

__pycache__/text_to_speech_pb2_grpc.cpython-313.pyc ADDED
Binary file (4.43 kB).
app.py ADDED
@@ -0,0 +1,206 @@
+ from concurrent import futures
+ import torch
+ from models import build_model
+ import numpy as np
+ import re
+ import wave
+ from kokoro import generate
+ from openai import OpenAI
+ from collections import deque
+ import grpc
+ import text_to_speech_pb2
+ import text_to_speech_pb2_grpc
+ import io
+ from dotenv import load_dotenv
+ import os
+ from chat_database import save_chat_entry, get_chat_history
+
+ load_dotenv()
+
+ # Device configuration
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+ # Load the Kokoro model
+ MODEL = build_model('kokoro-v0_19.pth', device)
+
+ # Specify the voice name and load the voice pack
+ VOICE_NAME = [
+     'af',
+     'af_bella', 'af_sarah', 'am_adam', 'am_michael',
+     'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
+     'af_nicole', 'af_sky',
+ ][0]
+ VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
+
+ client = OpenAI(
+     api_key=os.getenv("OPENAI_API_KEY")
+ )
+
+ def chunk_text(text, max_chars=2040):
+     # Split text into sentence-aligned chunks no longer than max_chars.
+     sentences = re.split(r'(?<=[.!?])\s+', text)
+     chunks = []
+     current_chunk = []
+     current_length = 0
+     for sentence in sentences:
+         sentence_length = len(sentence)
+         if current_length + sentence_length <= max_chars:
+             current_chunk.append(sentence)
+             current_length += sentence_length
+         else:
+             if current_chunk:
+                 chunks.append(' '.join(current_chunk))
+             current_chunk = [sentence]
+             current_length = sentence_length
+     if current_chunk:
+         chunks.append(' '.join(current_chunk))
+     return chunks
+
+ def generate_audio_from_chunks(text, model, voicepack, voice_name):
+     # Synthesize each chunk and concatenate the audio; chunks that fail are skipped.
+     chunks = chunk_text(text)
+     combined_audio = np.array([])
+     for chunk in chunks:
+         try:
+             audio, _ = generate(model, chunk, voicepack, lang=voice_name[0])
+             combined_audio = np.concatenate([combined_audio, audio]) if combined_audio.size > 0 else audio
+         except Exception:
+             pass
+     return combined_audio
+
+ def save_audio_to_file(audio_data, file_number, sample_rate=24000):
+     filename = f"output-{file_number}.wav"
+     with wave.open(filename, 'wb') as wav_file:
+         wav_file.setnchannels(1)
+         wav_file.setsampwidth(2)
+         wav_file.setframerate(sample_rate)
+         audio_int16 = (audio_data * 32767).astype(np.int16)
+         wav_file.writeframes(audio_int16.tobytes())
+     return filename
+
+ def getResponse(text, session_id):
+     try:
+         chat_history = get_chat_history(session_id)
+         response = client.chat.completions.create(
+             model='gpt-3.5-turbo',
+             messages=chat_history,
+             stream=True
+         )
+         return response
+     except Exception as e:
+         print("Error in getResponse:", e)
+
+ def get_audio_bytes(audio_data, sample_rate=24000):
+     # Encode float audio as 16-bit PCM WAV and return the raw bytes.
+     wav_bytes = io.BytesIO()
+     with wave.open(wav_bytes, 'wb') as wav_file:
+         wav_file.setnchannels(1)
+         wav_file.setsampwidth(2)
+         wav_file.setframerate(sample_rate)
+         audio_int16 = (audio_data * 32767).astype(np.int16)
+         wav_file.writeframes(audio_int16.tobytes())
+     wav_bytes.seek(0)
+     return wav_bytes.read()
+
+ def dummy_bytes():
+     buffer = io.BytesIO()
+     dummy_data = b"This is a test of dummy byte data."
+     buffer.write(dummy_data)
+     buffer.seek(0)
+     byte_value = buffer.getvalue()
+     return byte_value
+
+
+ class TextToSpeechServicer(text_to_speech_pb2_grpc.TextToSpeechServiceServicer):
+     def ProcessText(self, request_iterator, context):
+         try:
+             print("Received new request")
+             parameters = {
+                 "processing_active": False,
+                 "queue": deque(),
+                 "file_number": 0,
+                 "session_id": "",
+                 "interrupt_seq": 0
+             }
+             for request in request_iterator:
+                 field = request.WhichOneof('request_data')
+                 if field == 'metadata':
+                     parameters["session_id"] = request.metadata.session_id
+                     continue
+                 elif field == 'text':
+                     text = request.text
+                     if not text:
+                         continue
+                     save_chat_entry(parameters["session_id"], "user", text)
+                     parameters["queue"].clear()
+                     yield text_to_speech_pb2.ProcessTextResponse(
+                         buffer=dummy_bytes(),
+                         session_id=parameters["session_id"],
+                         sequence_id="-2",
+                         transcript=text,
+                     )
+                     final_response = ""
+                     response = getResponse(text, parameters["session_id"])
+                     for chunk in response:
+                         msg = chunk.choices[0].delta.content
+                         if msg:
+                             final_response += msg
+                             if final_response.endswith(('.', '!', '?')):
+                                 parameters["file_number"] += 1
+                                 parameters["queue"].append((final_response, parameters["file_number"]))
+                                 final_response = ""
+                                 if not parameters["processing_active"]:
+                                     yield from self.process_queue(parameters)
+
+                     if final_response:
+                         parameters["file_number"] += 1
+                         parameters["queue"].append((final_response, parameters["file_number"]))
+                         if not parameters["processing_active"]:
+                             yield from self.process_queue(parameters)
+
+                 elif field == 'status':
+                     transcript = request.status.transcript
+                     played_seq = request.status.played_seq
+                     interrupt_seq = request.status.interrupt_seq
+                     parameters["interrupt_seq"] = interrupt_seq
+                     save_chat_entry(parameters["session_id"], "assistant", transcript)
+                     continue
+                 else:
+                     continue
+         except Exception as e:
+             print("Error in ProcessText:", e)
+
+     def process_queue(self, parameters):
+         try:
+             while True:
+                 if not parameters["queue"]:
+                     parameters["processing_active"] = False
+                     break
+                 parameters["processing_active"] = True
+                 sentence, file_number = parameters["queue"].popleft()
+                 # Drop sentences the client has already interrupted past.
+                 if file_number <= int(parameters["interrupt_seq"]):
+                     continue
+                 combined_audio = generate_audio_from_chunks(sentence, MODEL, VOICEPACK, VOICE_NAME)
+                 audio_bytes = get_audio_bytes(combined_audio)
+                 # filename = save_audio_to_file(combined_audio, file_number)
+                 yield text_to_speech_pb2.ProcessTextResponse(
+                     buffer=audio_bytes,
+                     session_id=parameters["session_id"],
+                     sequence_id=str(file_number),
+                     transcript=sentence,
+                 )
+         except Exception as e:
+             parameters["processing_active"] = False
+             print("Error in process_queue:", e)
+
+
+ def serve():
+     print("Starting gRPC server...")
+     server = grpc.server(futures.ThreadPoolExecutor(max_workers=1))
+     text_to_speech_pb2_grpc.add_TextToSpeechServiceServicer_to_server(TextToSpeechServicer(), server)
+     server.add_insecure_port('[::]:8081')
+     server.start()
+     print("gRPC server is running on port 8081")
+     server.wait_for_termination()
+
+
+ if __name__ == "__main__":
+     serve()
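For reference, the gRPC stream above can be exercised without the Node backend. Below is a minimal client sketch (hypothetical; no such file ships in this commit) that assumes the `text_to_speech_pb2`/`text_to_speech_pb2_grpc` stubs generated from `backend/handle-realtime-tts/text_to_speech.proto` and the server running on `localhost:8081`:

```python
# Hypothetical smoke-test client for the ProcessText bidirectional stream in app.py.
import grpc
import text_to_speech_pb2
import text_to_speech_pb2_grpc

def request_stream():
    # The server expects a metadata message first, then one or more text messages.
    yield text_to_speech_pb2.ProcessTextRequest(
        metadata=text_to_speech_pb2.Meta(session_id="demo-session"))
    yield text_to_speech_pb2.ProcessTextRequest(text="Hello there!")

def main():
    with grpc.insecure_channel("localhost:8081") as channel:
        stub = text_to_speech_pb2_grpc.TextToSpeechServiceStub(channel)
        for response in stub.ProcessText(request_stream()):
            # sequence_id "-2" echoes the recognized text; positive ids carry WAV audio.
            print(response.session_id, response.sequence_id, len(response.buffer))

if __name__ == "__main__":
    main()
```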
backend/.DS_Store ADDED
Binary file (6.15 kB).
 
backend/.gitignore ADDED
@@ -0,0 +1,2 @@
+ /node_modules
+ .DS_Store
backend/app.js ADDED
@@ -0,0 +1,22 @@
+ const express = require("express");
+ const cors = require("cors");
+ const bodyParser = require("body-parser");
+ const app = express();
+ require("express-ws")(app);
+ app.use(express.json());
+ app.use(cors());
+ app.use(express.urlencoded({ extended: true }));
+ app.use(bodyParser.json());
+ const port = 8080;
+
+ const { audio_stream } = require("./handle-realtime-tts/sttModelSocket.js");
+
+ app.get("/health", (req, res) => {
+   res.send("Green");
+ });
+
+ app.ws("/v2v", audio_stream);
+
+ app.listen(port, () => {
+   console.log(`Example app listening at http://localhost:${port}`);
+ });
backend/config.env ADDED
@@ -0,0 +1 @@
+ DEEPGRAM_KEY = <deepgram_api_key>
backend/config.js ADDED
@@ -0,0 +1,7 @@
+ require("dotenv").config({ path: "./config.env" });
+
+ const deepgram_key = process.env.DEEPGRAM_KEY;
+
+ module.exports = {
+   deepgram_key
+ };
backend/handle-realtime-tts/cleangRPCconnections.js ADDED
@@ -0,0 +1,47 @@
+ const cleanupConnection = async (session) => {
+   try {
+     if (session.channel) {
+       const state = session.channel.getConnectivityState(false);
+       console.log(`Client : ${state}`);
+       if (state !== 4) {
+         console.log("Closing call and client.");
+         session.client.close();
+         session.call.end();
+         session.client = null;
+         session.call = null;
+       }
+     } else {
+       try {
+         if (session.client) {
+           session.client.close();
+           if (session.call) {
+             session.call.end();
+           }
+           session.call = null;
+           session.client = null;
+           session.channel = null;
+         }
+       } catch (err) {
+         session.call = null;
+         session.client = null;
+         session.channel = null;
+       }
+     }
+     console.log("gRPC connection ended.");
+   } catch (err) {
+     if (session.call) {
+       session.call.end();
+     }
+     session.call = null;
+     console.log("Error ending gRPC connection: ", err);
+   } finally {
+     if (session.call) {
+       session.call.end();
+     }
+     session.call = null;
+     session.client = null;
+     session.channel = null;
+   }
+ };
+
+ module.exports = { cleanupConnection };
backend/handle-realtime-tts/makegRPCconnection.js ADDED
@@ -0,0 +1,40 @@
+ const grpc = require("@grpc/grpc-js");
+ const protoLoader = require("@grpc/proto-loader");
+ const path = require("path");
+
+ const getgRPCConnection = (session) => {
+   return new Promise((resolve, reject) => {
+     protoLoader
+       .load(path.join(__dirname, "text_to_speech.proto"), {
+         keepCase: true,
+         longs: String,
+         enums: String,
+         defaults: true,
+         oneofs: true,
+       })
+       .then((packageDefinition) => {
+         const textToSpeechProto = grpc.loadPackageDefinition(packageDefinition).texttospeech;
+         const client = new textToSpeechProto.TextToSpeechService(
+           "localhost:8081",
+           grpc.credentials.createInsecure()
+         );
+         session.client = client;
+         session.channel = client.getChannel();
+         console.log("Made connection");
+
+         const call = client.ProcessText();
+         resolve(call);
+       })
+       .catch((error) => {
+         session.client = null;
+         console.error("Error loading proto file:", error);
+         reject(new Error("Error in making gRPC Connection."));
+       });
+   });
+ };
+
+ module.exports = { getgRPCConnection };
backend/handle-realtime-tts/sttModelSocket.js ADDED
@@ -0,0 +1,289 @@
+ const isBuffer = require("is-buffer");
+ const { Buffer } = require("buffer");
+ const { deepgram_key } = require("../config");
+ const Session = require("../utils/session.js");
+ const { cleanupConnection } = require("./cleangRPCconnections.js");
+ const { getgRPCConnection } = require("./makegRPCconnection.js");
+ const { updateChathistory } = require("../providers/updateChathistory.js");
+ const { createClient, LiveTranscriptionEvents } = require("@deepgram/sdk");
+ const deepgram = createClient(deepgram_key);
+
+ const audio_stream = async (wss, req) => {
+   try {
+     const session = new Session();
+
+     wss.send(JSON.stringify({ type: "initial", msg: "connected" }));
+
+     const connection = deepgram.listen.live({
+       punctuate: true,
+       interim_results: true,
+       speech_final: true,
+       encoding: "linear16",
+       sample_rate: 16000,
+       model: "nova-2",
+       version: "latest",
+     });
+
+     const callMLServer = async (text) => {
+       try {
+         session.call.write({ text: text });
+       } catch (error) {
+         console.error("Error in calling ml server : ", error);
+       }
+     };
+
+     connection.on(LiveTranscriptionEvents.Open, () => {
+       console.log(LiveTranscriptionEvents.Open);
+       connection.on(LiveTranscriptionEvents.Close, () => {
+         console.log("Connection closed.");
+       });
+
+       connection.on(LiveTranscriptionEvents.Transcript, (data) => {
+         const text = data?.channel?.alternatives[0]?.transcript;
+         if (data.is_final && data.speech_final && text) {
+           console.log("Response : ", text);
+           callMLServer(text);
+         }
+       });
+
+       connection.on(LiveTranscriptionEvents.Metadata, (data) => {
+         console.log(data);
+       });
+
+       connection.on(LiveTranscriptionEvents.Error, (err) => {
+         console.error(err);
+       });
+     });
+
+     wss.on("message", async (message) => {
+       try {
+         if (isBuffer(message) && session.call) {
+           // Forward raw audio buffers to Deepgram for live transcription.
+           try {
+             if (connection && connection.getReadyState() == 1) {
+               connection.send(message);
+             }
+           } catch (error) {
+             console.log("Error sending buffer to deepgram : ", error);
+           }
+         }
+
+         // Handle messages that are not buffers (JSON control messages).
+         if (typeof message === "string") {
+           try {
+             const data = JSON.parse(message);
+             const { type, msg } = data;
+
+             switch (type) {
+               case "start":
+                 session.starttime = Date.now();
+                 session.chathistory = [];
+                 session.chathistorybackup = [];
+                 console.log("Making Connection with gRPC...");
+                 try {
+                   console.time("grpcconnection");
+                   session.call = await getgRPCConnection(session);
+                   console.timeEnd("grpcconnection");
+                   const state = session.channel.getConnectivityState(false);
+                   console.log(`Client : ${state}`);
+                   session.saved = false;
+                   wss.send(JSON.stringify({ type: "ready", msg: "connected" }));
+                   console.log("Connected to gRPC.");
+
+                   const { sessionId } = JSON.parse(msg);
+                   const metadata = {
+                     metadata: {
+                       session_id: sessionId,
+                     },
+                   };
+                   if (session.call) {
+                     console.log("Sending metadata.");
+                     session.call.write(metadata);
+                   }
+                 } catch (err) {
+                   await cleanupConnection(session);
+                   console.error("Error in making gRPC Connection. : ", err);
+                 }
+                 session.call.on("data", (response) => {
+                   console.log("Data : ", response);
+
+                   const { session_id, sequence_id, transcript, buffer } = response;
+
+                   const metadata = JSON.stringify({
+                     session_id: session_id,
+                     sequence_id: sequence_id,
+                     transcript: transcript,
+                   });
+
+                   if (sequence_id === "-2") {
+                     session.latency = Date.now();
+                     wss.send(JSON.stringify({ type: "clear", msg: "clear" }));
+                     session.chathistory = [...session.chathistorybackup];
+                     wss.send(
+                       JSON.stringify({
+                         type: "chathistory",
+                         msg: session.chathistorybackup,
+                       })
+                     );
+                     const wavBuffer = Buffer.concat([
+                       Buffer.from(metadata),
+                       Buffer.from([0]),
+                       buffer,
+                     ]);
+
+                     const base64buffer = wavBuffer.toString("base64");
+                     wss.send(JSON.stringify({ type: "media", msg: base64buffer }));
+                     session.chathistory.push({
+                       speaker: "USER",
+                       content: transcript,
+                     });
+                     wss.send(
+                       JSON.stringify({
+                         type: "chathistory",
+                         msg: session.chathistory,
+                       })
+                     );
+                     session.chathistorybackup.push({
+                       speaker: "USER",
+                       content: transcript,
+                     });
+                     return;
+                   }
+
+                   if (sequence_id === "0") {
+                     wss.send(JSON.stringify({ type: "pause", msg: "pause" }));
+                     session.cansend = false;
+                     return;
+                   }
+
+                   if (sequence_id === "-1") {
+                     wss.send(JSON.stringify({ type: "continue", msg: "continue" }));
+                     return;
+                   }
+
+                   if (sequence_id === "1") {
+                     const latency = Date.now() - session.latency;
+                     console.log("First Response Latency: ", latency, "ms");
+                     session.latency = 0;
+                     session.cansend = true;
+                   }
+
+                   if (!buffer) {
+                     return;
+                   }
+
+                   if (!session.cansend && sequence_id !== "0") {
+                     return;
+                   }
+
+                   // Combine JSON metadata, a null-byte separator, and the PCM data into a single buffer.
+                   const wavBuffer = Buffer.concat([
+                     Buffer.from(metadata),
+                     Buffer.from([0]),
+                     buffer,
+                   ]);
+
+                   const base64buffer = wavBuffer.toString("base64");
+                   wss.send(JSON.stringify({ type: "media", msg: base64buffer }));
+
+                   updateChathistory(transcript, false, session);
+
+                   wss.send(
+                     JSON.stringify({
+                       type: "chathistory",
+                       msg: session.chathistory,
+                     })
+                   );
+                 });
+
+                 session.call.on("end", async () => {
+                   console.log("Ended");
+                   await cleanupConnection(session);
+                   try {
+                     wss.send(JSON.stringify({ type: "end", msg: "end" }));
+                   } catch (err) { }
+                   console.log("Stream ended");
+                 });
+
+                 session.call.on("error", async (error) => {
+                   console.error(`Stream error: ${error}`);
+                   try {
+                     wss.send(JSON.stringify({ type: "end", msg: "end" }));
+                   } catch (err) { }
+                   await cleanupConnection(session);
+                 });
+                 break;
+
+               case "status":
+                 const { session_id, sequence_id, transcript } = msg;
+                 const status = {
+                   status: {
+                     transcript: transcript,
+                     played_seq: sequence_id,
+                     interrupt_seq: sequence_id,
+                   },
+                 };
+
+                 if (session.call) {
+                   session.call.write(status);
+                 }
+
+                 updateChathistory(transcript, true, session);
+                 break;
+
+               case "stop":
+                 console.log("Client stopped the stream.");
+                 await cleanupConnection(session);
+                 break;
+               default:
+                 console.log("Type not handled.");
+             }
+           } catch (err) {
+             console.log(`Not a valid json : ${err}`);
+           }
+         }
+       } catch (err) {
+         console.error(`Error in wss.onmessage : ${err}`);
+       }
+     });
+
+     wss.on("close", async () => {
+       await cleanupConnection(session);
+       console.log("WebSocket connection closed.");
+     });
+
+     wss.on("error", async (err) => {
+       console.error(`WebSocket error: ${err}`);
+       await cleanupConnection(session);
+     });
+   } catch (err) {
+     try {
+       console.log(err);
+       wss.send(JSON.stringify({ type: "end", msg: "end" }));
+     } catch (err) { }
+   }
+ };
+
+ module.exports = { audio_stream };
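The `media` frames built above concatenate a JSON metadata header, a single null byte as separator, and the WAV payload, then base64-encode the result. The React client that consumes them is not part of this commit; as a rough sketch of the decoding any consumer would have to do, assuming only the framing visible above:

```python
# Sketch: split one base64 "media" frame back into metadata and WAV bytes.
import base64
import json

def decode_media_frame(msg_b64: str):
    raw = base64.b64decode(msg_b64)
    sep = raw.index(0)                 # first null byte terminates the metadata
    metadata = json.loads(raw[:sep])   # {"session_id", "sequence_id", "transcript"}
    wav_bytes = raw[sep + 1:]          # playable WAV payload
    return metadata, wav_bytes
```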
backend/handle-realtime-tts/text_to_speech.proto ADDED
@@ -0,0 +1,32 @@
+ syntax = "proto3";
+
+ package texttospeech;
+
+ service TextToSpeechService {
+   rpc ProcessText (stream ProcessTextRequest) returns (stream ProcessTextResponse);
+ }
+
+ message ProcessTextRequest {
+   oneof request_data {
+     string text = 1;
+     Meta metadata = 2;
+     Status status = 3;
+   }
+ }
+
+ message ProcessTextResponse {
+   bytes buffer = 1;
+   string session_id = 2;
+   string sequence_id = 3;
+   string transcript = 4;
+ }
+
+ message Meta {
+   string session_id = 1;
+ }
+
+ message Status {
+   string transcript = 1;
+   string played_seq = 2;
+   string interrupt_seq = 3;
+ }
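The `text_to_speech_pb2` and `text_to_speech_pb2_grpc` modules imported by `app.py` are generated from this file. A sketch of regenerating them with `grpcio-tools` (assuming it is installed and the script is run from the repository root):

```python
# Regenerate the Python gRPC stubs used by app.py from text_to_speech.proto.
from grpc_tools import protoc

protoc.main([
    "protoc",
    "-Ibackend/handle-realtime-tts",
    "--python_out=.",
    "--grpc_python_out=.",
    "backend/handle-realtime-tts/text_to_speech.proto",
])
```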
backend/package-lock.json ADDED
The diff for this file is too large to render. See raw diff
 
backend/package.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "name": "backend",
+   "version": "1.0.0",
+   "description": "",
+   "main": "index.js",
+   "scripts": {
+     "test": "echo \"Error: no test specified\" && exit 1"
+   },
+   "keywords": [],
+   "author": "",
+   "license": "ISC",
+   "dependencies": {
+     "@deepgram/sdk": "^3.9.0",
+     "@geckos.io/server": "^3.0.0",
+     "@grpc/grpc-js": "^1.11.3",
+     "axios": "^1.7.9",
+     "bcryptjs": "^2.4.3",
+     "cors": "^2.8.5",
+     "crypto": "^1.0.1",
+     "dotenv": "^16.4.5",
+     "express": "^4.21.0",
+     "express-ws": "^5.0.2",
+     "is-buffer": "^2.0.5",
+     "jsonwebtoken": "^9.0.2",
+     "module": "^1.2.5"
+   }
+ }
backend/providers/updateChathistory.js ADDED
@@ -0,0 +1,46 @@
+ const updateChathistory = (transcript, backup, session) => {
+   try {
+     if (backup) {
+       if (
+         session.chathistorybackup.length > 0 &&
+         session.chathistorybackup[session.chathistorybackup.length - 1].speaker === "USER"
+       ) {
+         session.chathistorybackup.push({
+           speaker: "AI",
+           content: ``,
+         });
+       }
+       if (
+         session.chathistory &&
+         session.chathistory.length > 0 &&
+         session.chathistory[session.chathistory.length - 1].speaker === "AI"
+       ) {
+         session.chathistorybackup[session.chathistorybackup.length - 1].content += ` ${transcript}`;
+       }
+     } else if (!backup) {
+       if (
+         session.chathistory.length > 0 &&
+         session.chathistory[session.chathistory.length - 1].speaker === "USER"
+       ) {
+         session.chathistory.push({ speaker: "AI", content: `` });
+       }
+       if (
+         session.chathistory &&
+         session.chathistory.length > 0 &&
+         session.chathistory[session.chathistory.length - 1].speaker === "AI"
+       ) {
+         session.chathistory[session.chathistory.length - 1].content += ` ${transcript}`;
+       }
+     }
+   } catch (error) {
+     console.log("Error in updating chathistory : ", error);
+   }
+ };
+
+ module.exports = { updateChathistory };
backend/utils/session.js ADDED
@@ -0,0 +1,15 @@
+ class Session {
+   constructor() {
+     this.call = null;
+     this.client = null;
+     this.channel = null;
+     this.saved = false;
+     this.starttime = Date.now();
+     this.cansend = false;
+     this.chathistory = [];
+     this.chathistorybackup = [];
+     this.latency = 0;
+   }
+ }
+
+ module.exports = Session;
chat_database.py ADDED
@@ -0,0 +1,64 @@
+ import pickle
+ import os
+
+ FILE_PATH = "chat_history.pkl"
+
+ if not os.path.exists(FILE_PATH):
+     with open(FILE_PATH, "wb") as file:
+         pickle.dump({}, file)
+
+ def save_chat_entry(session_id, role, transcript):
+     try:
+         with open(FILE_PATH, "rb") as file:
+             data = pickle.load(file)
+
+         if session_id not in data:
+             data[session_id] = []
+
+         if role == "user":
+             data[session_id].append({
+                 "role": role,
+                 "transcript": transcript
+             })
+         else:
+             # Consecutive assistant chunks are merged into a single turn.
+             if data[session_id] and data[session_id][-1]['role'] == "assistant":
+                 data[session_id][-1]['transcript'] += " " + transcript
+             else:
+                 data[session_id].append({
+                     "role": role,
+                     "transcript": transcript
+                 })
+
+         with open(FILE_PATH, "wb") as file:
+             pickle.dump(data, file)
+
+     except Exception as e:
+         print(f"Error saving chat entry: {e}")
+
+
+ def get_chat_history(session_id):
+     try:
+         with open(FILE_PATH, "rb") as file:
+             data = pickle.load(file)
+
+         chat_history = data.get(session_id, [])
+
+         if not chat_history:
+             return []
+
+         message_history = []
+         for entry in chat_history:
+             role = entry.get('role', '')
+             transcript = entry.get('transcript', '')
+             if role and transcript:
+                 message_history.append({"role": role, "content": transcript})
+
+         return message_history
+
+     except (FileNotFoundError, pickle.UnpicklingError) as e:
+         print(f"Error reading or parsing the file: {e}")
+         return []
+     except Exception as e:
+         print(f"Unexpected error: {e}")
+         return []
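A short usage sketch of the two helpers above (the session id is arbitrary): histories are keyed by session id, consecutive assistant entries are merged into one turn, and `get_chat_history` returns OpenAI-style messages:

```python
from chat_database import save_chat_entry, get_chat_history

save_chat_entry("demo-session", "user", "What is Kokoro?")
save_chat_entry("demo-session", "assistant", "Kokoro is an 82M-parameter TTS model.")
save_chat_entry("demo-session", "assistant", "It is Apache licensed.")  # merged into the previous assistant turn

print(get_chat_history("demo-session"))
# [{'role': 'user', 'content': 'What is Kokoro?'},
#  {'role': 'assistant', 'content': 'Kokoro is an 82M-parameter TTS model. It is Apache licensed.'}]
```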
chat_history.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d229c43b598eacb4620b5bf033308ff27c1e9979af506afd1e58d7e6ba24c9da
+ size 12508
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "decoder": {
+     "type": "istftnet",
+     "upsample_kernel_sizes": [20, 12],
+     "upsample_rates": [10, 6],
+     "gen_istft_hop_size": 5,
+     "gen_istft_n_fft": 20,
+     "resblock_dilation_sizes": [
+       [1, 3, 5],
+       [1, 3, 5],
+       [1, 3, 5]
+     ],
+     "resblock_kernel_sizes": [3, 7, 11],
+     "upsample_initial_channel": 512
+   },
+   "dim_in": 64,
+   "dropout": 0.2,
+   "hidden_dim": 512,
+   "max_conv_dim": 512,
+   "max_dur": 50,
+   "multispeaker": true,
+   "n_layer": 3,
+   "n_mels": 80,
+   "n_token": 178,
+   "style_dim": 128
+ }
demo/HEARME.txt ADDED
@@ -0,0 +1,47 @@
+ Kokoro is a frontier TTS model for its size of 82 million parameters.
+
+ On the 25th of December, 2024, Kokoro v0 point 19 weights were permissively released in full fp32 precision along with 2 voicepacks (Bella and Sarah), all under an Apache 2 license.
+
+ At the time of release, Kokoro v0 point 19 was the number 1 ranked model in TTS Spaces Arena. With 82 million parameters trained for under 20 epics on under 100 total hours of audio, Kokoro achieved higher Eelo in this single-voice Arena setting, over larger models. Kokoro's ability to top this Eelo ladder using relatively low compute and data, suggests that the scaling law for traditional TTS models might have a steeper slope than previously expected.
+
+ Licenses. Apache 2 weights in this repository. MIT inference code. GPLv3 dependency in espeak NG.
+
+ The inference code was originally MIT licensed by the paper author. Note that this card applies only to this model, Kokoro.
+
+ Evaluation. Metric: Eelo rating. Leaderboard: TTS Spaces Arena.
+
+ The voice ranked in the Arena is a 50 50 mix of Bella and Sarah. For your convenience, this mix is included in this repository as A-F dot PT, but you can trivially re-produce it.
+
+ Training Details.
+
+ Compute: Kokoro was trained on "A100 80GB v-ram instances" rented from Vast.ai. Vast was chosen over other compute providers due to its competitive on-demand hourly rates. The average hourly cost for the A100 80GB v-ram instances used for training was below $1 per hour per GPU, which was around half the quoted rates from other providers at the time.
+
+ Data: Kokoro was trained exclusively on permissive non-copyrighted audio data and IPA phoneme labels. Examples of permissive non-copyrighted audio include:
+
+ Public domain audio. Audio licensed under Apache, MIT, etc.
+
+ Synthetic audio[1] generated by closed[2] TTS models from large providers.
+
+ Epics: Less than 20 Epics. Total Dataset Size: Less than 100 hours of audio.
+
+ Limitations. Kokoro v0 point 19 is limited in some ways, in its training set and architecture:
+
+ Lacks voice cloning capability, likely due to small, under 100 hour training set.
+
+ Relies on external g2p, which introduces a class of g2p failure modes.
+
+ Training dataset is mostly long-form reading and narration, not conversation.
+
+ At 82 million parameters, Kokoro almost certainly falls to a well-trained 1B+ parameter diffusion transformer, or a many-billion-parameter M LLM like GPT 4o or Gemini 2 Flash.
+
+ Multilingual capability is architecturally feasible, but training data is almost entirely English.
+
+ Will the other voicepacks be released?
+
+ There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo.
+
+ Acknowledgements. yL4 5 7 9 for architecting StyleTTS 2.
+
+ Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.
+
+ Model Card Contact. @rzvzn on Discord.
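(The spellings above, such as "Eelo" and "epics", are deliberate: this file is the script read aloud in HEARME.wav.) The card notes the Arena voice is a 50 50 mix of Bella and Sarah that you can trivially re-produce. A hypothetical sketch of one way to do that, assuming voicepacks are plain tensors of identical shape, as they are loaded in `app.py`:

```python
# Hypothetical sketch: build a 50/50 Bella/Sarah mix by averaging the voicepacks.
import torch

bella = torch.load("voices/af_bella.pt", weights_only=True)
sarah = torch.load("voices/af_sarah.pt", weights_only=True)
mix = torch.mean(torch.stack([bella, sarah]), dim=0)
torch.save(mix, "voices/af.pt")
```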
demo/HEARME.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a66b2d1022df066f39773d650ff6c523d635407e88054c02066a821d50246800
+ size 3407872
demo/TTS-Spaces-Arena-25-Dec-2024.png ADDED
Git LFS Details
  • SHA256: e78b5ec1557323fa0e62681c83f6b81777f9834b91bbf26bf7567b036f011d52
  • Pointer size: 132 Bytes
  • Size of remote file: 1.07 MB
demo/af_sky.txt ADDED
@@ -0,0 +1,11 @@
+ Last September, I received an offer from Sam Altman, who wanted to hire me to voice the current ChatGPT 4 system. He told me that he felt that by my voicing the system, I could bridge the gap between tech companies and creatives and help consumers to feel comfortable with the seismic shift concerning humans and AI. He said he felt that my voice would be comforting to people.
+
+ After much consideration and for personal reasons, I declined the offer. Nine months later, my friends, family and the general public all noted how much the newest system named Sky sounded like me.
+
+ When I heard the released demo, I was shocked, angered and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news ou'tlits could not tell the difference. Mr. Altman even insinuated that the similarity was intentional, tweeting a single word — hur — a reference to the film in which I voiced a chat system, Samantha, who forms an intimate relationship with a human.
+
+ Two days before the ChatGPT 4 demo was released, Mr. Altman contacted my agent, asking me to reconsider. Before we could connect, the system was out there.
+
+ As a result of their actions, I was forced to hire legal counsel, who wrote two letters to Mr. Altman and OpenAI, setting out what they had done and asking them to detail the exact process by which they created the Sky voice. Consequently, OpenAI reluctantly agreed to take down the Sky voice.
+
+ In a time when we are all grappling with deepfakes and the protection of our own likeness, our own work, our own identities, I believe these are questions that deserve absolute clarity. I look forward to resolution in the form of transparency and the passage of appropriate legislation to help ensure that individual rights are protected.
demo/af_sky.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:96491c978f85727b49b638beb94ad591a3830ce4a657db02d740acd61ec4322c
+ size 3407872
demo/restoring-sky.md ADDED
@@ -0,0 +1,42 @@
+ # Restoring Sky & reflecting on Kokoro
+
+ <img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />
+
+ For those who don't know, [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) is an Apache TTS model that uses a skinny version of the open [StyleTTS 2](https://github.com/yl4579/StyleTTS2/tree/main) architecture.
+
+ Based on leaderboard [Elo rating](https://huggingface.co/hexgrad/Kokoro-82M#evaluation) (prior to getting [review bombed](https://huggingface.co/datasets/Pendrokar/TTS_Arena/discussions/2)), Kokoro appears to do more with less, a theme that is surely [top-of-mind](https://huggingface.co/deepseek-ai/DeepSeek-V3) for many. Its peak performance on specific voices is comparable or better than much larger models, but it has not yet been trained on enough data to effectively zero-shot out of distribution (aka voice cloning).
+
+ Tonight on NYE, `af_sky` joins Kokoro's roster of downloadable voices. This follows last night's quiet release of `af_nicole`, and an additional 8 voices are currently available: 2F 2M voices each for American & British English.
+
+ Nicole in particular was trained on ~10 hours of synthetic data, and demonstrates that you _can_ include unique speaking styles in a general-purpose TTS model without affecting the stock voices (even in a low data small model): a good sign for scalability.
+
+ Sky is interesting because it is the voice that ScarJo [got OpenAI to take down](https://x.com/OpenAI/status/1792443575839678909), so new training data cannot be generated. However, OpenAI did not remove 2023 samples of Sky from their [blog post](https://openai.com/index/chatgpt-can-now-see-hear-and-speak/), and along with a few seconds lying around various other parts of the internet, we can cobble together about 3 minutes of 2023 Sky.
+
+ ```sh
+ wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/story-sky.mp3
+ wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/recipe-sky.mp3
+ wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/speech-sky.mp3
+ wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/poem-sky.mp3
+ wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/info-sky.mp3
+ ```
+
+ To be clear, this is not the first attempt to reconstruct Sky. On X, Benjamin De Kraker posted:
+ > Here's the official statement released by Scarlett Johansson, detailing OpenAI's alleged illegal usage of her voice...
+ > ...read by the Sky AI voice, because irony.
+ > https://x.com/BenjaminDEKR/status/1792693868497871086
+
+ and in the replies, he [stated](https://x.com/BenjaminDEKR/status/1792714347275501595):
+ > It's an ElevenLabs clone I made based on Sky audio before they removed it. Not perfect.
+
+ Here is `Kokoro/af_sky`'s rendition of the same:
+ <audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/af_sky.wav" type="audio/wav"></audio>
+
+ A crude reconstruction, but the model that produced that voice is Apache FOSS that can be downloaded from HF and run locally. You can reproduce the above by dragging the [text script](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/demo/af_sky.txt) (note a handful of modified chars for better delivery) into the "Long Form" tab of this [hosted demo](https://huggingface.co/spaces/hexgrad/Kokoro-TTS), or you can download the [model weights](https://huggingface.co/hexgrad/Kokoro-82M), install dependencies and DIY.
+
+ Sky shows that it is possible to reconstruct a voice—maybe a shadow of its former self, but a reconstruction nonetheless—from fairly little training data.
+
+ ### What's next
+
+ Kokoro is a good start, but I can think of some tricks that might make it better, beginning with better data. More on this in another article.
+
+ Feel free to check out [Kokoro's weights](https://huggingface.co/hexgrad/Kokoro-82M), try out a no-install [hosted demo](https://huggingface.co/spaces/hexgrad/Kokoro-TTS), and/or [join the Discord](https://discord.gg/QuGxSWBfQy).