# OpenedAI Speech
An OpenAI API-compatible text-to-speech server.
* Compatible with the OpenAI audio/speech API
* Serves the [/v1/audio/speech endpoint](https://platform.openai.com/docs/api-reference/audio/createSpeech)
* Not affiliated with OpenAI in any way, does not require an OpenAI API Key
* A free, private, text-to-speech server with custom voice cloning
Full Compatibility:
* `tts-1`: `alloy`, `echo`, `fable`, `onyx`, `nova`, and `shimmer` (configurable)
* `tts-1-hd`: `alloy`, `echo`, `fable`, `onyx`, `nova`, and `shimmer` (configurable, uses OpenAI samples by default)
* response_format: `mp3`, `opus`, `aac`, `flac`, `wav` and `pcm`
* speed 0.25-4.0 (and more)
Details:
* Model `tts-1` via [piper tts](https://github.com/rhasspy/piper) (very fast, runs on CPU)
* You can map your own [piper voices](https://rhasspy.github.io/piper-samples/) via the `voice_to_speaker.yaml` configuration file
* Model `tts-1-hd` via [coqui-ai/TTS](https://github.com/coqui-ai/TTS) xtts_v2 voice cloning (fast, but requires around 4GB GPU VRAM)
* Custom cloned voices can be used for tts-1-hd, see: [Custom Voices Howto](#custom-voices-howto)
* [Multilingual](#multilingual) support with XTTS voices; the language is automatically detected if not set
* [Custom fine-tuned XTTS model support](#custom-fine-tuned-model-support)
* Configurable [generation parameters](#generation-parameters)
* Streamed output while generating
* Occasionally, certain words or symbols may sound incorrect; you can fix them with regex via `pre_process_map.yaml` (see the example below)
* Tested with Python 3.9-3.11; piper does not install on Python 3.12 yet
If you find a better voice match for `tts-1` or `tts-1-hd`, please let me know so I can update the defaults.
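The entries in `pre_process_map.yaml` are regex/replacement pairs applied to the input text before generation. As a hypothetical sketch (this entry is not part of the default config), you could make "TTS" be spoken letter by letter:
```yaml
# hypothetical example: each entry is a [regex, replacement] pair
- - 'TTS'
  - 'T T S'
```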
## Recent Changes
Version 0.17.2, 2024-07-01
* fix -min image (re: langdetect)
Version 0.17.1, 2024-07-01
* fix ROCm (add langdetect to requirements-rocm.txt)
* Fix zh-cn for xtts
Version 0.17.0, 2024-07-01
* Automatic language detection, thanks [@RodolfoCastanheira](https://github.com/RodolfoCastanheira)
Version 0.16.0, 2024-06-29
* Multi-client safe version. Audio generation is synchronized in a single process. The estimated 'realtime' factor of XTTS on a GPU is roughly 1/3, so multiple simultaneous streams, or `speed` over 2, may experience audio underrun (delays or pauses in playback). This makes multiple clients possible and safe, but in practice 2 or 3 simultaneous streams is the maximum without audio underrun.
Version 0.15.1, 2024-06-27
* Remove deepspeed from requirements.txt, it's too complex for typical users. A more detailed deepspeed install document will be required.
Version 0.15.0, 2024-06-26
* Switch to [coqui-tts](https://github.com/idiap/coqui-ai-TTS) (updated fork), updated simpler dependencies, torch 2.3, etc.
* Resolve cuda threading issues
Version 0.14.1, 2024-06-26
* Make deepspeed possible (`--use-deepspeed`), but not enabled in pre-built docker images (too large). Requires the cuda-toolkit installed, see the Dockerfile comment for details
Version 0.14.0, 2024-06-26
* Added `response_format`: `wav` and `pcm` support
* Output streaming (while generating) for `tts-1` and `tts-1-hd`
* Enhanced [generation parameters](#generation-parameters) for xtts models (temperature, top_p, etc.)
* Idle unload timer (optional) - doesn't work perfectly yet
* Improved error handling
Version 0.13.0, 2024-06-25
* Added [Custom fine-tuned XTTS model support](#custom-fine-tuned-model-support)
* Initial prebuilt arm64 image support (Apple M-series, Raspberry Pi - MPS is not supported in XTTS/torch), thanks [@JakeStevenson](https://github.com/JakeStevenson), [@hchasens](https://github.com/hchasens)
* Initial attempt at AMD GPU (ROCm 5.7) support
* Parler-tts support removed
* Move the *.default.yaml to the root folder
* Run the docker as a service by default (`restart: unless-stopped`)
* Added `audio_reader.py` for streaming text input and reading long texts
Version 0.12.3, 2024-06-17
* Additional logging details for BadRequests (400)
Version 0.12.2, 2024-06-16
* Fix :min image requirements (numpy<2?)
Version 0.12.0, 2024-06-16
* Improved error handling and logging
* Restore the original alloy tts-1-hd voice by default, use alloy-alt for the old voice.
Version 0.11.0, 2024-05-29
* [Multilingual](#multilingual) support (16 languages) with XTTS
* Remove high Unicode filtering from the default `config/pre_process_map.yaml`
* Update Docker build & app startup. thanks @justinh-rahb
* Fix: "Plan failed with a cudnnException"
* Remove piper cuda support
Version: 0.10.1, 2024-05-05
* Remove `runtime: nvidia` from docker-compose.yml, this assumes nvidia/cuda compatible runtime is available by default. thanks [@jmtatsch](https://github.com/jmtatsch)
Version: 0.10.0, 2024-04-27
* Pre-built & tested docker images, smaller docker images (8GB or 860MB)
* Better upgrades: reorganize config files under `config/`, voice models under `voices/`
* **Compatibility!** If you customized your `voice_to_speaker.yaml` or `pre_process_map.yaml` you need to move them to the `config/` folder.
* default listen host to 0.0.0.0
Version: 0.9.0, 2024-04-23
* Fix bug with yaml and loading UTF-8
* New sample text-to-speech application `say.py`
* Smaller docker base image
* Add beta [parler-tts](https://huggingface.co/parler-tts/parler_tts_mini_v0.1) support (you can describe very basic features of the speaker voice). See (https://www.text-description-to-speech.com/) for some examples of how to describe voices. Voices can be defined in `voice_to_speaker.default.yaml`, which includes two example parler-tts voices. `parler-tts` is experimental software and is kind of slow; the exact voice will be slightly different each generation, but should be similar to the basic description.
...
Version: 0.7.3, 2024-03-20
* Allow different xtts versions per voice in `voice_to_speaker.yaml`, ex. xtts_v2.0.2
* Quality: Fix xtts sample rate (24000 vs. 22050 for piper) and pops
## Installation instructions
### Create a `speech.env` environment file
Copy the `sample.env` to `speech.env` (customize if needed)
```bash
cp sample.env speech.env
```
#### Defaults
```bash
# model storage location (Hugging Face cache and voice models)
TTS_HOME=voices
HF_HOME=voices
# optionally preload an XTTS model at startup instead of on first use
#PRELOAD_MODEL=xtts
#PRELOAD_MODEL=xtts_v2.0.2
# extra command line arguments passed to the server (see Server Options)
#EXTRA_ARGS=--log-level DEBUG --unload-timer 300
# enable AMD ROCm support
#USE_ROCM=1
```
### Option A: Manual installation
```shell
# install curl and ffmpeg
sudo apt install curl ffmpeg
# Create & activate a new virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate
# Install the Python requirements
# - use requirements-rocm.txt for AMD GPU (ROCm support)
# - use requirements-min.txt for piper only (CPU only)
pip install -U -r requirements.txt
# run the server
bash startup.sh
```
> On first run, the voice models will be downloaded automatically. This might take a while depending on your network connection.
### Option B: Docker Image (*recommended*)
#### Nvidia GPU (cuda)
```shell
docker compose up
```
#### AMD GPU (ROCm support)
```shell
docker compose -f docker-compose.rocm.yml up
```
#### ARM64 (Apple M-series, Raspberry Pi)
> XTTS only has CPU support here and will be very slow; you can use the Nvidia image to run XTTS on CPU (slow), or use the piper-only image (recommended).
#### CPU only, No GPU (piper only)
> For a minimal docker image with only piper support (<1GB vs. 8GB).
```shell
docker compose -f docker-compose.min.yml up
```
## Server Options
```shell
usage: speech.py [-h] [--xtts_device XTTS_DEVICE] [--preload PRELOAD] [--unload-timer UNLOAD_TIMER] [--use-deepspeed] [--no-cache-speaker] [-P PORT] [-H HOST]
                 [-L {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

OpenedAI Speech API Server

options:
  -h, --help            show this help message and exit
  --xtts_device XTTS_DEVICE
                        Set the device for the xtts model. The special value of 'none' will use piper for all models. (default: cuda)
  --preload PRELOAD     Preload a model (Ex. 'xtts' or 'xtts_v2.0.2'). By default it's loaded on first use. (default: None)
  --unload-timer UNLOAD_TIMER
                        Idle unload timer for the XTTS model in seconds, Ex. 900 for 15 minutes (default: None)
  --use-deepspeed       Use deepspeed with xtts (this option is unsupported) (default: False)
  --no-cache-speaker    Don't use the speaker wav embeddings cache (default: False)
  -P PORT, --port PORT  Server tcp port (default: 8000)
  -H HOST, --host HOST  Host to listen on, Ex. 0.0.0.0 (default: 0.0.0.0)
  -L {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the log level (default: INFO)
```
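For example (a hypothetical combination of the options above), you could run a piper-only server (no GPU required) on port 8001 with debug logging:
```shell
# 'none' disables XTTS and uses piper for all models
python speech.py --xtts_device none -P 8001 -L DEBUG
```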
## Sample Usage
You can use it like this:
```shell
curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy",
    "response_format": "mp3",
    "speed": 1.0
  }' > speech.mp3
```
Or just like this:
```shell
curl -s http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
    "input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3
```
Or like this example from the [OpenAI Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech):
```python
import openai

client = openai.OpenAI(
  # This part is not needed if you set these environment variables before importing openai
  # export OPENAI_API_KEY=sk-11111111111
  # export OPENAI_BASE_URL=http://localhost:8000/v1
  api_key="sk-111111111",
  base_url="http://localhost:8000/v1",
)

with client.audio.speech.with_streaming_response.create(
  model="tts-1",
  voice="alloy",
  input="Today is a wonderful day to build something people love!"
) as response:
  response.stream_to_file("speech.mp3")
```
Also see the `say.py` sample application for an example of how to use the openai-python API.
```shell
# play the audio, requires 'pip install playsound'
python say.py -t "The quick brown fox jumped over the lazy dog." -p
# save to a file in flac format
python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac
```
You can also try the included `audio_reader.py` for listening to longer text and streamed input.
Example usage:
```bash
python audio_reader.py -s 2 < LICENSE # read the software license - fast
```
## OpenAI API Documentation and Guide
* [OpenAI Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech)
* [OpenAI API Reference](https://platform.openai.com/docs/api-reference/audio/createSpeech)
## Custom Voices Howto
### Piper
1. Select the piper voice and model from the [piper samples](https://rhasspy.github.io/piper-samples/)
2. Update the `config/voice_to_speaker.yaml` with a new section for the voice, for example:
```yaml
...
tts-1:
  ryan:
    model: voices/en_US-ryan-high.onnx
    speaker: # default speaker
```
3. New models will be downloaded as needed, or you can download them in advance with `download_voices_tts-1.sh`. For example:
```shell
bash download_voices_tts-1.sh en_US-ryan-high
```
### Coqui XTTS v2
Coqui XTTS v2 voice cloning can work with as little as 6 seconds of clear audio. To create a custom voice clone, you must prepare a WAV file sample of the voice.
#### Guidelines for preparing good sample files for Coqui XTTS v2
* Mono (single channel) 22050 Hz WAV file
* 6-30 seconds long - longer isn't always better (I've had some good results with as little as 4 seconds)
* Low noise (no hiss or hum)
* No partial words, breathing, laughing, music or background sounds
* An even speaking pace with a variety of words is best, like in interviews or audiobooks.
You can use FFmpeg to prepare your audio files, here are some examples:
```shell
# convert a multi-channel audio file to mono, set the sample rate to 22050 Hz, trim to 6 seconds, and output as a WAV file
ffmpeg -i input.mp3 -ac 1 -ar 22050 -t 6 -y me.wav
# use a simple noise filter to clean up the audio, and select a start time for sampling
ffmpeg -i input.wav -af "highpass=f=200, lowpass=f=3000" -ac 1 -ar 22050 -ss 00:13:26.2 -t 6 -y me.wav
# a more complex noise reduction setup, including volume adjustment
ffmpeg -i input.mkv -af "highpass=f=200, lowpass=f=3000, volume=5, afftdn=nf=25" -ac 1 -ar 22050 -ss 00:13:26.2 -t 6 -y me.wav
```
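You can check the result with `ffprobe` (included with ffmpeg); the output should show a single mono audio stream at 22050 Hz:
```shell
# print the stream details of the prepared sample
ffprobe -hide_banner me.wav
```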
Once your WAV file is prepared, save it in the `/voices/` directory and update the `config/voice_to_speaker.yaml` file with the new file name.
For example:
```yaml
...
tts-1-hd:
  me:
    model: xtts
    speaker: voices/me.wav # this could be you
```
## Multilingual
Multilingual cloning support was added in version 0.11.0 and is available only with the XTTS v2 model. To use multilingual voices with piper, simply download a language-specific voice.
Coqui XTTSv2 has support for multiple languages: English (`en`), Spanish (`es`), French (`fr`), German (`de`), Italian (`it`), Portuguese (`pt`), Polish (`pl`), Turkish (`tr`), Russian (`ru`), Dutch (`nl`), Czech (`cs`), Arabic (`ar`), Chinese (`zh-cn`), Hungarian (`hu`), Korean (`ko`), Japanese (`ja`), and Hindi (`hi`). When not set, an attempt will be made to automatically detect the language, falling back to English (`en`).
Unfortunately, the OpenAI API does not support a language parameter, but you can create your own custom speaker voice and set the language for that.
1) Create the WAV file for your speaker, as in [Custom Voices Howto](#custom-voices-howto)
2) Add the voice to `config/voice_to_speaker.yaml` and include the correct Coqui `language` code for the speaker. For example:
```yaml
  xunjiang:
    model: xtts
    speaker: voices/xunjiang.wav
    language: zh-cn
```
3) Make sure high Unicode characters are not being filtered out by your `config/pre_process_map.yaml`. Versions before 0.11.0 included the following filter by default; if you still have these lines, remove them:
```yaml
- - '[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002702-\U000027B0\U000024C2-\U0001F251]+'
  - ''
```
4) Your new multilingual speaker voice is ready to use!
## Custom Fine-Tuned Model Support
Adding a custom xtts model is simple. Here is an example of how to add a custom fine-tuned 'halo' XTTS model.
1) Save the model folder under `voices/` (all 4 files are required, including the vocab.json from the model)
```
openedai-speech$ ls voices/halo/
config.json  vocab.json  model.pth  sample.wav
```
2) Add the custom voice entry under the `tts-1-hd` section of `config/voice_to_speaker.yaml`:
```yaml
tts-1-hd:
  ...
  halo:
    model: halo # this name must be unique
    speaker: voices/halo/sample.wav # voice sample is required
    model_path: voices/halo
```
3) The model will be loaded when you access the voice for the first time (`--preload` doesn't work with custom models yet)
## Generation Parameters
The generation of XTTSv2 voices can be fine-tuned with the following options (defaults shown below):
```yaml
tts-1-hd:
  alloy:
    model: xtts
    speaker: voices/alloy.wav
    enable_text_splitting: True
    length_penalty: 1.0
    repetition_penalty: 10
    speed: 1.0
    temperature: 0.75
    top_k: 50
    top_p: 0.85
```
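These parameters can be overridden per voice. As a hypothetical sketch (the value here is an illustration, not a recommended setting), lowering `temperature` for a cloned voice should make its output more uniform between generations:
```yaml
tts-1-hd:
  me:
    model: xtts
    speaker: voices/me.wav
    temperature: 0.5 # lower than the 0.75 default for more consistent output
```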