gabrielclark3330 committed on
Commit 831c771 · verified · 1 Parent(s): e8f1618

Add detailed readme with install and examples

Files changed (1)
  1. README.md +102 -4
README.md CHANGED
@@ -1,10 +1,108 @@
  ---
  license: apache-2.0
  ---
- Zonos-v0.1-transformer is a leading open-weight text-to-speech transformer model. In our testing we have found it performs comparably or better in expressiveness and quality compared to leading TTS providers.

- Zonos enables highly expressive and naturalistic speech generation from text prompts given a speaker embedding or audio prefix. Zonos is capable of high fidelity voice cloning given clips of between 5 and 30s of speech. Zonos also can be conditioned based on speaking rate, pitch standard deviation, audio quality, and emotions such as sadness, fear, anger, happiness, and joy. Zonos outputs speech natively at 44Khz.

- Zonos was trained on approximately 200k hours of primarily English speech data.

- Zonos follows a simple architecture comprising text normalisation and phonemization by espeak, followed by DAC token prediction by a transformer backbone.
  ---
  license: apache-2.0
  ---
+ # Zonos-v0.1
+
+ <div align="center">
+ <img src="https://github.com/Zyphra/Zonos/blob/main/content/ZonosHeader.png?raw=true"
+ alt="Title card"
+ style="width: 500px;
+ height: auto;
+ object-position: center top;">
+ </div>
+
+ Zonos-v0.1 is a leading open-weight text-to-speech model, delivering expressiveness and quality on par with or even surpassing top TTS providers.
+
+ It enables highly naturalistic speech generation from text prompts when given a speaker embedding or audio prefix. With just 5 to 30 seconds of speech, Zonos can achieve high-fidelity voice cloning. It also allows conditioning based on speaking rate, pitch variation, audio quality, and emotions such as sadness, fear, anger, happiness, and joy. The model outputs speech natively at 44kHz.
+
+ Trained on approximately 200,000 hours of primarily English speech data, Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An architecture overview can be seen below.
+
+ <div align="center">
+ <img src="https://github.com/Zyphra/Zonos/blob/main/content/ArchitectureDiagram.png?raw=true"
+ alt="Architecture diagram"
+ style="width: 1000px;
+ height: auto;
+ object-position: center top;">
+ </div>
+
+ Read more about our models [here](https://www.zyphra.com/post/beta-release-of-zonos-v0-1).
+
+ ## Features
+ * Zero-shot TTS with voice cloning: Input desired text and a 10-30s speaker sample to generate high-quality TTS output
+ * Audio prefix inputs: Add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering, which are challenging to obtain from pure voice cloning
+ * Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German
+ * Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio, including speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear
+ * Fast: our model runs with a real-time factor of ~2x on an RTX 4090
+ * WebUI gradio interface: Zonos comes packaged with an easy-to-use Gradio interface for generating speech
+ * Simple installation and deployment: Zonos can be installed and deployed simply using the Dockerfile packaged with our repository
+
+
+ ## Docker Installation
+
+ ```bash
+ git clone git@github.com:Zyphra/Zonos.git
+ cd Zonos
+
+ # For gradio
+ docker compose up
+
+ # Or for development you can do
+ docker build -t zonos .
+ docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t zonos
+ cd /Zonos
+ python3 sample.py # this will generate a sample.wav in /Zonos
+ ```
+
+ ## DIY Installation
+ ### eSpeak
+
+ ```bash
+ apt install espeak-ng
+ ```
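+
+ If you want to confirm that eSpeak is visible before setting up the Python environment, an optional check like the one below should work. This snippet is a sketch, not part of the official instructions.
+
+ ```python
+ # Optional sanity check: confirm espeak-ng is on PATH (not part of the official setup).
+ import subprocess
+
+ result = subprocess.run(["espeak-ng", "--version"], capture_output=True, text=True, check=True)
+ print(result.stdout.strip())
+ ```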
+
+ ### Python dependencies
+
+ Make sure you have a recent version of [uv](https://docs.astral.sh/uv/#installation), then run the following commands in sequence:
+
+ ```bash
+ uv venv
+ uv sync --no-group main
+ uv sync
+ ```
+
+ ## Usage example
+
+ ```bash
+ python3 sample.py
+ ```
+ This will produce `sample.wav` in the `Zonos` directory.
+
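+ To quickly sanity-check the output, you can inspect the generated file with torchaudio. This is an optional sketch, assuming `sample.wav` was written to the current directory:
+
+ ```python
+ # Optional: inspect the generated file (assumes sample.py wrote sample.wav here).
+ import torchaudio
+
+ info = torchaudio.info("sample.wav")
+ print(f"{info.sample_rate} Hz, {info.num_frames / info.sample_rate:.2f} s, {info.num_channels} channel(s)")
+ ```
+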
+ ## Getting started with Zonos in Python
+ Once you have Zonos installed, try generating audio programmatically in Python:
+ ```python
+ import torch
+ import torchaudio
+ from zonos.model import Zonos
+ from zonos.conditioning import make_cond_dict
+
+ # Use the hybrid model with "Zyphra/Zonos-v0.1-hybrid"
+ model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
+ model.bfloat16()
+
+ # Compute a speaker embedding from a short reference clip for voice cloning
+ wav, sampling_rate = torchaudio.load("./exampleaudio.mp3")
+ spk_embedding = model.embed_spk_audio(wav, sampling_rate)
+
+ torch.manual_seed(421)
+
+ cond_dict = make_cond_dict(
+     text="Hello, world!",
+     speaker=spk_embedding.to(torch.bfloat16),
+     language="en-us",
+ )
+ conditioning = model.prepare_conditioning(cond_dict)
+
+ # Generate DAC codes, then decode them back to a waveform with the autoencoder
+ codes = model.generate(conditioning)
+
+ wavs = model.autoencoder.decode(codes).cpu()
+ torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
+ ```
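+
+ The conditioning controls listed under Features (speaking rate, pitch variation, emotion, language) are passed through the same `make_cond_dict` helper. The sketch below is a hedged extension of the example above, reusing its `model` and `spk_embedding`; the argument names `speaking_rate` and `pitch_std` and the language code `"de"` are assumptions based on the feature list, so check `zonos.conditioning.make_cond_dict` for the exact names and accepted ranges.
+
+ ```python
+ # Hedged sketch: reuses `model` and `spk_embedding` from the example above.
+ # `speaking_rate` and `pitch_std` are assumed argument names for the rate and
+ # pitch controls described under Features; verify them in zonos.conditioning.
+ cond_dict = make_cond_dict(
+     text="Guten Tag, wie geht es dir?",
+     speaker=spk_embedding.to(torch.bfloat16),
+     language="de",       # German; multilingual support also covers ja, zh, fr (codes assumed)
+     speaking_rate=15.0,  # assumed control for speaking rate
+     pitch_std=45.0,      # assumed control for pitch variation (higher = more expressive)
+ )
+ conditioning = model.prepare_conditioning(cond_dict)
+ codes = model.generate(conditioning)
+ wavs = model.autoencoder.decode(codes).cpu()
+ torchaudio.save("sample_de.wav", wavs[0], model.autoencoder.sampling_rate)
+ ```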