---
language:
- en
tags:
- myshell
- speech-to-speech
---
<!-- might put a [width=2000 * height=xxx] img here, this size best fits git page
<img src="resources\cover.png"> -->
<img src="resources/dreamvoice.png">

# DreamVoice: Text-guided Voice Conversion

--------------------

## Introduction

DreamVoice is an approach to voice conversion (VC) that uses text-guided generation to create personalized, versatile voices.
Traditional VC methods require a recording of the target speaker at inference time; DreamVoice instead lets users specify the desired voice timbre with a text prompt.

For more details, see our Interspeech paper: [DreamVoice](https://arxiv.org/abs/2406.16314)

To listen to demos and download the dataset, visit the DreamVoice homepage: [Homepage](https://haidog-yaqub.github.io/dreamvoice_demo/)


## How to Use

To load the models, first install the required packages:

```
pip install -r requirements.txt
```
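
The examples in this README pass `device='cuda'` directly. As a minimal sketch for machines that may not have a GPU, a small helper can choose the device string at runtime. `pick_device` is a hypothetical name, not part of the dreamvoice package, and this README does not state whether the models are supported or practical on CPU.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch  # the examples in this README also import torch
    except ImportError:
        # PyTorch is missing: fall back to 'cpu' so the caller gets a clear value.
        return 'cpu'
    return 'cuda' if torch.cuda.is_available() else 'cpu'


# Hypothetical usage: pass the result wherever the examples use 'cuda'.
device = pick_device()
print(f'Using device: {device}')
```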

Then you can use the model with the following code:

- DreamVoice Plugin for FreeVC (DreamVG + [FreeVC](https://github.com/OlaWod/FreeVC))
   
```python
import torch
import librosa
import soundfile as sf
from dreamvoice import DreamVoice_Plugin
from dreamvoice.freevc_wrapper import get_freevc_models, convert

device = 'cuda'
freevc, cmodel, hps = get_freevc_models('ckpts_freevc/', 'dreamvoice/', device)

# init dreamvoice
dreamvoice = DreamVoice_Plugin(config='plugin_freevc.yaml', device=device)

# generate speaker
prompt = "old female's voice, deep and dark"
target_se = dreamvoice.gen_spk(prompt)

# content source
source_path = 'examples/test1.wav'
audio_clip = librosa.load(source_path, sr=16000)[0]
audio_clip = torch.tensor(audio_clip).unsqueeze(0).to(device)
content = cmodel(audio_clip).last_hidden_state.transpose(1, 2).to(device)

# voice conversion
output, out_sr = convert(freevc, content, target_se)
sf.write('output.wav', output, out_sr)
```

- DreamVoice Plugin for OpenVoice (DreamVG + [OpenVoice](https://github.com/myshell-ai/OpenVoice))

```python
import torch
from dreamvoice import DreamVoice_Plugin
from dreamvoice.openvoice_utils import se_extractor
from openvoice.api import ToneColorConverter

# init dreamvoice (device is reused below when moving the source embedding)
device = 'cuda'
dreamvoice = DreamVoice_Plugin(device=device)

# init openvoice
ckpt_converter = 'checkpoints_v2/converter'
openvoice = ToneColorConverter(f'{ckpt_converter}/config.json', device='cuda')
openvoice.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

# generate speaker
prompt = 'young female voice, sounds young and cute'
target_se = dreamvoice.gen_spk(prompt)
target_se = target_se.unsqueeze(-1)

# content source
source_path = 'examples/test2.wav'
source_se = se_extractor(source_path, openvoice).to(device)

# voice conversion
encode_message = "@MyShell"
openvoice.convert(
    audio_src_path=source_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path='output.wav',
    message=encode_message)
```

- DreamVoice Plugin for DiffVC (Diffusion-based VC Model)

```python
from dreamvoice import DreamVoice

# Initialize DreamVoice in plugin mode with CUDA device
dreamvoice = DreamVoice(mode='plugin', device='cuda')
# Description of the target voice
prompt = 'young female voice, sounds young and cute'
# Provide the path to the content audio and generate the converted audio
gen_audio, sr = dreamvoice.genvc('examples/test1.wav', prompt)
# Save the converted audio
dreamvoice.save_audio('gen1.wav', gen_audio, sr)

# Save the speaker embedding if you like the generated voice
dreamvoice.save_spk_embed('voice_stash1.pt')
# Load the saved speaker embedding
dreamvoice.load_spk_embed('voice_stash1.pt')
# Use the saved speaker embedding for another audio sample
gen_audio2, sr = dreamvoice.simplevc('examples/test2.wav', use_spk_cache=True)
dreamvoice.save_audio('gen2.wav', gen_audio2, sr)
```
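
Finding a voice you like usually takes a few attempts, so it can help to render one clip per prompt and stash each embedding next to it. The following is a sketch only: `audition_prompts` and the output file names are hypothetical, and it assumes `save_spk_embed` stores the embedding produced by the most recent `genvc` call, as the example above suggests. It calls only the methods shown in that example.

```python
from pathlib import Path


def audition_prompts(vc, source_path, prompts, out_dir='auditions'):
    """Render one clip per prompt and save the matching speaker embedding.

    `vc` is expected to be a DreamVoice instance created in plugin mode.
    Returns a list of (prompt, wav_path, embedding_path) tuples.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, prompt in enumerate(prompts, start=1):
        audio, sr = vc.genvc(source_path, prompt)
        wav_path = out / f'voice_{i}.wav'
        emb_path = out / f'voice_{i}.pt'
        vc.save_audio(str(wav_path), audio, sr)
        # Assumption: this saves the embedding generated by the genvc call above.
        vc.save_spk_embed(str(emb_path))
        saved.append((prompt, str(wav_path), str(emb_path)))
    return saved
```

For instance, `audition_prompts(dreamvoice, 'examples/test1.wav', ["old female's voice, deep and dark", 'young female voice, sounds young and cute'])` would write two clips and two embedding files under `auditions/`.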

## Training Guide

1. Download VCTK and LibriTTS-R.
2. Download the [DreamVoice dataset](https://haidog-yaqub.github.io/dreamvoice_demo/).
3. Extract speaker embeddings and cache them in a local path:
```
python dreamvoice/train_utils/prepare/prepare_se.py
```
4. Modify the training config and train your DreamVoice plugin:
```
cd dreamvoice/train_utils/src
accelerate launch train.py
```
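
Steps 3 and 4 can also be driven from one script. The sketch below is illustrative rather than part of the repository: it assumes it is run from the repository root after the datasets are downloaded and the config is edited, and `TRAINING_STEPS` and `run_training_steps` are hypothetical names. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Commands copied from steps 3 and 4; each entry is (argv, working directory).
TRAINING_STEPS = [
    (['python', 'dreamvoice/train_utils/prepare/prepare_se.py'], '.'),
    (['accelerate', 'launch', 'train.py'], 'dreamvoice/train_utils/src'),
]


def run_training_steps(steps=TRAINING_STEPS, dry_run=True):
    """Print each step; also execute it when dry_run is False."""
    for argv, cwd in steps:
        print(f"[{cwd}] {' '.join(argv)}")
        if not dry_run:
            # check=True raises CalledProcessError if a step exits with an error,
            # so training never starts on top of a failed embedding extraction.
            subprocess.run(argv, cwd=cwd, check=True)


run_training_steps()  # dry run: only prints the two commands
```

Call `run_training_steps(dry_run=False)` to actually launch both steps in order.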


## Extra Features

- End-to-end DreamVoice VC Model

```python
from dreamvoice import DreamVoice

# Initialize DreamVoice in end-to-end mode with CUDA device
dreamvoice = DreamVoice(mode='end2end', device='cuda')
# Description of the target voice
prompt = 'young female voice, sounds young and cute'
# Provide the path to the content audio and generate the converted audio
gen_end2end, sr = dreamvoice.genvc('examples/test1.wav', prompt)
# Save the converted audio
dreamvoice.save_audio('gen_end2end.wav', gen_end2end, sr)

# Note: End-to-end mode does not support saving speaker embeddings
# To use a voice generated in end-to-end mode, switch back to plugin mode
# and extract the speaker embedding from the generated audio
# Switch back to plugin mode
dreamvoice = DreamVoice(mode='plugin', device='cuda')
# Load the speaker audio from the previously generated file
gen_end2end2, sr = dreamvoice.simplevc('examples/test2.wav', speaker_audio='gen_end2end.wav')
# Save the new converted audio
dreamvoice.save_audio('gen_end2end2.wav', gen_end2end2, sr)
```

- DiffVC (Diffusion-based VC Model)

```python
from dreamvoice import DreamVoice

# Plugin mode can be used for traditional one-shot voice conversion
dreamvoice = DreamVoice(mode='plugin', device='cuda')
# Generate audio using traditional one-shot voice conversion
gen_tradition, sr = dreamvoice.simplevc('examples/test1.wav', speaker_audio='examples/speaker.wav')
# Save the converted audio
dreamvoice.save_audio('gen_tradition.wav', gen_tradition, sr)
```
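
The same one-shot call extends naturally to a whole folder of recordings. This is a minimal sketch, not a utility shipped with the package: `convert_folder` and the `converted` output directory are hypothetical, and only `simplevc` and `save_audio` are used, with the arguments shown in the example above.

```python
from pathlib import Path


def convert_folder(vc, in_dir, speaker_audio, out_dir='converted'):
    """Convert every .wav file in in_dir to the voice heard in speaker_audio.

    `vc` is expected to be a DreamVoice instance created in plugin mode.
    Returns the list of output paths, in sorted file-name order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(in_dir).glob('*.wav')):
        audio, sr = vc.simplevc(str(src), speaker_audio=speaker_audio)
        dst = out / src.name  # keep the original file name
        vc.save_audio(str(dst), audio, sr)
        written.append(str(dst))
    return written
```

For example, `convert_folder(dreamvoice, 'examples', 'examples/speaker.wav')` would convert each `.wav` file found directly in `examples/`.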

## Reference

If you find the code useful for your research, please consider citing:

```bibtex
@article{hai2024dreamvoice,
  title={DreamVoice: Text-Guided Voice Conversion},
  author={Hai, Jiarui and Thakkar, Karan and Wang, Helin and Qin, Zengyi and Elhilali, Mounya},
  journal={arXiv preprint arXiv:2406.16314},
  year={2024}
}
```