---
license: cc-by-sa-3.0
tags:
- Voice
- Speaker
language:
- en
base_model:
- microsoft/wavlm-large
datasets:
- VoxCeleb
---

# Speaker Embeddings Extractor

This model produces speaker embeddings for automatic speaker verification (ASV).
Speaker verification is performed by applying this model to two voice signals to compute an embedding vector for each. The cosine similarity between the two embeddings can then be used to compare the voices.

The model is derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large).

# Usage

The following code snippet uses the file [spk_embeddings.py](https://huggingface.co/Orange/w-pro/blob/main/spk_embeddings.py)
to build the architecture of the model.
Its weights are then downloaded from this repository.

```
from spk_embeddings import EmbeddingsModel, compute_embedding
import torch

# build the architecture and load the weights from this repository
model = EmbeddingsModel.from_pretrained("Orange/Speaker-wavLM-id")
model.eval()  # switch to inference mode
```

The model produces L2-normalized vectors as embeddings.

The Python file also contains the function *compute_embedding*, which computes the embedding vector of an audio file.
In this tutorial version, the audio file is expected to be sampled at 16 kHz.
Depending on the available memory (CPU or GPU), you may change the value of the *max_size* parameter,
which is used to truncate long audio signals.

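For intuition, the logic of *compute_embedding* roughly follows the sketch below. This is a simplified illustration rather than the actual code in spk_embeddings.py: the use of torchaudio, the *max_size* default of 30 seconds, and the way the model is invoked are all assumptions here.

```
import torch
import torchaudio

def compute_embedding_sketch(wav_path, model, max_size=30 * 16000):
    # load the waveform; this sketch assumes 16 kHz audio (no resampling)
    signal, sr = torchaudio.load(wav_path)
    # truncate long signals to bound memory usage
    signal = signal[:, :max_size]
    with torch.no_grad():
        emb = model(signal)  # assumed forward pass returning a (1, dim) tensor
    # L2-normalize so that a plain dot product between two embeddings
    # equals their cosine similarity
    return torch.nn.functional.normalize(emb, dim=-1)
```
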
Finally, we can compute two embeddings from two different files and compare them with a cosine similarity:

```
wav1 = "/voxceleb1_2019/test/wav/id10270/x6uYqmx31kE/00001.wav"
wav2 = "/voxceleb1_2019/test/wav/id10270/8jEAjG6SegY/00008.wav"

e1 = compute_embedding(wav1, model)
e2 = compute_embedding(wav2, model)
# the embeddings are L2-normalized, so this dot product is their cosine similarity
sim = float(torch.matmul(e1, e2.t()))

print(sim)  # 0.7334115505218506
```

# Evaluations

The model has been evaluated on the standard ASV [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt).
It achieves an Equal Error Rate (EER; lower values denote better verification, and random prediction leads to 50%) of **0.946%**
(with a decision threshold of **0.388**).

Please note that the EER value can vary slightly depending on the *max_size* used to truncate long audio signals (30 seconds in our case).

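To illustrate how this threshold would be used in practice, the similarity score from the Usage section can be turned into an accept/reject decision. A minimal sketch (the helper name is ours; only the threshold value comes from the evaluation above):

```
THRESHOLD = 0.388  # decision threshold from the VoxCeleb1-clean evaluation

def same_speaker(e1, e2, threshold=THRESHOLD):
    # accept the trial when the cosine similarity reaches the threshold
    return float(torch.matmul(e1, e2.t())) >= threshold

print(same_speaker(e1, e2))  # True: both example files are from speaker id10270
```
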
# Limitations

The fine-tuning data used to produce this model (VoxCeleb1 and 2) are mostly in English, which may affect the performance on other languages. The performance may also vary with the audio quality (recording device, background noise, ...), especially for audio conditions not covered by the training set, since no specific technique, e.g. data augmentation, was used during training to tackle this problem.

# Publication

This model was used as a baseline in the context of voice characterization (prosodic and timbral cues) in the study described in the following research paper:
[Disentangling prosody and timbre embeddings via voice conversion](https://www.isca-archive.org/interspeech_2024/gengembre24_interspeech.pdf).

In this paper the model is denoted as W-SPK. The other two models used in this study can also be found on HuggingFace:
- [W-TBR](https://huggingface.co/Orange/Speaker-wavLM-tbr) for timbre-related embeddings
- [W-PRO](https://huggingface.co/Orange/Speaker-wavLM-pro) for non-timbral embeddings

### Citation

Gengembre, N., Le Blouch, O., Gendrot, C. (2024) Disentangling prosody and timbre embeddings via voice conversion. Proc. Interspeech 2024, 2765-2769, doi: 10.21437/Interspeech.2024-207

### BibTeX citation

```
@inproceedings{gengembre24_interspeech,
  title     = {Disentangling prosody and timbre embeddings via voice conversion},
  author    = {Nicolas Gengembre and Olivier {Le Blouch} and Cédric Gendrot},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {2765--2769},
  doi       = {10.21437/Interspeech.2024-207},
  issn      = {2958-1796},
}
```

# License

Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)