---
license: cc-by-sa-3.0
tags:
- Speaker traits
- Voice
- Speaker
language:
- en
base_model:
- microsoft/wavlm-large
datasets:
- VCTK
- VoxCeleb
---

# Non-timbral Embeddings extractor
This model produces embeddings that represent the non-timbral traits (prosody, accent, ...) of a speaker's voice. These embeddings can be used in the same way as classical automatic speaker verification (ASV) embeddings: to compare two voice signals, extract an embedding for each of them and compute the cosine similarity between the two embeddings. 
The main difference from classical ASV embeddings is that here only the non-timbral traits are compared.

The model has been derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large). 

The next section explains how to compute these non-timbral embeddings.


# Usage
This first code snippet defines the model and downloads the pretrained weights:

```
import torch
import torch.nn as nn
from transformers.models.wavlm.modeling_wavlm import WavLMPreTrainedModel, WavLMModel

class TopLayers(nn.Module):
    def __init__(self, embd_size = 250, top_interm_size = 512):
        super(TopLayers, self).__init__()
        self.affine1 = nn.Conv1d(in_channels=2048, out_channels=top_interm_size, kernel_size=1)
        self.batchnorm1 = nn.BatchNorm1d(num_features=top_interm_size, affine=False, eps=1e-03)
        self.affine2 = nn.Conv1d(in_channels=top_interm_size, out_channels=embd_size, kernel_size=1)
        self.batchnorm2 = nn.BatchNorm1d(num_features=embd_size, affine=False, eps=1e-03)
        self.activation = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.batchnorm1(self.activation(self.affine1(x)))
        out = self.batchnorm2(self.activation(self.affine2(out)))
        # L2-normalize the output embedding
        return nn.functional.normalize(out[:, :, 0])
    
class EmbeddingsModel(WavLMPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.wavlm = WavLMModel(config)
        self.top_layers = TopLayers(config.embd_size, config.top_interm_size)
        
    def forward(self, input_values):
        # MVN normalization
        x_norm = (input_values - input_values.mean(dim=1).unsqueeze(1)) / (input_values.std(dim=1).unsqueeze(1))
        base_out = self.wavlm(input_values=x_norm, output_hidden_states=False).last_hidden_state
        # statistics pooling: concatenate the mean and standard deviation over time
        v = base_out.var(dim=1).clamp(min=1e-10)
        x_stats = torch.cat((base_out.mean(dim=1), v.pow(0.5)), dim=1).unsqueeze(dim=2)
        return self.top_layers(x_stats)

nt_extractor = EmbeddingsModel.from_pretrained("ggmbr/wnt")
nt_extractor.eval()
```

You may have noticed that the model produces L2-normalized vectors as embeddings.
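
As a quick, purely illustrative check, you can run the extractor on a short dummy waveform and verify that the output is a single unit-norm vector (the exact embedding size comes from the downloaded configuration):

```
# dummy 2-second waveform at 16 kHz, only to check the output format
dummy = torch.randn(1, 32000)
with torch.no_grad():
    emb = nt_extractor(dummy)
print(emb.shape)               # (1, embedding_size)
print(float(emb.norm(dim=1)))  # ~1.0, since embeddings are L2-normalized
```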

Next, we define a function that extracts the non-timbral embedding from an audio signal. In this tutorial version, the audio file is expected to be sampled at 16 kHz. 
Depending on the available memory (CPU or GPU), you may change the value of MAX_SIZE, which is used to truncate long audio signals.

```
import torchaudio

MAX_SIZE = 320000 # max number of audio samples

def compute_embedding(fnm, model):
    sig, sr = torchaudio.load(fnm)
    assert sr == 16000, "please convert your audio file to a sampling rate of 16 kHz"
    # average channels to mono and move the signal to the model's device
    sig = sig.mean(dim=0).to(next(model.parameters()).device)
    if sig.shape[0] > MAX_SIZE:
        print(f"truncating long signal {fnm}")
        sig = sig[:MAX_SIZE]
    embd = model(sig.unsqueeze(dim=0))
    return embd.clone().detach()
```
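
If your recordings are not sampled at 16 kHz, they can be resampled beforehand, for instance with torchaudio. This is a small sketch (the helper name `load_as_16k` is ours); the target rate matches the assertion above:

```
import torchaudio

def load_as_16k(fnm):
    # load the file and resample it to 16 kHz if needed
    sig, sr = torchaudio.load(fnm)
    if sr != 16000:
        sig = torchaudio.functional.resample(sig, orig_freq=sr, new_freq=16000)
    return sig, 16000
```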

Finally, we can compute the embeddings of two different files and compare them with cosine similarity:

```
wav1 = "/data/AUDIO/speakerid/corpus/voxceleb1_2019/test/wav/id10270/x6uYqmx31kE/00001.wav"
wav2 = "/data/AUDIO/speakerid/corpus/voxceleb1_2019/test/wav/id10270/8jEAjG6SegY/00008.wav"

e1 = compute_embedding(wav1, nt_extractor)
e2 = compute_embedding(wav2, nt_extractor)
sim = float(torch.matmul(e1,e2.t()))
```
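
Since the embeddings returned by the model are already L2-normalized, this dot product is exactly the cosine similarity. As a purely illustrative check, the same value can be obtained with `torch.nn.functional.cosine_similarity`:

```
import torch.nn.functional as F

# equivalent to the dot product of unit-norm embeddings
sim_check = float(F.cosine_similarity(e1, e2))
assert abs(sim - sim_check) < 1e-5
```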

# Evaluations
Although it was not directly designed for this use case, the model can be evaluated on a standard ASV task. Applied to 
the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt), it reaches an equal error rate (EER, lower is better; a random predictor scores 50%) of **10.681%** 
(with a decision threshold of **0.467**). This value can be interpreted as the ability to identify speakers using only non-timbral cues. A discussion of this interpretation can be
found in the paper referenced below, together with other experiments showing correlations between these embeddings and non-timbral voice attributes.
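
If you want to turn the similarity computed in the Usage section into a binary same-speaker decision, the reported threshold can serve as a starting point. This is a minimal sketch: the value 0.467 is the EER operating point on VoxCeleb1-clean and may not be optimal for other data:

```
DECISION_THRESHOLD = 0.467  # EER operating point reported on VoxCeleb1-clean

same_speaker = sim >= DECISION_THRESHOLD
print(f"similarity: {sim:.3f} -> same speaker: {same_speaker}")
```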

Please note that the EER value can vary slightly depending on the MAX_SIZE used to truncate long audio signals (at most 30 seconds in our case).

# Limitations
The fine-tuning data used to produce this model (VoxCeleb, VCTK) are mostly in English, which may affect performance on other languages.

# Publication
Details about the method used to build this model were published at Interspeech 2024 in the paper 
[Disentangling prosody and timbre embeddings via voice conversion](https://www.isca-archive.org/interspeech_2024/gengembre24_interspeech.pdf). 
In that paper, the model is referred to as W-PRO, and the *non-timbral embeddings* are called *prosody embeddings*
(a term we ultimately found somewhat confusing, hence the change of name here).

### Citation
Gengembre, N., Le Blouch, O., Gendrot, C. (2024) Disentangling prosody and timbre embeddings via voice conversion. Proc. Interspeech 2024, 2765-2769, doi: 10.21437/Interspeech.2024-207

### BibTeX citation
```
@inproceedings{gengembre24_interspeech,
  title     = {Disentangling prosody and timbre embeddings via voice conversion},
  author    = {Nicolas Gengembre and Olivier {Le Blouch} and Cédric Gendrot},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {2765--2769},
  doi       = {10.21437/Interspeech.2024-207},
  issn      = {2958-1796},
}
```