ggmbr committed · Commit 8dc32b4 · Parent: 6456d11

Update README.md

Files changed (1): README.md (+5 −1)
README.md CHANGED
@@ -64,7 +64,9 @@ nt_extractor.eval()
 ```
 
 You may have noticed that the model produces normalized vectors as embeddings.
-Next, we define a function that extracts the non-timbral embedding from an audio signal. In this tutorial version, the audio file is expected to be sampled at 16kHz.
+
+Next, we define a function that extracts the non-timbral embedding from an audio signal. In this tutorial version, the audio file is expected to be sampled at 16 kHz.
+Depending on the available memory (CPU or GPU), you may change the value of MAX_SIZE, which is used to truncate long audio signals.
 
 ```
 import torchaudio
@@ -99,6 +101,8 @@ the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/me
 (with a decision threshold of **0.467**). This value can be interpreted as the ability to identify speakers only with non-timbral cues. A discussion about this interpretation can be
 found in the paper mentioned above, as well as other experiments showing correlations between these embeddings and non-timbral voice attributes.
 
+Please note that the EER value can vary a little depending on the MAX_SIZE used to truncate long audios (30 seconds at most in our case).
+
 # Limitations
 The fine-tuning data used to produce this model (VoxCeleb, VCTK) are mostly in English, which may affect performance on other languages.
108