Commit
·
3259f6e
1
Parent(s):
675a69a
update README with inference
README.md
CHANGED
@@ -22,7 +22,47 @@ metrics:
|
|
22 |
|
23 |
# EfficientTDNN
|
24 |
|
25 |
-
26 |
|
27 |
- **Dynamic Kernel**: The model enables various kernel sizes in {1,3,5}, `kernel/kernel.torchparams`.
|
28 |
- **Dynamic Depth**: The model additionally enables depths in {2,3,4}, building on the **Dynamic Kernel** version, `depth/depth.torchparams`.
|
@@ -59,10 +99,10 @@ Furthermore, some subnets are given in the form of the weights of batchnorm corr
|
|
59 |
|
60 |
The tag is described as follows.
|
61 |
|
62 |
-
- max:
|
63 |
-
- Kmin:
|
64 |
-
- Dmin:
|
65 |
-
- C1min:
|
66 |
-
- C2min:
|
67 |
|
68 |
More details about EfficientTDNN can be found in the paper [EfficientTDNN](https://arxiv.org/abs/2103.13581).
|
|
|
22 |
|
23 |
# EfficientTDNN
|
24 |
|
25 |
+
This repository provides the tools needed to perform speaker verification with EfficientTDNN, a NAS (neural architecture search) alternative.
|
26 |
+
The system can be used to extract speaker embeddings with models of different sizes.
|
27 |
+
It is trained on the VoxCeleb2 training data using data augmentation.
|
28 |
+
Model performance on the VoxCeleb1 test set (cleaned), Vox1-O, is reported as follows.
|
29 |
+
|
30 |
+
| Supernet Stage | Subnet | MACs (3-second) | Params | EER(%) w/ AS-Norm | EER(%) w/o AS-Norm | minDCF w/ AS-Norm | minDCF w/o AS-Norm |
|
31 |
+
|:-------------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|
|
32 |
+
| depth | Base | 1.45G | 5.79M | 0.94 | 1.14 | 0.089 | 0.106 |
|
33 |
+
| width 1 | Mobile | 570.98M | 2.42M | 1.41 | 1.61 | 0.124 | 0.152 |
|
34 |
+
| width 2 | Small | 204.07M | 899.20K | 2.20 | 2.33 | 0.219 | 0.241 |
|
35 |
+
|
36 |
+
The details of the three subnets are:
|
37 |
+
|
38 |
+
- Base: (3, [512, 512, 512, 512], [5, 3, 3, 3], 1536)
|
39 |
+
- Mobile: (3, [384, 256, 256, 256], [5, 3, 3, 3], 768)
|
40 |
+
- Small: (2, [256, 256, 256], [3, 3, 3], 400)
|
41 |
+
|
42 |
+
## Compute your speaker embeddings
|
43 |
+
|
44 |
+
```python
|
45 |
+
import torchaudio
|
46 |
+
from sugar.models import WrappedModel
|
47 |
+
wav_file = f"{vox1_root}/id10270/x6uYqmx31kE/00001.wav"  # vox1_root: path to your local VoxCeleb1 data, set beforehand
|
48 |
+
signal, fs = torchaudio.load(wav_file)
|
49 |
+
|
50 |
+
repo_id = "mechanicalsea/efficient-tdnn"
|
51 |
+
supernet_filename = "depth/depth.torchparams"
|
52 |
+
subnet_filename = "depth/depth.ecapa-tdnn.3.512.512.512.512.5.3.3.3.1536.bn.tar"
|
53 |
+
subnet, info = WrappedModel.from_pretrained(
|
54 |
+
    repo_id=repo_id, supernet_filename=supernet_filename, subnet_filename=subnet_filename)
|
55 |
+
|
56 |
+
embedding = subnet(signal)
|
57 |
+
```
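
The `subnet_filename` in the example above appears to encode the Base subnet tuple `(3, [512, 512, 512, 512], [5, 3, 3, 3], 1536)` as dot-separated numbers. The sketch below reproduces that one filename from the tuple. The helper name `subnet_filename`, its parameter names, and the idea that other stages follow the same pattern are assumptions inferred from this single example, not a documented API of the repository.

```python
def subnet_filename(stage, depth, channels, kernels, last_width):
    """Hypothetical helper: build a subnet weight filename.

    The pattern is inferred from the one filename shown in this README;
    it is an assumption that other stages and subnets follow it.
    """
    # Flatten the tuple into one dot-separated run of integers.
    fields = [depth, *channels, *kernels, last_width]
    encoded = ".".join(str(value) for value in fields)
    return f"{stage}/{stage}.ecapa-tdnn.{encoded}.bn.tar"


# The Base subnet as listed in the README.
base = (3, [512, 512, 512, 512], [5, 3, 3, 3], 1536)
print(subnet_filename("depth", *base))
```

Before relying on a generated name, check that the file actually exists in the model repository, since only the Base filename is confirmed here.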
|
58 |
+
|
59 |
+
## Inference on GPU
|
60 |
+
|
61 |
+
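To perform inference on the GPU, add `subnet = subnet.to(device)` after calling the `from_pretrained` method, where `device` is, for example, `torch.device("cuda")`; the input `signal` typically needs to be moved to the same device.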
|
62 |
+
|
63 |
+
## Model Description
|
64 |
+
|
65 |
+
Models are listed as follows.
|
66 |
|
67 |
- **Dynamic Kernel**: The model enables various kernel sizes in {1,3,5}, `kernel/kernel.torchparams`.
|
68 |
- **Dynamic Depth**: The model additionally enables depths in {2,3,4}, building on the **Dynamic Kernel** version, `depth/depth.torchparams`.
|
|
|
99 |
|
100 |
The tag is described as follows.
|
101 |
|
102 |
+
- max: (4, [512, 512, 512, 512, 512], [5, 5, 5, 5, 5], 1536)
|
103 |
+
- Kmin: (4, [512, 512, 512, 512, 512], [1, 1, 1, 1, 1], 1536)
|
104 |
+
- Dmin: (2, [512, 512, 512], [1, 1, 1], 1536)
|
105 |
+
- C1min: (2, [256, 256, 256], [1, 1, 1], 768)
|
106 |
+
- C2min: (2, [128, 128, 128], [1, 1, 1], 384)
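
As a rough aid for reading the tags above, the sketch below stores each tuple in a small dataclass and checks it against the ranges stated in this README (kernel sizes in {1,3,5}, depths in {2,3,4}). The field names (`depth`, `channels`, `kernels`, `last_width`) and the rule that each list holds `depth + 1` entries are assumptions read off the listed examples, not definitions taken from the paper or the code.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_KERNELS = {1, 3, 5}  # from the Dynamic Kernel description
ALLOWED_DEPTHS = {2, 3, 4}   # from the Dynamic Depth description


@dataclass
class SubnetConfig:
    # Field names are hypothetical labels for the four tuple positions.
    depth: int
    channels: List[int]
    kernels: List[int]
    last_width: int

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty if none are found."""
        issues = []
        if self.depth not in ALLOWED_DEPTHS:
            issues.append(f"depth {self.depth} not in {sorted(ALLOWED_DEPTHS)}")
        if any(k not in ALLOWED_KERNELS for k in self.kernels):
            issues.append(f"kernel sizes {self.kernels} not all in {sorted(ALLOWED_KERNELS)}")
        # Assumption: every listed example has depth + 1 entries per list.
        if len(self.channels) != self.depth + 1:
            issues.append("expected depth + 1 channel entries")
        if len(self.kernels) != len(self.channels):
            issues.append("channels and kernels differ in length")
        return issues


# The tags exactly as listed in the README.
TAGS = {
    "max": SubnetConfig(4, [512] * 5, [5] * 5, 1536),
    "Kmin": SubnetConfig(4, [512] * 5, [1] * 5, 1536),
    "Dmin": SubnetConfig(2, [512] * 3, [1] * 3, 1536),
    "C1min": SubnetConfig(2, [256] * 3, [1] * 3, 768),
    "C2min": SubnetConfig(2, [128] * 3, [1] * 3, 384),
}

for name, config in TAGS.items():
    print(name, config.problems() or "ok")
```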
|
107 |
|
108 |
More details about EfficientTDNN can be found in the paper [EfficientTDNN](https://arxiv.org/abs/2103.13581).
|