@@ -22,7 +22,47 @@ metrics:
 # EfficientTDNN
-Model Version are listed as follows.
 - **Dynamic Kernel**: The model supports kernel sizes in {1,3,5}, `kernel/kernel.torchparams`.
 - **Dynamic Depth**: Built on the **Dynamic Kernel** version, the model additionally supports depths in {2,3,4}, `depth/depth.torchparams`.
@@ -59,10 +99,10 @@ Furthermore, some subnets are given in the form of the weights of batchnorm corr
 The tag is described as follows.
-- max: `(4, [512, 512, 512, 512, 512], [5, 5, 5, 5, 5], 1536)`
-- Kmin: `(4, [512, 512, 512, 512, 512], [1, 1, 1, 1, 1], 1536)`
-- Dmin: `(2, [512, 512, 512], [1, 1, 1], 1536)`
-- C1min: `(2, [256, 256, 256], [1, 1, 1], 768)`
-- C2min: `(2, [128, 128, 128], [1, 1, 1], 384)`
 More details about EfficientTDNN can be found in the paper [EfficientTDNN](https://arxiv.org/abs/2103.13581).

 # EfficientTDNN
+This repository provides the tools needed to perform speaker verification with EfficientTDNN, a NAS alternative.
+The system can be used to extract speaker embeddings with models of different sizes.
+It is trained on the VoxCeleb2 training data with data augmentation.
+Model performance on the VoxCeleb1 test set (cleaned), also known as Vox1-O, is reported below.
+| Supernet Stage | Subnet | MACs (3-second) | Params | EER(%) w/ AS-Norm | EER(%) w/o AS-Norm | minDCF w/ AS-Norm | minDCF w/o AS-Norm |
+|:-------------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|
+| depth | Base | 1.45G | 5.79M | 0.94 | 1.14 | 0.089 | 0.106 |
+| width 1 | Mobile | 570.98M | 2.42M | 1.41 | 1.61 | 0.124 | 0.152 |
+| width 2 | Small | 204.07M | 899.20K | 2.20 | 2.33 | 0.219 | 0.241 |
+The configurations of the three subnets are:
+- Base: (3, [512, 512, 512, 512], [5, 3, 3, 3], 1536)
+- Mobile: (3, [384, 256, 256, 256], [5, 3, 3, 3], 768)
+- Small: (2, [256, 256, 256], [3, 3, 3], 400)
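The numbers in the Base tuple match, in order, the numbers in the subnet checkpoint name `depth/depth.ecapa-tdnn.3.512.512.512.512.5.3.3.3.1536.bn.tar` used in this README. A minimal sketch of that mapping follows. It assumes the same naming pattern holds for other subnets, which is inferred from this single filename and is not documented. `subnet_filename` and its variable names are hypothetical and are not part of `sugar`.

```python
from typing import Sequence, Tuple

# (depth, per-layer widths, per-layer kernel sizes, last value of the tuple).
# These field names are assumptions; the README does not name the fields.
SubnetConfig = Tuple[int, Sequence[int], Sequence[int], int]


def subnet_filename(stage: str, config: SubnetConfig) -> str:
    """Build a subnet checkpoint path from a config tuple.

    Hypothetical helper: the pattern is inferred from one filename only.
    """
    depth, widths, kernels, last = config
    if len(widths) != len(kernels):
        raise ValueError("widths and kernels must have the same length")
    # Flatten the tuple in order and join the values with dots.
    fields = [depth, *widths, *kernels, last]
    return f"{stage}/{stage}.ecapa-tdnn." + ".".join(str(v) for v in fields) + ".bn.tar"


base = (3, [512, 512, 512, 512], [5, 3, 3, 3], 1536)
print(subnet_filename("depth", base))
# depth/depth.ecapa-tdnn.3.512.512.512.512.5.3.3.3.1536.bn.tar
```

Before relying on a generated name, check that the file actually exists in the repository, since only the Base filename is confirmed here.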
+## Compute your speaker embeddings
+```python
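+import torchaudio
+from sugar.models import WrappedModel
+
+# Placeholder: set this to your local VoxCeleb1 root directory.
+vox1_root = "/path/to/VoxCeleb1"
+wav_file = f"{vox1_root}/id10270/x6uYqmx31kE/00001.wav"
+signal, fs = torchaudio.load(wav_file)  # waveform tensor and sample rate
+
+# Supernet weights and the batchnorm weights of the chosen subnet.
+repo_id = "mechanicalsea/efficient-tdnn"
+supernet_filename = "depth/depth.torchparams"
+subnet_filename = "depth/depth.ecapa-tdnn.3.512.512.512.512.5.3.3.3.1536.bn.tar"
+subnet, info = WrappedModel.from_pretrained(
+    repo_id=repo_id, supernet_filename=supernet_filename, subnet_filename=subnet_filename)
+
+# Extract the speaker embedding for the loaded utterance.
+embedding = subnet(signal)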
+```
+## Inference on GPU
+To perform inference on the GPU, add `subnet = subnet.to(device)` after calling the `from_pretrained` method, and move the input to the same device, e.g. `signal = signal.to(device)`.
+## Model Description
+The following models are provided.
 - **Dynamic Kernel**: The model supports kernel sizes in {1,3,5}, `kernel/kernel.torchparams`.
 - **Dynamic Depth**: Built on the **Dynamic Kernel** version, the model additionally supports depths in {2,3,4}, `depth/depth.torchparams`.
 The tags denote the following subnet configurations.
+- max: (4, [512, 512, 512, 512, 512], [5, 5, 5, 5, 5], 1536)
+- Kmin: (4, [512, 512, 512, 512, 512], [1, 1, 1, 1, 1], 1536)
+- Dmin: (2, [512, 512, 512], [1, 1, 1], 1536)
+- C1min: (2, [256, 256, 256], [1, 1, 1], 768)
+- C2min: (2, [128, 128, 128], [1, 1, 1], 384)
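As a quick sanity check, the tags above can be compared with the search space described earlier (kernel sizes in {1,3,5}, depths in {2,3,4}). This is a minimal sketch: the names `TAGS` and `check_tag` are illustrative, and the rule that each list has one more entry than the depth is an observation from the listed tuples, not something the authors state.

```python
KERNEL_SIZES = {1, 3, 5}  # from the Dynamic Kernel description
DEPTHS = {2, 3, 4}        # from the Dynamic Depth description

# Tag tuples copied from the list above.
TAGS = {
    "max": (4, [512, 512, 512, 512, 512], [5, 5, 5, 5, 5], 1536),
    "Kmin": (4, [512, 512, 512, 512, 512], [1, 1, 1, 1, 1], 1536),
    "Dmin": (2, [512, 512, 512], [1, 1, 1], 1536),
    "C1min": (2, [256, 256, 256], [1, 1, 1], 768),
    "C2min": (2, [128, 128, 128], [1, 1, 1], 384),
}


def check_tag(config):
    """Return True if a tuple fits the search space (assumed rules, see lead-in)."""
    depth, widths, kernels, _last = config
    return (
        depth in DEPTHS
        and all(k in KERNEL_SIZES for k in kernels)
        # Observed pattern: both lists have depth + 1 entries.
        and len(widths) == len(kernels) == depth + 1
    )


for name, cfg in TAGS.items():
    print(name, check_tag(cfg))
```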
 More details about EfficientTDNN can be found in the paper [EfficientTDNN](https://arxiv.org/abs/2103.13581).