FBAGSTM committed 973d250 (verified) · 1 parent: f8d8fc7

Update README.md

Files changed (1): README.md (+145 −3)
---
license: apache-2.0
pipeline_tag: audio-classification
---

# Quantized Yamnet

## **Use case**: `AED`

# Model description

Yamnet is a well-known audio classification model, pre-trained on AudioSet and released by Google. The default model outputs embedding vectors of size 1024.

As the default Yamnet is a bit too large to fit on most microcontrollers (over 3M parameters), this model zoo provides a heavily downsized version of Yamnet that outputs embeddings of size 256.

We now also provide the original Yamnet (named Yamnet-1024 in this repo), with its original 3.2 million parameters, for use on the STM32N6.

Additionally, the default Yamnet provided by Google expects waveforms as input and includes custom layers that perform the conversion to mel-spectrogram and the patch extraction.
These custom layers are not included in Yamnet-256 or Yamnet-1024, as STEDGEAI cannot convert them to C code, and more efficient implementations of these operations already exist on microcontrollers.
Thus, Yamnet-256 and Yamnet-1024 expect mel-spectrogram patches of size 64x96, in (n_mels, n_frames) format.

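As an illustration, here is a minimal preprocessing sketch using librosa and the parameters listed in the "Network inputs / outputs" section below (64 mels, 25 ms window, 10 ms hop, 125-7500 Hz, 96-frame patches). It is not the model zoo's exact pipeline: the sample rate, log scaling and patch overlap used by the model zoo scripts may differ.

```python
# Illustrative sketch (not the model zoo's exact pipeline):
# waveform -> log-mel spectrogram -> (64, 96, 1) patches in (n_mels, n_frames) format.
import numpy as np
import librosa

SR = 16000                 # assumption: 16 kHz mono input, as used by the original Yamnet
N_MELS, N_FRAMES = 64, 96  # patch size expected by Yamnet-256 / Yamnet-1024

def waveform_to_patches(wav: np.ndarray, sr: int = SR) -> np.ndarray:
    """Convert a mono waveform into non-overlapping (64, 96, 1) log-mel patches."""
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=int(0.025 * sr),        # 25 ms window / FFT size
        hop_length=int(0.010 * sr),   # 10 ms hop
        n_mels=N_MELS,
        fmin=125, fmax=7500,          # frequency clipping used by the original Yamnet
    )
    log_mel = np.log(mel + 1e-6)      # assumption: plain log compression
    n_patches = log_mel.shape[1] // N_FRAMES
    patches = [log_mel[:, i * N_FRAMES:(i + 1) * N_FRAMES] for i in range(n_patches)]
    return np.stack(patches)[..., np.newaxis].astype(np.float32)  # (N, 64, 96, 1)
```

Each resulting (64, 96, 1) patch can then be quantized and fed to the model.
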
The model is quantized to int8 using the TensorFlow Lite converter for Yamnet-256, and the ONNX quantizer for Yamnet-1024.

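The quantization scripts themselves live in the model zoo. As a rough sketch, post-training int8 quantization of a Keras Yamnet-256 with the TensorFlow Lite converter could look like the following, where `yamnet_256` (the float Keras model) and `calibration_patches` (an array of real mel patches) are hypothetical names; the exact converter options and calibration set used by the model zoo may differ.

```python
# Minimal post-training int8 quantization sketch with the TensorFlow Lite converter.
# `yamnet_256` (a Keras model) and `calibration_patches` are assumed to exist.
import numpy as np
import tensorflow as tf

def representative_dataset():
    # A few hundred real mel-spectrogram patches from the training set work well.
    for patch in calibration_patches[:200]:
        yield [patch[np.newaxis, ...].astype(np.float32)]  # shape (1, 64, 96, 1)

converter = tf.lite.TFLiteConverter.from_keras_model(yamnet_256)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # full-integer input
converter.inference_output_type = tf.int8  # full-integer output

with open("yamnet_256_int8.tflite", "wb") as f:
    f.write(converter.convert())
```
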
We provide Yamnet-256s for two different datasets: ESC-10, which is a small research dataset, and FSD50K, a large generalist dataset using the AudioSet ontology.
For FSD50K, the model is trained to detect a small subset of the classes included in the dataset. This subset is: Knock, Glass, Gunshots and gunfire, Crying and sobbing, Speech.

The inference time and footprints are very similar in both cases, with the FSD50K model being very slightly smaller and faster.

## Network information

Yamnet-256

| Network Information | Value |
|-------------------------|-----------------|
| Framework | TensorFlow Lite |
| Parameters Yamnet-256 | 130 K |
| Quantization | int8 |
| Provenance | https://tfhub.dev/google/yamnet/1 |

Yamnet-1024

| Network Information | Value |
|-------------------------|-----------------|
| Framework | ONNX |
| Parameters Yamnet-1024 | 3.2 M |
| Quantization | int8 |
| Provenance | https://tfhub.dev/google/yamnet/1 |

## Network inputs / outputs

The network expects spectrogram patches of 96 frames and 64 mels, of shape (64, 96, 1).
Additionally, the original Yamnet converts waveforms to spectrograms using an FFT/window size of 25 ms, a hop length of 10 ms, and clipping frequencies between 125 and 7500 Hz.

Yamnet-256 outputs embedding vectors of size 256. If you use the model zoo scripts to perform transfer learning, a classification head with the specified number of classes will automatically be added to the network.

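For reference, a minimal sketch of running a quantized Yamnet-256 `.tflite` file on one patch with the TensorFlow Lite interpreter is shown below; the file name is a placeholder, and the output is either the 256-d embedding or class scores, depending on whether a classification head was added.

```python
# Sketch: run an int8 Yamnet-256 .tflite on a single (64, 96, 1) mel-spectrogram patch.
# "yamnet_256_int8.tflite" is a placeholder file name.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="yamnet_256_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

patch = np.random.randn(1, 64, 96, 1).astype(np.float32)  # stand-in for a real patch

# If the model has quantized (int8) I/O, convert with the scale/zero-point stored in the file.
scale, zero_point = inp["quantization"]
if scale:
    info = np.iinfo(inp["dtype"])
    patch = np.clip(np.round(patch / scale + zero_point), info.min, info.max).astype(inp["dtype"])

interpreter.set_tensor(inp["index"], patch)
interpreter.invoke()
result = interpreter.get_tensor(out["index"])
print(result.shape)  # 256-d embedding, or class scores if a classification head was added
```
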
Yamnet-1024 is the original Yamnet without the TF preprocessing layers attached, and outputs embedding vectors of size 1024. If you use the model zoo scripts to perform transfer learning, a classification head with the specified number of classes will automatically be added to the network.

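Similarly, the quantized Yamnet-1024 is an ONNX file and can be sanity-checked with onnxruntime. In this sketch the input name and shape are read from the session rather than assumed, since the exact input layout of the exported model is not documented here.

```python
# Sketch: run the quantized Yamnet-1024 ONNX model with onnxruntime.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("yamnet_1024_64x96_tl_qdq_int8.onnx")
inp = sess.get_inputs()[0]
print(inp.name, inp.shape)  # check the expected layout before building the input

# Build a dummy patch matching the reported shape, using 1 for any dynamic dimension.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
patch = np.random.randn(*shape).astype(np.float32)  # QDQ models take float inputs

outputs = sess.run(None, {inp.name: patch})
print(outputs[0].shape)  # 1024-d embedding, or class scores if a classification head was added
```
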
## Recommended platforms

For Yamnet-256

| Platform | Supported | Recommended |
|----------|-----------|-------------|
| STM32U5  | [x]       | [x]         |
| STM32N6  | [x]       | [x]         |

For Yamnet-1024

| Platform | Supported | Recommended |
|----------|-----------|-------------|
| STM32N6  | [x]       | [x]         |

# Performances

## Metrics

Measurements are done with the default STEDGEAI configuration, with the input / output allocated option enabled.

### Reference **NPU** memory footprint based on ESC-10 dataset

| Model | Dataset | Format | Resolution | Series | Internal RAM (KiB) | External RAM (KiB) | Weights Flash (KiB) | STM32Cube.AI version | STEdgeAI Core version |
|----------|------------------|--------|-------------|------------------|------------------|---------------------|-------|----------------------|-------------------------|
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl_int8.tflite) | esc-10 | Int8 | 64x96x1 | STM32N6 | 144 | 0.0 | 176.59 | 10.0.0 | 2.0.0 |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl_qdq_int8.onnx) | esc-10 | Int8 | 64x96x1 | STM32N6 | 144 | 0.0 | 3497.24 | 10.0.0 | 2.0.0 |

### Reference **NPU** inference time based on ESC-10 dataset

| Model | Dataset | Format | Resolution | Board | Execution Engine | Inference time (ms) | Inf / sec | STM32Cube.AI version | STEdgeAI Core version |
|--------|------------------|--------|-------------|------------------|------------------|---------------------|-------|----------------------|-------------------------|
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl_int8.tflite) | esc-10 | Int8 | 64x96x1 | STM32N6570-DK | NPU/MCU | 1.07 | 934.58 | 10.0.0 | 2.0.0 |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl_qdq_int8.onnx) | esc-10 | Int8 | 64x96x1 | STM32N6570-DK | NPU/MCU | 9.88 | 101.21 | 10.0.0 | 2.0.0 |

### Reference **MCU** memory footprint based on ESC-10 dataset

| Model | Format | Resolution | Board / Series | Activation RAM (kB) | Runtime RAM (kB) | Weights Flash (kB) | Code Flash (kB) | Total RAM (kB) | Total Flash (kB) | STM32Cube.AI version |
|-------------------|--------|------------|---------|----------------|-------------|---------------|------------|-------------|-------------|-----------------------|
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl_int8.tflite) | Int8 | 64x96x1 | B-U585I-IOT02A | 109.57 | 7.61 | 135.91 | 57.74 | 117.18 | 193.65 | 10.0.0 |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl_qdq_int8.onnx) | Int8 | 64x96x1 | STM32N6 | 108.59 | 35.41 | 3162.66 | 334.30 | 144.0 | 3496.96 | 10.0.0 |

### Reference inference time based on ESC-10 dataset

| Model | Format | Resolution | Board | Execution Engine | Frequency | Inference time | STM32Cube.AI version |
|-------------------|--------|------------|------------------|------------------|--------------|-----------------|-----------------------|
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl_int8.tflite) | Int8 | 64x96x1 | B-U585I-IOT02A | 1 CPU | 160 MHz | 281.95 ms | 10.0.0 |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl_qdq_int8.onnx) | Int8 | 64x96x1 | STM32N6 | 1 CPU + 1 NPU | 800 MHz / 1000 MHz | 11.949 ms | 10.0.0 |

### Accuracy with ESC-10 dataset

A note on clip-level accuracy: in a traditional AED data processing pipeline, audio is converted to a spectral representation (in this model zoo, mel-spectrograms), which is then cut into patches. Each patch is fed to the inference network, and a label vector is output for each patch. The labels on these patches are then aggregated based on which clip the patch belongs to, to form a single aggregate label vector for each clip. Accuracy is then computed on these aggregate label vectors.

This metric is used instead of patch-level accuracy because patch-level accuracy varies immensely depending on the specific manner in which the spectrogram is cut into patches, and because clip-level accuracy is the metric most often reported in research papers.

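As a minimal sketch of this aggregation, assuming per-patch probability vectors `patch_probs`, a `clip_ids` array mapping each patch to its clip, and a `clip_labels` dict of ground-truth class indices (all hypothetical names), clip-level accuracy could be computed as follows. Mean aggregation is an assumption; the exact rule used by the model zoo evaluation scripts is not specified here.

```python
# Sketch: aggregate patch-level predictions into clip-level accuracy.
# `patch_probs` (n_patches, n_classes), `clip_ids` (n_patches,) and
# `clip_labels` {clip_id: class_index} are assumed to exist.
import numpy as np

def clip_level_accuracy(patch_probs, clip_ids, clip_labels):
    correct = 0
    unique_clips = np.unique(clip_ids)
    for clip in unique_clips:
        # Average the probability vectors of all patches belonging to this clip
        # (assumption: the model zoo may aggregate differently, e.g. majority vote).
        clip_prob = patch_probs[clip_ids == clip].mean(axis=0)
        if clip_prob.argmax() == clip_labels[clip]:
            correct += 1
    return correct / len(unique_clips)
```
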
| Model | Format | Resolution | Clip-level Accuracy |
|-------|--------|------------|----------------|
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl.h5) | float32 | 64x96x1 | 94.9% |
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl_int8.tflite) | int8 | 64x96x1 | 94.9% |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl.h5) | float32 | 64x96x1 | 100.0% |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl_qdq_int8.onnx) | int8 | 64x96x1 | 100.0% |

### Accuracy with FSD50K dataset - Domestic AED use case

In this use case, the model is trained to detect a small subset of the classes included in the dataset. This subset is: Knock, Glass, Gunshots and gunfire, Crying and sobbing, Speech.

Clip-level accuracy is computed in the same way as described for the ESC-10 dataset above.

**IMPORTANT NOTE**: The accuracy of the model with the "unknown class" added is significantly lower when performing inference on a PC. This is because this additional class groups together a large number of other classes (approximately 194 in this specific case), and thus drags performance down a bit.

However, contrary to what the numbers might suggest, online performance on device is, in practice, much improved by this addition in this specific case, so the lower accuracy with the unknown class is expected.

| Model | Format | Resolution | Clip-level Accuracy |
|-------|--------|------------|----------------|
| [Yamnet 256 without unknown class](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/fsd50k/yamnet_256_64x96_tl/without_unknown_class/yamnet_256_64x96_tl.h5) | float32 | 64x96x1 | 86.0% |
| [Yamnet 256 without unknown class](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/fsd50k/yamnet_256_64x96_tl/without_unknown_class/yamnet_256_64x96_tl_int8.tflite) | int8 | 64x96x1 | 87.0% |
| [Yamnet 256 with unknown class](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/fsd50k/yamnet_256_64x96_tl/with_unknown_class/yamnet_256_64x96_tl.h5) | float32 | 64x96x1 | 73.0% |
| [Yamnet 256 with unknown class](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/fsd50k/yamnet_256_64x96_tl/with_unknown_class/yamnet_256_64x96_tl_int8.tflite) | int8 | 64x96x1 | 73.9% |

## Retraining and Integration in a simple example

Please refer to the stm32ai-modelzoo-services GitHub [here](https://github.com/STMicroelectronics/stm32ai-modelzoo-services).