Update README.md
README.md
CHANGED
@@ -1,3 +1,4 @@
|
1 |
# Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands
|
2 |
|
3 |
## Model Details
|
@@ -25,4 +26,74 @@
|
|
25 |
"9": "cake",
|
26 |
"10": "car",
|
27 |
// ... up to 79: "zoo"
|
28 |
-
}
|
1 |
+
````markdown
|
2 |
# Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands
|
3 |
|
4 |
## Model Details
|
|
|
26 |
"9": "cake",
|
27 |
"10": "car",
|
28 |
// ... up to 79: "zoo"
|
29 |
+
}
|
30 |
+
````
|
31 |
+
|
32 |
+
## Training Data
|
33 |
+
|
34 |
+
* Total samples: \~XX,XXX utterances
|
35 |
+
* **Sources:**
|
36 |
+
|
37 |
+
* MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
|
38 |
+
* Google Speech Commands v0.02 for silence and unknown categories
|
39 |
+
* **Preprocessing:**
|
40 |
+
|
41 |
+
* Resampling to 16 kHz
|
42 |
+
* Fixed-length one-second windows with zero-padding or cropping
|
43 |
+
* Data augmentation: time shift (±100 ms), additive background noise (SNR 10–20 dB)
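The preprocessing and augmentation steps listed above could be sketched roughly as follows. This is a minimal pure-Python illustration, not the pipeline actually used for this model: the function names (`fix_length`, `time_shift`, `add_noise`) are hypothetical, waveforms are assumed to be mono lists of floats already resampled to 16 kHz, and the noise clip is assumed to be at least as long as the utterance.

```python
import math
import random

SAMPLE_RATE = 16_000   # 16 kHz, as stated in the preprocessing list
WINDOW = SAMPLE_RATE   # one-second window, in samples


def fix_length(samples, length=WINDOW):
    """Crop or zero-pad a waveform to exactly `length` samples."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


def time_shift(samples, max_shift_ms=100, rng=random):
    """Shift the waveform by a random offset within +/- max_shift_ms.

    The vacated region is filled with zeros, so the length is unchanged.
    """
    n = len(samples)
    max_shift = min(n, SAMPLE_RATE * max_shift_ms // 1000)
    shift = rng.randint(-max_shift, max_shift)
    if shift > 0:   # delay the signal: zeros first
        return [0.0] * shift + list(samples[: n - shift])
    if shift < 0:   # advance the signal: zeros last
        return list(samples[-shift:]) + [0.0] * (-shift)
    return list(samples)


def add_noise(samples, noise, snr_db):
    """Add `noise` to `samples`, scaled so the mix has the requested SNR in dB."""
    if not samples:
        return []
    if len(noise) < len(samples):
        raise ValueError("noise clip must be at least as long as the signal")
    noise = noise[: len(samples)]
    p_signal = sum(x * x for x in samples) / len(samples)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_signal == 0.0 or p_noise == 0.0:
        return list(samples)  # nothing to scale against
    # SNR_dB = 10 * log10(p_signal / (scale^2 * p_noise))
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(samples, noise)]
```

Under these assumptions, a training example would be produced by `fix_length`, then `time_shift`, then `add_noise` with `snr_db` drawn from the 10 to 20 dB range.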
|
44 |
+
|
45 |
+
## Evaluation Results
|
46 |
+
|
47 |
+
* **Test split:** Held-out 20% of the combined dataset (stratified across classes)
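A stratified 20% hold-out like the one described above might be produced along these lines. This is only a sketch: `stratified_split` is a hypothetical helper, and the seed and per-class rounding rule are assumptions rather than details taken from this model card.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), holding out `test_fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class hold-out size
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Splitting per class keeps the class proportions of the test set close to those of the full dataset, which matters when some keywords are much rarer than others.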
|
48 |
+
|
49 |
+
| Metric | Value |
|
50 |
+
| --------- | ------ |
|
51 |
+
| Loss | 0.0685 |
|
52 |
+
| Precision | 0.9862 |
|
53 |
+
| Recall | 0.9862 |
|
54 |
+
| F1-score | 0.9861 |
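The card does not say how precision, recall, and F1 were averaged across the classes. As an illustration only, the sketch below assumes support-weighted averaging (under which recall equals overall accuracy); `weighted_prf` is a hypothetical helper, not the evaluation code used for this model.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over the classes present in y_true."""
    support = Counter(y_true)                      # true examples per class
    predicted = Counter(y_pred)                    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n_true
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n_true / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1
```

For example, `weighted_prf(["cake", "car"], ["cake", "cake"])` scores the `cake` class at precision 0.5 and recall 1.0, and the `car` class at zero on both.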
|
55 |
+
|
56 |
+
## Intended Uses and Limitations
|
57 |
+
|
58 |
+
* **Suitable for:**
|
59 |
+
|
60 |
+
* Real-time keyword spotting on-device
|
61 |
+
* Low-latency voice command detection in noisy environments
|
62 |
+
* **Limitations:**
|
63 |
+
|
64 |
+
* May misclassify under unseen noise conditions or heavy accents
|
65 |
+
* `_unknown_` class may not cover all out-of-vocabulary words; false positives possible
|
66 |
+
* Performance may degrade on dialects or languages underrepresented in training
|
67 |
+
|
68 |
+
## Recommendations for Use
|
69 |
+
|
70 |
+
* **On-device deployment:** Convert the weights to `safetensors` format for safe, fast loading
|
71 |
+
* **Runtime:** \~20M parameters; inference latency \~30 ms on mobile SoC
|
72 |
+
* **Performance tips:**
|
73 |
+
|
74 |
+
* Tune the decision threshold per class to favor either high recall or high precision
|
75 |
+
* Use a simple voice activity detection (VAD) front-end to suppress silent frames
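Per-class thresholding could look like the sketch below. It assumes the classifier's output has already been turned into a `{label: probability}` mapping; `decide`, the default threshold of 0.5, and the choice to fall back to `_unknown_` are illustrative assumptions, not behavior shipped with this model.

```python
def decide(probs, thresholds, default_threshold=0.5, fallback="_unknown_"):
    """Return the top-scoring label if it clears its class threshold, else `fallback`.

    probs:      mapping of label -> probability for one audio window
    thresholds: mapping of label -> minimum probability required to accept it
    """
    label = max(probs, key=probs.get)
    if probs[label] >= thresholds.get(label, default_threshold):
        return label
    return fallback


# Raising the threshold for a class trades recall for precision on that class.
probs = {"cake": 0.6, "car": 0.3, "_unknown_": 0.1}
strict = decide(probs, {"cake": 0.7})    # rejected, falls back to "_unknown_"
lenient = decide(probs, {"cake": 0.5})   # accepted as "cake"
```

Thresholds would typically be chosen on a validation set, one class at a time.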
|
76 |
+
|
77 |
+
## Ethical Considerations and Bias
|
78 |
+
|
79 |
+
* The data covers many languages but is unbalanced; some languages are underrepresented
|
80 |
+
* Potential for misrecognition in low-resource languages or non-standard accents
|
81 |
+
* Not intended for security-sensitive applications (e.g., authentication)
|
82 |
+
|
83 |
+
## Citation
|
84 |
+
|
85 |
+
```bibtex
|
86 |
+
@inproceedings{gong2021ast,
|
87 |
+
title={AST: Audio Spectrogram Transformer},
|
88 |
+
author={Gong, Yuan and Chung, Yu-An and Glass, James},
|
89 |
+
booktitle={Interspeech},
|
90 |
+
year={2021}
|
91 |
+
}
|
92 |
+
```
|
93 |
+
|
94 |
+
---
|
95 |
+
|
96 |
+
*This model card was automatically generated.*
|
97 |
+
|
98 |
+
```
|
99 |
+
```
|