|
---
library_name: transformers
license: bsd-3-clause
base_model: MIT/ast-finetuned-speech-commands-v2
tags:
- generated_from_trainer
datasets:
- audiofolder
metrics:
- precision
- recall
- f1
model-index:
- name: ast-finetuned-speech-commands-v2-finetuned-keyword-spotting-finetuned-keyword-spotting
  results:
  - task:
      name: Audio Classification
      type: audio-classification
    dataset:
      name: audiofolder
      type: audiofolder
      config: default
      split: validation
      args: default
    metrics:
    - name: Precision
      type: precision
      value: 0.9861935383961439
    - name: Recall
      type: recall
      value: 0.9861649413727126
    - name: F1
      type: f1
      value: 0.9861100898918743
---
|
|
|
# Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands |
|
|
|
## Model Details |
|
- **Model name:** `ast-mlcommons-speech-commands` |
|
- **Architecture:** Audio Spectrogram Transformer (AST) |
|
- **Base pre-trained checkpoint:** `MIT/ast-finetuned-speech-commands-v2` (MIT's AST fine-tuned on Google Speech Commands v0.02); see the loading sketch after this list
|
- **Fine-tuning dataset:** Custom dataset drawn from MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02 |
|
- **License:** bsd-3-clause |
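
A minimal loading sketch (not the actual training script), assuming the base checkpoint's classification head is replaced with one sized for this model's label set; the label list below is a truncated illustration of the 80 classes documented in the next section:

```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

base_checkpoint = "MIT/ast-finetuned-speech-commands-v2"

# Illustrative subset of the 80 labels; the real mapping covers ids 0-79
# ("_silence_", "_unknown_", "air", ..., "zoo").
labels = ["_silence_", "_unknown_", "air", "cake", "car", "zoo"]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for i, label in enumerate(labels)}

feature_extractor = AutoFeatureExtractor.from_pretrained(base_checkpoint)
model = AutoModelForAudioClassification.from_pretrained(
    base_checkpoint,
    num_labels=len(labels),
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # swap the base model's head for a freshly initialized one
)
```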
|
|
|
|
|
|
|
## Model Inputs and Outputs |
|
- **Input:** 16 kHz mono audio, 1-second clips (or padded/truncated to 1 sec), converted to log-mel spectrograms with 128 mel bins and 10 ms hop length |
|
- **Output:** Softmax over 80 classes (indices 0–79). Class mapping (an inference example follows the block):
|
```json
{
  "0": "_silence_",
  "1": "_unknown_",
  "2": "air",
  // ... 3–9 omitted for brevity ...
  "9": "cake",
  "10": "car",
  // ... up to 79: "zoo"
}
```
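
As a usage sketch, assuming the model is hosted under a placeholder repository id (`your-username/ast-mlcommons-speech-commands` is illustrative, not a published name):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "your-username/ast-mlcommons-speech-commands"  # placeholder id
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id)

# One second of 16 kHz mono audio (a silent dummy clip here); the feature
# extractor converts it to the log-mel spectrogram the model expects.
waveform = np.zeros(16_000, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 80)

probs = logits.softmax(dim=-1)
predicted_id = int(probs.argmax(dim=-1))
print(model.config.id2label[predicted_id], float(probs[0, predicted_id]))
```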
|
|
|
## Training Data |
|
|
|
* Total samples: ~145,005 utterances
* **Sources:**
  * MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
  * Google Speech Commands v0.02 for the silence and unknown categories
* **Preprocessing** (a sketch follows this list):
  * Resampling to 16 kHz
  * Fixed-length one-second windows with zero-padding or cropping
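
A minimal sketch of this preprocessing, assuming `torchaudio` is used for loading and resampling (the helper below is illustrative, not the original pipeline):

```python
import torch
import torchaudio

TARGET_SR = 16_000
CLIP_SAMPLES = TARGET_SR  # one second at 16 kHz


def load_one_second_clip(path: str) -> torch.Tensor:
    """Load audio, downmix to mono, resample to 16 kHz, and zero-pad or crop to 1 s."""
    waveform, sr = torchaudio.load(path)      # (channels, samples)
    waveform = waveform.mean(dim=0)           # mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    if waveform.numel() < CLIP_SAMPLES:       # zero-pad short clips
        waveform = torch.nn.functional.pad(waveform, (0, CLIP_SAMPLES - waveform.numel()))
    return waveform[:CLIP_SAMPLES]            # crop long clips
```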
|
|
|
## Evaluation Results |
|
|
|
| Metric | Value | |
|
| --------- | ------ | |
|
| Loss | 0.0685 | |
|
| Precision | 0.9862 | |
|
| Recall | 0.9862 | |
|
| F1-score | 0.9861 | |
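
The precision, recall, and F1 values are aggregate scores over the validation split; a sketch of computing the same kind of scores with the `evaluate` library is below (the weighted averaging mode and the label ids are assumptions for illustration):

```python
import evaluate

precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

# Placeholder class ids (0-79); in practice these come from the validation split.
references = [0, 1, 2, 10, 10]
predictions = [0, 1, 2, 10, 9]

for metric in (precision, recall, f1):
    print(metric.compute(predictions=predictions, references=references, average="weighted"))
```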
|
|
|
## Intended Uses and Limitations |
|
|
|
* **Suitable for:**
  * Real-time keyword spotting on-device
  * Low-latency voice command detection in noisy environments
* **Limitations:**
  * May misclassify under unseen noise conditions or heavy accents
  * `_unknown_` class may not cover all out-of-vocabulary words; false positives possible
  * Performance may degrade on dialects or languages underrepresented in training
|
|
|
## Citation |
|
|
|
```bibtex
@inproceedings{gong2021ast,
  title={AST: Audio Spectrogram Transformer},
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle={Proc. Interspeech 2021},
  year={2021}
}
```
|
|