---
library_name: transformers
license: bsd-3-clause
base_model: MIT/ast-finetuned-speech-commands-v2
tags:
- generated_from_trainer
datasets:
- audiofolder
metrics:
- precision
- recall
- f1
model-index:
- name: ast-finetuned-speech-commands-v2-finetuned-keyword-spotting-finetuned-keyword-spotting
  results:
  - task:
      name: Audio Classification
      type: audio-classification
    dataset:
      name: audiofolder
      type: audiofolder
      config: default
      split: validation
      args: default
    metrics:
    - name: Precision
      type: precision
      value: 0.9861935383961439
    - name: Recall
      type: recall
      value: 0.9861649413727126
    - name: F1
      type: f1
      value: 0.9861100898918743
---
# Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands
## Model Details
- **Model name:** `ast-mlcommons-speech-commands`
- **Architecture:** Audio Spectrogram Transformer (AST)
- **Base pre-trained checkpoint:** `MIT/ast-finetuned-speech-commands-v2` (MIT AST fine-tuned on Google Speech Commands v0.02)
- **Fine-tuning dataset:** Custom dataset drawn from MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02
- **License:** bsd-3-clause
## Model Inputs and Outputs
- **Input:** 16 kHz mono audio in 1-second clips (longer or shorter audio is truncated or zero-padded to 1 second), converted to log-mel spectrograms with 128 mel bins and a 10 ms hop length
- **Output:** Softmax over 80 classes (indices 0–79). Class mapping:
```json
{
  "0": "_silence_",
  "1": "_unknown_",
  "2": "air",
  // ... 3–8 omitted for brevity ...
  "9": "cake",
  "10": "car",
  // ... 11–78 omitted for brevity ...
  "79": "zoo"
}
```
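
As a rough illustration of how the 80-way output could be turned into a label, here is a minimal sketch in plain Python. It assumes the model returns one raw logit per class; `ID2LABEL` holds only the indices listed above, and the `class_<n>` placeholder for elided indices, the helper names, and the example logits are all hypothetical.

```python
import math

# Partial mapping: only the indices the card actually lists.
ID2LABEL = {0: "_silence_", 1: "_unknown_", 2: "air", 9: "cake", 10: "car", 79: "zoo"}
NUM_CLASSES = 80


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != NUM_CLASSES:
        raise ValueError(f"expected {NUM_CLASSES} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(NUM_CLASSES), key=probs.__getitem__)
    # Indices elided in the card fall back to a placeholder name (assumption).
    return ID2LABEL.get(best, f"class_{best}"), probs[best]


# Made-up logits: every class at 0.0 except index 10.
example = [0.0] * NUM_CLASSES
example[10] = 5.0
label, prob = decode(example)
print(label, round(prob, 3))
```

In practice the full mapping would come from the checkpoint's own configuration rather than a hand-written dictionary.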
## Training Data
* **Total samples:** \~145,005 utterances
* **Sources:**
  * MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
  * Google Speech Commands v0.02 for silence and unknown categories
* **Preprocessing:**
  * Resampling to 16 kHz
  * Fixed-length one-second windows with zero-padding or cropping
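
The fixed-length step can be sketched as follows. This is a minimal illustration, assuming the audio is already resampled to 16 kHz and held as a flat sequence of float samples. The card does not say where padding or cropping is applied, so padding at the end and keeping the first second are assumptions, as is the helper name `fix_length`.

```python
SAMPLE_RATE = 16_000   # Hz, as stated in the card
CLIP_SECONDS = 1.0     # one-second windows, as stated in the card
TARGET_LEN = int(SAMPLE_RATE * CLIP_SECONDS)


def fix_length(samples, target_len=TARGET_LEN):
    """Zero-pad at the end, or keep only the first target_len samples."""
    if len(samples) >= target_len:
        # Assumption: crop by keeping the start of the clip.
        return list(samples[:target_len])
    # Assumption: pad with trailing zeros.
    return list(samples) + [0.0] * (target_len - len(samples))


short_clip = [0.1] * 12_000   # 0.75 s of made-up audio
long_clip = [0.2] * 20_000    # 1.25 s of made-up audio
print(len(fix_length(short_clip)), len(fix_length(long_clip)))
```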
## Evaluation Results
| Metric | Value |
| --------- | ------ |
| Loss | 0.0685 |
| Precision | 0.9862 |
| Recall | 0.9862 |
| F1-score | 0.9861 |
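
The card does not state how precision, recall, and F1 were averaged across the 80 classes. The reported F1 is slightly below both precision and recall, which a single harmonic mean of those two numbers cannot produce, so some form of per-class averaging seems likely. The sketch below shows support-weighted averaging as one possibility; the averaging mode, the function name, and the toy labels are assumptions, not the author's evaluation code.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)        # true examples per class
    predicted = Counter(y_pred)      # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = recall = f1 = 0.0
    for c in labels:
        p = correct[c] / predicted[c] if predicted[c] else 0.0
        r = correct[c] / support[c] if support[c] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support[c] / n      # classes weighted by true frequency
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


# Toy labels for illustration only.
y_true = ["air", "air", "car", "car", "zoo"]
y_pred = ["air", "car", "car", "car", "zoo"]
p, r, f = weighted_prf(y_true, y_pred)
print(round(p, 4), round(r, 4), round(f, 4))
```

On the toy labels the averaged F1 comes out below both averaged precision and averaged recall, the same pattern seen in the table.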
## Intended Uses and Limitations
* **Suitable for:**
  * Real-time keyword spotting on-device
  * Low-latency voice command detection in noisy environments
* **Limitations:**
  * May misclassify under unseen noise conditions or heavy accents
  * `_unknown_` class may not cover all out-of-vocabulary words; false positives possible
  * Performance may degrade on dialects or languages underrepresented in training
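
The card does not describe any post-processing. One common way to trade missed detections for fewer false positives is to reject low-confidence predictions, sketched below. The function name and the 0.8 threshold are hypothetical and would need tuning on held-out data.

```python
def accept_keyword(label, prob, threshold=0.8,
                   reject=("_silence_", "_unknown_")):
    """Return the keyword if it is confident enough, otherwise None.

    `threshold` is an illustrative value, not one taken from the card.
    """
    if label in reject or prob < threshold:
        return None
    return label


print(accept_keyword("car", 0.93))        # confident keyword
print(accept_keyword("car", 0.41))        # too uncertain, rejected
print(accept_keyword("_unknown_", 0.99))  # filler class, rejected
```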
## Citation
```bibtex
@inproceedings{gong2021ast,
  title={AST: Audio Spectrogram Transformer},
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle={Proc. Interspeech},
  year={2021}
}
```