---
library_name: transformers
license: bsd-3-clause
base_model: MIT/ast-finetuned-speech-commands-v2
tags:
- generated_from_trainer
datasets:
- audiofolder
metrics:
- precision
- recall
- f1
model-index:
- name: >-
    ast-finetuned-speech-commands-v2-finetuned-keyword-spotting-finetuned-keyword-spotting
  results:
  - task:
      name: Audio Classification
      type: audio-classification
    dataset:
      name: audiofolder
      type: audiofolder
      config: default
      split: validation
      args: default
    metrics:
    - name: Precision
      type: precision
      value: 0.9861935383961439
    - name: Recall
      type: recall
      value: 0.9861649413727126
    - name: F1
      type: f1
      value: 0.9861100898918743
---
# Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands

## Model Details
- Model name: `ast-mlcommons-speech-commands`
- Architecture: Audio Spectrogram Transformer (AST)
- Base pre-trained checkpoint: `MIT/ast-finetuned-speech-commands-v2` (MIT AST fine-tuned on Google Speech Commands v0.02)
- Fine-tuning dataset: Custom dataset drawn from the MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02
- License: BSD-3-Clause
## Model Inputs and Outputs
- Input: 16 kHz mono audio in 1-second clips (shorter or longer clips are zero-padded or truncated to 1 s), converted to log-mel spectrograms with 128 mel bins and a 10 ms hop length
- Output: Softmax over 80 classes (indices 0–79). Class mapping:

```json
{
  "0": "_silence_",
  "1": "_unknown_",
  "2": "air",
  // ... 3–8 omitted for brevity ...
  "9": "cake",
  "10": "car",
  // ... 11–78 omitted ...
  "79": "zoo"
}
```
## Training Data
- Total samples: ~145,005 utterances
- Sources:
  - MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
  - Google Speech Commands v0.02 for the `_silence_` and `_unknown_` categories
- Preprocessing:
  - Resampling to 16 kHz
  - Fixed-length one-second windows with zero-padding or cropping
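A minimal sketch of the fixed-length windowing step described above; the function name `fix_length` is illustrative and not taken from the training code:

```python
import numpy as np

TARGET_SR = 16_000        # model sampling rate (Hz)
CLIP_SAMPLES = TARGET_SR  # one-second window at 16 kHz

def fix_length(waveform: np.ndarray) -> np.ndarray:
    """Zero-pad or crop a mono waveform to exactly one second at 16 kHz."""
    if len(waveform) < CLIP_SAMPLES:
        return np.pad(waveform, (0, CLIP_SAMPLES - len(waveform)))
    return waveform[:CLIP_SAMPLES]

# Example: a 0.5 s clip is padded out to 16,000 samples.
assert fix_length(np.zeros(8_000)).shape == (16_000,)
```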
## Evaluation Results

Results on the validation split:

| Metric    | Value  |
|-----------|--------|
| Loss      | 0.0685 |
| Precision | 0.9862 |
| Recall    | 0.9862 |
| F1-score  | 0.9861 |
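The averaging strategy behind these scores is not stated; the near-identical precision and recall are consistent with a weighted average over classes, as in the sketch below (the label arrays are placeholders for predictions on the validation split):

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder labels; in practice these come from running the model over the
# validation split and taking the argmax of the logits.
y_true = [0, 1, 2, 2, 10, 79]
y_pred = [0, 1, 2, 9, 10, 79]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```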
## Intended Uses and Limitations
Suitable for:
- Real-time keyword spotting on-device
- Low-latency voice command detection in noisy environments
Limitations:
- May misclassify under unseen noise conditions or heavy accents
- The `_unknown_` class may not cover all out-of-vocabulary words, so false positives are possible
- Performance may degrade on dialects or languages underrepresented in the training data
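One common mitigation for `_unknown_` false positives is to reject low-confidence predictions; a sketch is below (the 0.6 threshold is illustrative and should be tuned on held-out data):

```python
import torch

def predict_with_rejection(logits: torch.Tensor, threshold: float = 0.6) -> int:
    """Map a (1, 80) logits tensor to a class id, falling back to
    _unknown_ (id 1) when the top softmax probability is below threshold."""
    probs = torch.softmax(logits, dim=-1)
    confidence, predicted_id = probs.max(dim=-1)
    if confidence.item() < threshold:
        return 1  # _unknown_
    return int(predicted_id)

# Example with dummy logits for the 80-class head.
print(predict_with_rejection(torch.randn(1, 80)))
```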
## Citation
```bibtex
@inproceedings{gong2021ast,
  title={AST: Audio Spectrogram Transformer},
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle={Proc. Interspeech 2021},
  year={2021}
}
```