---
library_name: transformers
license: bsd-3-clause
base_model: MIT/ast-finetuned-speech-commands-v2
tags:
- generated_from_trainer
datasets:
- audiofolder
metrics:
- precision
- recall
- f1
model-index:
- name: >-
    ast-finetuned-speech-commands-v2-finetuned-keyword-spotting-finetuned-keyword-spotting
  results:
  - task:
      name: Audio Classification
      type: audio-classification
    dataset:
      name: audiofolder
      type: audiofolder
      config: default
      split: validation
      args: default
    metrics:
    - name: Precision
      type: precision
      value: 0.9861935383961439
    - name: Recall
      type: recall
      value: 0.9861649413727126
    - name: F1
      type: f1
      value: 0.9861100898918743
---
# Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands

## Model Details
- Model name: `ast-mlcommons-speech-commands`
- Architecture: Audio Spectrogram Transformer (AST)
- Base pre-trained checkpoint: `MIT/ast-finetuned-speech-commands-v2` (MIT AST fine-tuned on Google Speech Commands v0.02)
- Fine-tuning dataset: Custom dataset drawn from the MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02
- License: BSD-3-Clause
## Model Inputs and Outputs
- Input: 16 kHz mono audio in 1-second clips (shorter or longer clips are zero-padded or truncated to 1 s), converted to log-mel spectrograms with 128 mel bins and a 10 ms hop length
- Output: Softmax over 80 classes (indices 0–79). Class mapping:

```json
{
  "0": "_silence_",
  "1": "_unknown_",
  "2": "air",
  // ... 3–8 omitted for brevity ...
  "9": "cake",
  "10": "car",
  // ... 11–78 omitted ...
  "79": "zoo"
}
```
## Training Data
- Total samples: ~145,005 utterances
- Sources:
  - MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
  - Google Speech Commands v0.02 for the `_silence_` and `_unknown_` categories
- Preprocessing:
  - Resampling to 16 kHz
  - Fixed-length one-second windows with zero-padding or cropping
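A minimal sketch of the fixed-length windowing step described above; the function name `fix_length` is illustrative and not taken from the training code:

```python
import numpy as np

TARGET_SR = 16_000        # model sampling rate (Hz)
CLIP_SAMPLES = TARGET_SR  # one-second window at 16 kHz

def fix_length(waveform: np.ndarray) -> np.ndarray:
    """Zero-pad or crop a mono waveform to exactly one second at 16 kHz."""
    if len(waveform) < CLIP_SAMPLES:
        return np.pad(waveform, (0, CLIP_SAMPLES - len(waveform)))
    return waveform[:CLIP_SAMPLES]

# Example: a 0.5 s clip is padded out to 16,000 samples.
assert fix_length(np.zeros(8_000)).shape == (16_000,)
```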
## Evaluation Results

Results on the validation split:

| Metric    | Value  |
|-----------|--------|
| Loss      | 0.0685 |
| Precision | 0.9862 |
| Recall    | 0.9862 |
| F1-score  | 0.9861 |
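The averaging strategy behind these scores is not stated; the near-identical precision and recall are consistent with a weighted average over classes, as in the sketch below (the label arrays are placeholders for predictions on the validation split):

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder labels; in practice these come from running the model over the
# validation split and taking the argmax of the logits.
y_true = [0, 1, 2, 2, 10, 79]
y_pred = [0, 1, 2, 9, 10, 79]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```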
## Intended Uses and Limitations
Suitable for:
- Real-time keyword spotting on-device
- Low-latency voice command detection in noisy environments
Limitations:
- May misclassify under unseen noise conditions or heavy accents
- The `_unknown_` class may not cover all out-of-vocabulary words, so false positives are possible
- Performance may degrade on dialects or languages underrepresented in the training data
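One common mitigation for `_unknown_` false positives is to reject low-confidence predictions; a sketch is below (the 0.6 threshold is illustrative and should be tuned on held-out data):

```python
import torch

def predict_with_rejection(logits: torch.Tensor, threshold: float = 0.6) -> int:
    """Map a (1, 80) logits tensor to a class id, falling back to
    _unknown_ (id 1) when the top softmax probability is below threshold."""
    probs = torch.softmax(logits, dim=-1)
    confidence, predicted_id = probs.max(dim=-1)
    if confidence.item() < threshold:
        return 1  # _unknown_
    return int(predicted_id)

# Example with dummy logits for the 80-class head.
print(predict_with_rejection(torch.randn(1, 80)))
```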
## Citation
```bibtex
@inproceedings{gong2021ast,
  title={AST: Audio Spectrogram Transformer},
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle={Proc. Interspeech 2021},
  year={2021}
}
```