---
library_name: transformers
license: bsd-3-clause
base_model: MIT/ast-finetuned-speech-commands-v2
tags:
  - generated_from_trainer
datasets:
  - audiofolder
metrics:
  - precision
  - recall
  - f1
model-index:
  - name: >-
      ast-finetuned-speech-commands-v2-finetuned-keyword-spotting-finetuned-keyword-spotting
    results:
      - task:
          name: Audio Classification
          type: audio-classification
        dataset:
          name: audiofolder
          type: audiofolder
          config: default
          split: validation
          args: default
        metrics:
          - name: Precision
            type: precision
            value: 0.9861935383961439
          - name: Recall
            type: recall
            value: 0.9861649413727126
          - name: F1
            type: f1
            value: 0.9861100898918743
---

# Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands

## Model Details

- Model name: ast-mlcommons-speech-commands
- Architecture: Audio Spectrogram Transformer (AST)
- Base pre-trained checkpoint: `MIT/ast-finetuned-speech-commands-v2` (AST fine-tuned on Google Speech Commands v0.02)
- Fine-tuning dataset: a custom dataset drawn from the MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02
- License: BSD-3-Clause

## Model Inputs and Outputs

- Input: 16 kHz mono audio; clips are padded or truncated to 1 second and converted to log-mel spectrograms with 128 mel bins and a 10 ms hop length
- Output: softmax over 80 classes (indices 0–79); an excerpt of the class mapping is shown below, followed by an inference sketch

```text
{
  "0": "_silence_",
  "1": "_unknown_",
  "2": "air",
  // ... 3–9 omitted for brevity ...
  "9": "cake",
  "10": "car",
  // ... up to 79: "zoo"
}
```
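The following is a minimal inference sketch with 🤗 Transformers. The repository ID and the local file `keyword.wav` are placeholders, not confirmed by this card; substitute the actual Hub repo and your own audio clip.

```python
import torch
import torchaudio
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# Placeholder repository ID -- replace with the actual Hub repo for this checkpoint.
MODEL_ID = "mahmoudmamdouh13/ast-finetuned-speech-commands-v2-finetuned-keyword-spotting-finetuned-keyword-spotting"

feature_extractor = ASTFeatureExtractor.from_pretrained(MODEL_ID)
model = ASTForAudioClassification.from_pretrained(MODEL_ID)
model.eval()

# Load a clip and convert it to the 16 kHz mono audio the model expects.
waveform, sr = torchaudio.load("keyword.wav")  # assumed local file
waveform = waveform.mean(dim=0)                # collapse to mono
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

# The feature extractor produces the 128-bin log-mel spectrogram described above.
inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred_id = int(logits.argmax(dim=-1))
print(model.config.id2label[pred_id])
```

The same checkpoint can also be loaded through the higher-level `pipeline("audio-classification", model=...)` helper if you prefer a one-liner.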
    

## Training Data

- Total samples: ~145,005 utterances
- Sources:
  - MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
  - Google Speech Commands v0.02 for the `_silence_` and `_unknown_` categories
- Preprocessing (a sketch follows this list):
  - Resampling to 16 kHz
  - Fixed-length one-second windows with zero-padding or cropping
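The fixed-length windowing described above can be reproduced along these lines. This is an illustrative sketch, not the exact training script; the helper name `to_fixed_length` is hypothetical.

```python
import numpy as np

TARGET_SR = 16_000
CLIP_SAMPLES = TARGET_SR  # one second at 16 kHz

def to_fixed_length(waveform: np.ndarray) -> np.ndarray:
    """Zero-pad or crop a mono waveform to exactly one second (hypothetical helper)."""
    if waveform.shape[0] < CLIP_SAMPLES:
        return np.pad(waveform, (0, CLIP_SAMPLES - waveform.shape[0]))
    return waveform[:CLIP_SAMPLES]

# Example: a 0.6 s clip is padded, a 1.4 s clip is cropped.
short = to_fixed_length(np.random.randn(9_600).astype(np.float32))
long = to_fixed_length(np.random.randn(22_400).astype(np.float32))
assert short.shape == long.shape == (CLIP_SAMPLES,)
```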

## Evaluation Results

| Metric    | Value  |
|-----------|--------|
| Loss      | 0.0685 |
| Precision | 0.9862 |
| Recall    | 0.9862 |
| F1-score  | 0.9861 |
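Metrics of this kind can be computed with scikit-learn as sketched below. The averaging mode is an assumption (the card does not state which one was used), and the labels are toy stand-ins rather than the actual validation split.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy stand-ins: in practice y_true / y_pred come from running the model
# over the validation split (class indices 0-79).
y_true = [2, 9, 10, 1, 0, 2]
y_pred = [2, 9, 10, 1, 0, 9]

# "weighted" averaging is an assumption, not confirmed by the card.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```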

## Intended Uses and Limitations

- Suitable for:
  - Real-time, on-device keyword spotting
  - Low-latency voice command detection in noisy environments
- Limitations:
  - May misclassify under unseen noise conditions or heavy accents
  - The `_unknown_` class may not cover every out-of-vocabulary word, so false positives are possible (a simple confidence-threshold mitigation is sketched after this list)
  - Performance may degrade on dialects or languages underrepresented in the training data
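One common mitigation for out-of-vocabulary false positives, not part of the released model, is to reject low-confidence predictions and map them to `_unknown_`. A minimal sketch, assuming a hypothetical threshold of 0.7 that you would tune on held-out data:

```python
import torch
import torch.nn.functional as F

REJECT_THRESHOLD = 0.7  # hypothetical value; tune on held-out data

def classify_or_reject(logits: torch.Tensor, id2label: dict) -> str:
    """Return the predicted label, or `_unknown_` when the softmax score is low."""
    probs = F.softmax(logits, dim=-1)
    score, idx = probs.max(dim=-1)
    if score.item() < REJECT_THRESHOLD:
        return "_unknown_"
    return id2label[int(idx)]

# Example with dummy logits over 80 classes.
dummy_logits = torch.randn(80)
print(classify_or_reject(dummy_logits, {i: f"class_{i}" for i in range(80)}))
```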

## Citation

```bibtex
@inproceedings{gong2021ast,
  title={AST: Audio Spectrogram Transformer},
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle={Proc. Interspeech 2021},
  year={2021}
}
```