|
---
library_name: transformers
license: bsd-3-clause
base_model: MIT/ast-finetuned-speech-commands-v2
tags:
- generated_from_trainer
datasets:
- audiofolder
metrics:
- precision
- recall
- f1
model-index:
- name: ast-finetuned-speech-commands-v2-finetuned-keyword-spotting-finetuned-keyword-spotting
  results:
  - task:
      name: Audio Classification
      type: audio-classification
    dataset:
      name: audiofolder
      type: audiofolder
      config: default
      split: validation
      args: default
    metrics:
    - name: Precision
      type: precision
      value: 0.9861935383961439
    - name: Recall
      type: recall
      value: 0.9861649413727126
    - name: F1
      type: f1
      value: 0.9861100898918743
---
|
|
|
# Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands |
|
|
|
## Model Details |
|
- **Model name:** `ast-mlcommons-speech-commands` |
|
- **Architecture:** Audio Spectrogram Transformer (AST) |
|
- **Base pre-trained checkpoint:** `MIT/ast-finetuned-speech-commands-v2` (MIT's AST fine-tuned on Google Speech Commands v0.02); see the loading sketch after this list
|
- **Fine-tuning dataset:** Custom dataset drawn from MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02 |
|
- **License:** bsd-3-clause |
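
A minimal loading sketch (not the actual training script), assuming the base checkpoint's classification head is replaced with one sized for this model's label set; the label list below is a truncated illustration of the 80 classes documented in the next section:

```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

base_checkpoint = "MIT/ast-finetuned-speech-commands-v2"

# Illustrative subset of the 80 labels; the real mapping covers ids 0-79
# ("_silence_", "_unknown_", "air", ..., "zoo").
labels = ["_silence_", "_unknown_", "air", "cake", "car", "zoo"]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for i, label in enumerate(labels)}

feature_extractor = AutoFeatureExtractor.from_pretrained(base_checkpoint)
model = AutoModelForAudioClassification.from_pretrained(
    base_checkpoint,
    num_labels=len(labels),
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # swap the base model's head for a freshly initialized one
)
```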
|
|
|
|
|
|
|
## Model Inputs and Outputs |
|
- **Input:** 16 kHz mono audio, 1-second clips (or padded/truncated to 1 sec), converted to log-mel spectrograms with 128 mel bins and 10 ms hop length |
|
- **Output:** Softmax over 80 classes (indices 0–79). Class mapping (an inference example follows the block):
|
```json
{
  "0": "_silence_",
  "1": "_unknown_",
  "2": "air",
  // ... 3–9 omitted for brevity ...
  "9": "cake",
  "10": "car",
  // ... up to 79: "zoo"
}
```
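
As a usage sketch, assuming the model is hosted under a placeholder repository id (`your-username/ast-mlcommons-speech-commands` is illustrative, not a published name):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "your-username/ast-mlcommons-speech-commands"  # placeholder id
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id)

# One second of 16 kHz mono audio (a silent dummy clip here); the feature
# extractor converts it to the log-mel spectrogram the model expects.
waveform = np.zeros(16_000, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 80)

probs = logits.softmax(dim=-1)
predicted_id = int(probs.argmax(dim=-1))
print(model.config.id2label[predicted_id], float(probs[0, predicted_id]))
```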
|
|
|
## Training Data |
|
|
|
* Total samples: ~145,005 utterances
* **Sources:**
  * MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
  * Google Speech Commands v0.02 for the silence and unknown categories
* **Preprocessing** (a sketch follows this list):
  * Resampling to 16 kHz
  * Fixed-length one-second windows with zero-padding or cropping
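
A minimal sketch of this preprocessing, assuming `torchaudio` is used for loading and resampling (the helper below is illustrative, not the original pipeline):

```python
import torch
import torchaudio

TARGET_SR = 16_000
CLIP_SAMPLES = TARGET_SR  # one second at 16 kHz


def load_one_second_clip(path: str) -> torch.Tensor:
    """Load audio, downmix to mono, resample to 16 kHz, and zero-pad or crop to 1 s."""
    waveform, sr = torchaudio.load(path)      # (channels, samples)
    waveform = waveform.mean(dim=0)           # mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    if waveform.numel() < CLIP_SAMPLES:       # zero-pad short clips
        waveform = torch.nn.functional.pad(waveform, (0, CLIP_SAMPLES - waveform.numel()))
    return waveform[:CLIP_SAMPLES]            # crop long clips
```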
|
|
|
## Evaluation Results |
|
|
|
| Metric | Value | |
|
| --------- | ------ | |
|
| Loss | 0.0685 | |
|
| Precision | 0.9862 | |
|
| Recall | 0.9862 | |
|
| F1-score | 0.9861 | |
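
The precision, recall, and F1 values are aggregate scores over the validation split; a sketch of computing the same kind of scores with the `evaluate` library is below (the weighted averaging mode and the label ids are assumptions for illustration):

```python
import evaluate

precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

# Placeholder class ids (0-79); in practice these come from the validation split.
references = [0, 1, 2, 10, 10]
predictions = [0, 1, 2, 10, 9]

for metric in (precision, recall, f1):
    print(metric.compute(predictions=predictions, references=references, average="weighted"))
```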
|
|
|
## Intended Uses and Limitations |
|
|
|
* **Suitable for:**
  * Real-time keyword spotting on-device
  * Low-latency voice command detection in noisy environments
* **Limitations:**
  * May misclassify under unseen noise conditions or heavy accents
  * `_unknown_` class may not cover all out-of-vocabulary words; false positives possible
  * Performance may degrade on dialects or languages underrepresented in training
|
|
|
## Citation |
|
|
|
```bibtex
@inproceedings{gong2021ast,
  title={AST: Audio Spectrogram Transformer},
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle={Proc. Interspeech 2021},
  year={2021}
}
```
|
|