- **Base pre-trained checkpoint:** MIT AST fine-tuned on Google Speech Commands v0.02
- **Fine-tuning dataset:** Custom dataset drawn from the MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02
- **License:** Apache 2.0
- **Framework:** PyTorch
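
For reference, the base checkpoint is distributed through the Hugging Face Hub; a minimal loading sketch follows. The Hub id `MIT/ast-finetuned-speech-commands-v2` and the label count are assumptions, not taken from this card.

```python
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# Assumed Hub id for the MIT AST checkpoint fine-tuned on Speech Commands v0.02
BASE_ID = "MIT/ast-finetuned-speech-commands-v2"

feature_extractor = ASTFeatureExtractor.from_pretrained(BASE_ID)
model = ASTForAudioClassification.from_pretrained(
    BASE_ID,
    num_labels=42,                 # hypothetical: keyword set plus _silence_ and _unknown_
    ignore_mismatched_sizes=True,  # re-initializes the classifier head for the new labels
)
```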

## Use Case

- **Primary use case:** Keyword spotting and spoken-word classification in multilingual voice interfaces
- **Territory:** Real-time, small-vocabulary speech recognition on embedded and mobile devices
- **Out of scope:** Large-vocabulary continuous speech recognition, speaker identification, emotion recognition

## Model Inputs and Outputs

- **Input:** 16 kHz mono audio in 1-second clips (shorter or longer audio is zero-padded or truncated to 1 s), converted to log-mel spectrograms with 128 mel bins and a 10 ms hop length; a preprocessing sketch follows below
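
A minimal preprocessing sketch matching the input description, using `torchaudio`. The 25 ms analysis window (`n_fft=400`) is an assumption; the card only specifies the mel-bin count and hop length.

```python
import torch
import torchaudio

def to_log_mel(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Convert raw audio to the 1 s, 128-bin log-mel input described above."""
    if sample_rate != 16_000:  # resample to 16 kHz
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    if waveform.size(0) > 1:   # downmix to mono
        waveform = waveform.mean(dim=0, keepdim=True)
    # Zero-pad or truncate to exactly one second (16,000 samples)
    num_samples = waveform.size(1)
    if num_samples < 16_000:
        waveform = torch.nn.functional.pad(waveform, (0, 16_000 - num_samples))
    else:
        waveform = waveform[:, :16_000]
    # 128 mel bins, 10 ms hop (160 samples); the 25 ms window (400 samples) is assumed
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000, n_fft=400, hop_length=160, n_mels=128
    )(waveform)
    return torchaudio.transforms.AmplitudeToDB()(mel)
```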

## Training Data

* Total samples: \~145,005 utterances
* **Sources:**

  * MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
* **Preprocessing:**

  * Resampling to 16 kHz
  * Fixed-length one-second windows with zero-padding or cropping
  * Data augmentation: time shift (±100 ms) and additive background noise (SNR 10–20 dB), as sketched below
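
A sketch of what that augmentation could look like in PyTorch. White noise stands in for the background-noise recordings, which the card does not specify.

```python
import torch

def augment(wave: torch.Tensor, sample_rate: int = 16_000) -> torch.Tensor:
    """Random time shift of up to ±100 ms plus additive noise at 10-20 dB SNR."""
    # Time shift: roll by up to ±100 ms worth of samples
    max_shift = int(0.1 * sample_rate)
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    wave = torch.roll(wave, shifts=shift, dims=-1)
    # Additive noise at a random SNR drawn from [10, 20] dB; white noise is a
    # stand-in for whatever background recordings were actually used
    snr_db = 10.0 + 10.0 * torch.rand(1)
    noise = torch.randn_like(wave)
    signal_power = wave.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return wave + scale * noise
```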

## Evaluation Results

* **Test split:** Held-out 10% of the combined dataset, stratified across classes (see the split sketch below)
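
A sketch of how such a stratified 10% split could be produced with scikit-learn; the file and label lists are hypothetical placeholders.

```python
from sklearn.model_selection import train_test_split

# Hypothetical parallel lists of clip paths and class labels
files = [f"clip_{i}.wav" for i in range(40)]
labels = ["yes", "no", "_silence_", "_unknown_"] * 10

train_files, test_files, train_labels, test_labels = train_test_split(
    files, labels, test_size=0.10, stratify=labels, random_state=0
)
```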

| Metric | Value |
| --------- | ------ |

## Limitations

* The `_unknown_` class may not cover all out-of-vocabulary words, so false positives are possible
* Performance may degrade on dialects or languages underrepresented in the training data

## Recommendations for Use

* **On-device deployment:** Convert the checkpoint to `safetensors` format to reduce file size and improve loading speed (see the first sketch after this list)
* **Runtime:** \~20M parameters; inference latency of \~30 ms on a mobile SoC
* **Performance tips:**

  * Tune the decision threshold per class for high-recall vs. high-precision scenarios
  * Use a simple VAD front-end to suppress silent frames (see the second sketch after this list)
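
First sketch: converting a checkpoint to `safetensors` for the on-device recommendation above, assuming the fine-tuned weights were saved as a plain `state_dict` with `torch.save`; the file names are hypothetical.

```python
import torch
from safetensors.torch import save_file

# Load a fine-tuned state dict saved with torch.save (hypothetical path)
state_dict = torch.load("ast_keyword_spotter.pt", map_location="cpu")
# safetensors requires contiguous tensors
state_dict = {name: tensor.contiguous() for name, tensor in state_dict.items()}
save_file(state_dict, "ast_keyword_spotter.safetensors")
```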
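
Second sketch: the two performance tips, as per-class decision thresholds with an `_unknown_` fallback plus a crude energy-gate VAD. The threshold values and `unknown_idx` are placeholders to be tuned on validation data.

```python
import torch

def is_speech(wave: torch.Tensor, energy_threshold: float = 1e-4) -> bool:
    """Crude energy-based VAD: skip inference entirely on near-silent clips."""
    return wave.pow(2).mean().item() > energy_threshold

def classify(logits: torch.Tensor, thresholds: torch.Tensor, unknown_idx: int) -> int:
    """Return the highest-probability class that clears its own threshold,
    falling back to _unknown_ when no class does."""
    probs = logits.softmax(dim=-1)
    passing = probs >= thresholds
    if not passing.any():
        return unknown_idx
    return int(torch.where(passing, probs, torch.zeros_like(probs)).argmax())
```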

## Ethical Considerations and Bias

* The data covers several languages but is unbalanced: some languages are underrepresented
* Potential for misrecognition in low-resource languages or with non-standard accents
* Not intended for security-sensitive applications (e.g., authentication)

## Citation

```bibtex