- **Base pre-trained checkpoint:** MIT AST fine-tuned on Google Speech Commands v0.02
- **Fine-tuning dataset:** Custom dataset drawn from the MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02
- **License:** Apache 2.0
- **Framework:** PyTorch
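
For reference, the base checkpoint is distributed through the Hugging Face Hub; a minimal loading sketch follows. The Hub id `MIT/ast-finetuned-speech-commands-v2` and the label count are assumptions, not taken from this card.

```python
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# Assumed Hub id for the MIT AST checkpoint fine-tuned on Speech Commands v0.02
BASE_ID = "MIT/ast-finetuned-speech-commands-v2"

feature_extractor = ASTFeatureExtractor.from_pretrained(BASE_ID)
model = ASTForAudioClassification.from_pretrained(
    BASE_ID,
    num_labels=42,                 # hypothetical: keyword set plus _silence_ and _unknown_
    ignore_mismatched_sizes=True,  # re-initializes the classifier head for the new labels
)
```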

## Use Case

- **Primary use case:** Keyword spotting and spoken-word classification in multilingual voice interfaces
- **Territory:** Real-time, small-vocabulary speech recognition on embedded and mobile devices
- **Out of scope:** Large-vocabulary continuous speech recognition, speaker identification, emotion recognition

## Model Inputs and Outputs

- **Input:** 16 kHz mono audio in 1-second clips (shorter or longer audio is zero-padded or truncated to 1 s), converted to log-mel spectrograms with 128 mel bins and a 10 ms hop length; a preprocessing sketch follows below
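
A minimal preprocessing sketch matching the input description, using `torchaudio`. The 25 ms analysis window (`n_fft=400`) is an assumption; the card only specifies the mel-bin count and hop length.

```python
import torch
import torchaudio

def to_log_mel(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Convert raw audio to the 1 s, 128-bin log-mel input described above."""
    if sample_rate != 16_000:  # resample to 16 kHz
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    if waveform.size(0) > 1:   # downmix to mono
        waveform = waveform.mean(dim=0, keepdim=True)
    # Zero-pad or truncate to exactly one second (16,000 samples)
    num_samples = waveform.size(1)
    if num_samples < 16_000:
        waveform = torch.nn.functional.pad(waveform, (0, 16_000 - num_samples))
    else:
        waveform = waveform[:, :16_000]
    # 128 mel bins, 10 ms hop (160 samples); the 25 ms window (400 samples) is assumed
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000, n_fft=400, hop_length=160, n_mels=128
    )(waveform)
    return torchaudio.transforms.AmplitudeToDB()(mel)
```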

## Training Data

* Total samples: \~145,005 utterances
* **Sources:**

  * MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
* **Preprocessing:**

  * Resampling to 16 kHz
  * Fixed-length one-second windows with zero-padding or cropping
  * Data augmentation: time shift (±100 ms) and additive background noise (SNR 10–20 dB), as sketched below
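
A sketch of what that augmentation could look like in PyTorch. White noise stands in for the background-noise recordings, which the card does not specify.

```python
import torch

def augment(wave: torch.Tensor, sample_rate: int = 16_000) -> torch.Tensor:
    """Random time shift of up to ±100 ms plus additive noise at 10-20 dB SNR."""
    # Time shift: roll by up to ±100 ms worth of samples
    max_shift = int(0.1 * sample_rate)
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    wave = torch.roll(wave, shifts=shift, dims=-1)
    # Additive noise at a random SNR drawn from [10, 20] dB; white noise is a
    # stand-in for whatever background recordings were actually used
    snr_db = 10.0 + 10.0 * torch.rand(1)
    noise = torch.randn_like(wave)
    signal_power = wave.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return wave + scale * noise
```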

## Evaluation Results

* **Test split:** Held-out 10% of the combined dataset, stratified across classes (see the split sketch below)
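
A sketch of how such a stratified 10% split could be produced with scikit-learn; the file and label lists are hypothetical placeholders.

```python
from sklearn.model_selection import train_test_split

# Hypothetical parallel lists of clip paths and class labels
files = [f"clip_{i}.wav" for i in range(40)]
labels = ["yes", "no", "_silence_", "_unknown_"] * 10

train_files, test_files, train_labels, test_labels = train_test_split(
    files, labels, test_size=0.10, stratify=labels, random_state=0
)
```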

| Metric | Value |
| --------- | ------ |

## Limitations

* The `_unknown_` class may not cover all out-of-vocabulary words, so false positives are possible
* Performance may degrade on dialects or languages underrepresented in the training data

## Recommendations for Use

* **On-device deployment:** Convert the checkpoint to `safetensors` format to reduce file size and improve loading speed (see the first sketch after this list)
* **Runtime:** \~20M parameters; inference latency of \~30 ms on a mobile SoC
* **Performance tips:**

  * Tune the decision threshold per class for high-recall vs. high-precision scenarios
  * Use a simple VAD front-end to suppress silent frames (see the second sketch after this list)
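
First sketch: converting a checkpoint to `safetensors` for the on-device recommendation above, assuming the fine-tuned weights were saved as a plain `state_dict` with `torch.save`; the file names are hypothetical.

```python
import torch
from safetensors.torch import save_file

# Load a fine-tuned state dict saved with torch.save (hypothetical path)
state_dict = torch.load("ast_keyword_spotter.pt", map_location="cpu")
# safetensors requires contiguous tensors
state_dict = {name: tensor.contiguous() for name, tensor in state_dict.items()}
save_file(state_dict, "ast_keyword_spotter.safetensors")
```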
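
Second sketch: the two performance tips, as per-class decision thresholds with an `_unknown_` fallback plus a crude energy-gate VAD. The threshold values and `unknown_idx` are placeholders to be tuned on validation data.

```python
import torch

def is_speech(wave: torch.Tensor, energy_threshold: float = 1e-4) -> bool:
    """Crude energy-based VAD: skip inference entirely on near-silent clips."""
    return wave.pow(2).mean().item() > energy_threshold

def classify(logits: torch.Tensor, thresholds: torch.Tensor, unknown_idx: int) -> int:
    """Return the highest-probability class that clears its own threshold,
    falling back to _unknown_ when no class does."""
    probs = logits.softmax(dim=-1)
    passing = probs >= thresholds
    if not passing.any():
        return unknown_idx
    return int(torch.where(passing, probs, torch.zeros_like(probs)).argmax())
```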

## Ethical Considerations and Bias

* The data covers several languages but is unbalanced: some languages are underrepresented
* Potential for misrecognition in low-resource languages or with non-standard accents
* Not intended for security-sensitive applications (e.g., authentication)

## Citation

```bibtex