Commit 96a3ecb by mahmoudmamdouh13 · verified · 1 parent: 1222a80

Update README.md

Files changed (1)
  1. README.md +28 -90
README.md CHANGED
@@ -1,90 +1,28 @@
- ---
- library_name: transformers
- license: bsd-3-clause
- base_model: MIT/ast-finetuned-speech-commands-v2
- tags:
- - generated_from_trainer
- datasets:
- - audiofolder
- metrics:
- - precision
- - recall
- - f1
- model-index:
- - name: ast-finetuned-keyword-spotting
-   results:
-   - task:
-       name: Audio Classification
-       type: audio-classification
-     dataset:
-       name: audiofolder
-       type: audiofolder
-       config: default
-       split: validation
-       args: default
-     metrics:
-     - name: Precision
-       type: precision
-       value: 0.9861935383961439
-     - name: Recall
-       type: recall
-       value: 0.9861649413727126
-     - name: F1
-       type: f1
-       value: 0.9861100898918743
- ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # ast-finetuned-keyword-spotting
-
- This model is a fine-tuned version of [MIT/ast-finetuned-speech-commands-v2](https://huggingface.co/MIT/ast-finetuned-speech-commands-v2) on the audiofolder dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0685
- - Precision: 0.9862
- - Recall: 0.9862
- - F1: 0.9861
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 5e-05
- - train_batch_size: 64
- - eval_batch_size: 64
- - seed: 42
- - optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 3
- - mixed_precision_training: Native AMP
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 |
- |:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|
- | 0.0682 | 1.0 | 1630 | 0.0976 | 0.9752 | 0.9751 | 0.9749 |
- | 0.0179 | 2.0 | 3260 | 0.0743 | 0.9847 | 0.9846 | 0.9846 |
- | 0.0008 | 3.0 | 4890 | 0.0685 | 0.9862 | 0.9862 | 0.9861 |
-
-
- ### Framework versions
-
- - Transformers 4.51.3
- - Pytorch 2.7.0+cu128
- - Datasets 3.6.0
- - Tokenizers 0.21.1
 
+ # Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands
+
+ ## Model Details
+ - **Model name:** `my-ast-mlcommons-speech-commands`
+ - **Architecture:** Audio Spectrogram Transformer (AST)
+ - **Base pre-trained checkpoint:** MIT AST fine-tuned on Google Speech Commands v0.02
+ - **Fine-tuning dataset:** Custom dataset drawn from the MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02
+ - **License:** Apache 2.0
+ - **Framework:** PyTorch
+
+ ## Use Case
+ - **Primary use case:** Keyword spotting and spoken-word classification in multilingual voice interfaces
+ - **Target domain:** Real-time, small-vocabulary speech recognition on embedded and mobile devices
+ - **Out of scope:** Large-vocabulary continuous speech recognition, speaker identification, emotion recognition
+
+ ## Model Inputs and Outputs
+ - **Input:** 16 kHz mono audio in 1-second clips (other lengths are padded or truncated to 1 second), converted to log-mel spectrograms with 128 mel bins and a 10 ms hop length
+ - **Output:** Softmax over 80 classes (indices 0–79). Class mapping:
+ ```json
+ {
+ "0": "_silence_",
+ "1": "_unknown_",
+ "2": "air",
+ // ... 3–8 omitted for brevity ...
+ "9": "cake",
+ "10": "car",
+ // ... up to 79: "zoo"
+ }
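
The input and output contract described in the new README can be illustrated with a small, framework-free sketch. It is not the author's preprocessing code: the helper names (`pad_or_truncate`, `softmax`, `top_prediction`) are hypothetical, zero-padding at the end of the clip is an assumption (the card only says "padded/truncated"), and the label map contains only the entries the card actually lists.

```python
import math

SAMPLE_RATE = 16_000        # Hz, as stated in the model card
CLIP_SAMPLES = SAMPLE_RATE  # 1-second clips -> 16,000 samples

# Only the entries the model card shows; the remaining indices are elided there.
ID2LABEL = {0: "_silence_", 1: "_unknown_", 2: "air", 9: "cake", 10: "car"}


def pad_or_truncate(samples, target_len=CLIP_SAMPLES):
    """Return exactly target_len samples.

    Assumption: extra samples are cut from the tail and short clips are
    zero-padded at the end; the card does not say which side is used.
    """
    clipped = list(samples[:target_len])
    return clipped + [0.0] * (target_len - len(clipped))


def softmax(logits):
    """Numerically stable softmax over a flat sequence of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class.

    Indices missing from the partial map fall back to a placeholder name.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label.get(best, f"class_{best}"), probs[best]
```

In practice the log-mel conversion and the forward pass would come from the model's own feature extractor and checkpoint; the sketch only covers clip-length normalization and turning an 80-way score vector into a label.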