---
language: vie
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form transcription
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 6.66
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 4.18
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 14.09
---

# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chunkformer-masked-chunking-conformer-for/speech-recognition-on-common-voice-vi)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi?p=chunkformer-masked-chunking-conformer-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chunkformer-masked-chunking-conformer-for/speech-recognition-on-vivos)](https://paperswithcode.com/sota/speech-recognition-on-vivos?p=chunkformer-masked-chunking-conformer-for)

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](paper.pdf)

---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---
<a name = "description" ></a>
## Model Description
**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model was fine-tuned on approximately **3000 hours** of public Vietnamese speech data drawn from diverse datasets. A list of the datasets can be found [**HERE**](dataset.tsv).

**Please note that only the train splits of these datasets were used to fine-tune the model.**

---
<a name = "implementation" ></a>
## Documentation and Implementation
The [Documentation]() and [Implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.

---
<a name = "benchmark" ></a>
## Benchmark Results
We evaluate the models using **Word Error Rate (WER)**. To ensure a consistent and fair comparison, we manually apply **text normalization**, including the handling of numbers, uppercase letters, and punctuation.

1. **Public Models**:
| No. | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
|-----|------------------------------------------------------------------------|---------|-------|--------------|---------------|------|
| 1 | **ChunkFormer** | 110M | 4.18 | 6.66 | 14.09 | **8.31** |
| 2 | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
| 3 | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
| 4 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B | 8.81 | 15.45 | 20.41 | 14.89 |
| 5 | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M | 15.05 | 10.78 | 31.62 | 19.16 |
| 6 | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M | 13.46 | 23.52 | 21.64 | 19.54 |

2. **Private Models (API)**:
| No. | Model | VLSP - Task 1 |
|-----|--------|---------------|
| 1 | **ChunkFormer** | **14.1** |
| 2 | Viettel | 14.5 |
| 3 | Google | 19.5 |
| 4 | FPT | 28.8 |

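To make the metric concrete, here is a minimal, self-contained sketch of WER with a toy normalizer. The `normalize` function below is a simplified stand-in for the benchmark's actual normalization of numbers, casing, and punctuation, which is more involved:

```python
import re

def normalize(text: str) -> str:
    """Toy normalizer: lowercase and strip punctuation.
    (Stand-in for the fuller normalization used in the benchmark.)"""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer(normalize("Xin chào, Việt Nam!"), normalize("xin chào việt nam")))  # → 0.0
```

Without the normalization step, the same pair would score a nonzero WER purely from casing and punctuation mismatches, which is why it is applied before all comparisons above.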
---
<a name = "usage" ></a>
## Quick Usage
To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```
2. **Download the Model Checkpoint from Hugging Face**
```bash
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"
```
or
```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie
```
This will download the model checkpoint into the `chunkformer-large-vie` folder inside your `chunkformer` directory.

3. **Run the model**
```bash
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --max_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
`--max_duration` is specified in seconds; the default is 1800.
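Since `--max_duration` caps the amount of audio (in seconds) decoded in one pass, it should cover the length of the input file. A small helper using Python's standard `wave` module can read that length from a PCM WAV header (the file name below is illustrative):

```python
import math
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

# e.g. pass math.ceil(wav_duration_seconds("audio.wav")) as --max_duration
```

For compressed or non-WAV formats, the file would first need conversion (e.g. with ffmpeg) before this header-based check applies.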
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
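Each output line pairs a start and end timestamp with a text segment. To post-process the transcript programmatically, a small parser along these lines could be used (a sketch based on the example output above; the exact format may differ between versions of the repository):

```python
import re

# Matches lines like "[00:00:01.200] - [00:00:02.400]: some text"
LINE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]\s*-\s*"
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]:\s*(.*)"
)

def parse_segment(line: str):
    """Return (start_seconds, end_seconds, text) for one output line."""
    m = LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")

    def to_seconds(h, mnt, s, ms):
        return int(h) * 3600 + int(mnt) * 60 + int(s) + int(ms) / 1000

    g = m.groups()
    return to_seconds(*g[0:4]), to_seconds(*g[4:8]), g[8]

print(parse_segment("[00:00:01.200] - [00:00:02.400]: this is a transcription example"))
# → (1.2, 2.4, 'this is a transcription example')
```

The segment boundaries come back as plain floats, so they can be fed directly into audio slicing or subtitle-generation tools.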
**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage).

---
<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

---
<a name = "contact"></a>
## Contact
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)