---
language: vie
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 6.66
    source:
      name: Common Voice Vi Leaderboard
      url: https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 4.18
    source:
      name: Vivos Leaderboard
      url: https://paperswithcode.com/sota/speech-recognition-on-vivos
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 14.09
---

# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
<style>
img {
 display: inline;
}
</style>
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chunkformer-masked-chunking-conformer-for/speech-recognition-on-common-voice-vi)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi?p=chunkformer-masked-chunking-conformer-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chunkformer-masked-chunking-conformer-for/speech-recognition-on-vivos)](https://paperswithcode.com/sota/speech-recognition-on-vivos?p=chunkformer-masked-chunking-conformer-for)

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/) 
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://arxiv.org/abs/2502.14673)
[![Model size](https://img.shields.io/badge/Params-110M-lightgrey#model-badge)](#description)


**!!! ATTENTION: Input audio must be MONO (1 channel) with a 16,000 Hz sample rate.**
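
If your audio is in a different format, channel count, or sample rate, one way to convert it is with `torchaudio` (a minimal sketch, assuming `torchaudio` is installed; file names are placeholders):
```python
import torchaudio

# Load the audio, downmix to mono, and resample to the 16 kHz the model expects.
wav, sr = torchaudio.load("input.wav")               # shape: (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                  # multi-channel -> mono
wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
torchaudio.save("audio_16k_mono.wav", wav, sample_rate=16000)
```
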
---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---
<a name = "description" ></a>
## Model Description
**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model has been fine-tuned on approximately **3000 hours** of public Vietnamese speech data sourced from diverse datasets. A list of datasets can be found [**HERE**](dataset.tsv). 

**!!! Please note that only the \[train-subset\] was used for fine-tuning the model.**

---
<a name = "implementation" ></a>
## Documentation and Implementation
The documentation and implementation of ChunkFormer are publicly available in the [ChunkFormer GitHub repository](https://github.com/khanld/chunkformer).

---
<a name = "benchmark" ></a>
## Benchmark Results
We evaluate the models using **Word Error Rate (WER)**. To ensure consistency and fairness in comparison, we manually apply **Text Normalization**, including the handling of numbers, uppercase letters, and punctuation.

1. **Public Models**:

| No. | Model                                                                  | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
|-----|------------------------------------------------------------------------|---------|-------|--------------|---------------|------|
| 1   | **ChunkFormer**                                                            | 110M    | 4.18   | 6.66           | 14.09             | **8.31**    |
| 2   | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large)  | 1.55B   | 4.67  | 8.14         | 13.75         | 8.85 |
| 3   | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M     | 10.77 | 18.34        | 13.33         | 14.15 |
| 4   | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B   | 8.81     | 15.45            | 20.41          | 14.89    |
| 5   | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M     | 15.05 | 10.78        | 31.62             | 19.16    |
| 6   | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M   | 13.46     | 23.52            | 21.64          | 19.54    |

2. **Private Models (API)**:

| No. | Model  | VLSP - Task 1 |
|-----|--------|---------------|
| 1   | **ChunkFormer** | **14.1**             |
| 2   | Viettel     | 14.5          |
| 3   | Google  | 19.5          |
| 4   | FPT   | 28.8          |
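
For reference, here is a minimal sketch of this kind of normalization and WER computation using the `jiwer` package (not part of this repository; the exact normalization rules used for the tables above may differ, e.g. numbers are not expanded here):
```python
import re
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

reference  = "Xin chào, tôi là ChunkFormer!"
hypothesis = "xin chào tôi là chunkformer"
print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # 0.0
```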

---
<a name = "usage" ></a>
## Quick Usage
To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt   
```
2. **Download the Model Checkpoint from Hugging Face**
```bash
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"
```
or
```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie
```
Either command downloads the model checkpoint into a `chunkformer-large-vie` folder inside your `chunkformer` directory.
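
The same download can also be done from Python with the `huggingface_hub` package installed above (the target folder mirrors the CLI example):
```python
from huggingface_hub import snapshot_download

# Fetch every file of the model repository into a local folder,
# matching the destination used by the CLI command above.
snapshot_download(repo_id="khanhld/chunkformer-large-vie", local_dir="./chunkformer-large-vie")
```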

3. **Run the model**
```bash
# --total_batch_duration is given in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
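
If you redirect the output above to a file, a small helper like the following (a sketch, not part of the repository; file names are placeholders) can turn the timestamped lines into an SRT subtitle file:
```python
import re

# Matches lines of the form "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text"
LINE = re.compile(r"\[(\d{2}:\d{2}:\d{2}\.\d{3})\] - \[(\d{2}:\d{2}:\d{2}\.\d{3})\]: (.+)")

def to_srt(transcript_path: str, srt_path: str) -> None:
    with open(transcript_path, encoding="utf-8") as fin, open(srt_path, "w", encoding="utf-8") as fout:
        index = 1
        for line in fin:
            match = LINE.match(line.strip())
            if not match:
                continue
            start, end, text = match.groups()
            # SRT uses a comma as the millisecond separator in timestamps.
            fout.write(f"{index}\n{start.replace('.', ',')} --> {end.replace('.', ',')}\n{text}\n\n")
            index += 1

to_srt("transcript.txt", "subtitles.srt")
```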
**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage)

---
<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

---
<a name = "contact"></a>
## Contact
- [email protected]
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)