---
language: vie
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form transcription
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 6.66
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 4.18
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 14.09
---

# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chunkformer-masked-chunking-conformer-for/speech-recognition-on-common-voice-vi)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi?p=chunkformer-masked-chunking-conformer-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chunkformer-masked-chunking-conformer-for/speech-recognition-on-vivos)](https://paperswithcode.com/sota/speech-recognition-on-vivos?p=chunkformer-masked-chunking-conformer-for)

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](paper.pdf)

---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---
<a name = "description" ></a>
## Model Description
**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model was fine-tuned on approximately **3000 hours** of public Vietnamese speech data drawn from diverse datasets. A list of the datasets can be found [**HERE**](dataset.tsv).

**Please note that only the train splits of these datasets were used to fine-tune the model.**

---
<a name = "implementation" ></a>
## Documentation and Implementation
The [Documentation]() and [Implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.

---
<a name = "benchmark" ></a>
## Benchmark Results
We evaluate the models using **Word Error Rate (WER)**. To ensure a consistent and fair comparison, we manually apply **text normalization**, including the handling of numbers, uppercase letters, and punctuation.

1. **Public Models**:
| No. | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
|-----|------------------------------------------------------------------------|---------|-------|--------------|---------------|------|
| 1 | **ChunkFormer** | 110M | 4.18 | 6.66 | 14.09 | **8.31** |
| 2 | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
| 3 | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
| 4 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B | 8.81 | 15.45 | 20.41 | 14.89 |
| 5 | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M | 15.05 | 10.78 | 31.62 | 19.16 |
| 6 | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M | 13.46 | 23.52 | 21.64 | 19.54 |

2. **Private Models (API)**:
| No. | Model | VLSP - Task 1 |
|-----|--------|---------------|
| 1 | **ChunkFormer** | **14.1** |
| 2 | Viettel | 14.5 |
| 3 | Google | 19.5 |
| 4 | FPT | 28.8 |

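To make the metric concrete, here is a minimal, self-contained sketch of WER with a toy normalizer. The `normalize` function below is a simplified stand-in for the benchmark's actual normalization of numbers, casing, and punctuation, which is more involved:

```python
import re

def normalize(text: str) -> str:
    """Toy normalizer: lowercase and strip punctuation.
    (Stand-in for the fuller normalization used in the benchmark.)"""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer(normalize("Xin chào, Việt Nam!"), normalize("xin chào việt nam")))  # → 0.0
```

Without the normalization step, the same pair would score a nonzero WER purely from casing and punctuation mismatches, which is why it is applied before all comparisons above.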
---
<a name = "usage" ></a>
## Quick Usage
To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```
2. **Download the Model Checkpoint from Hugging Face**
```bash
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"
```
or
```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie
```
This will download the model checkpoint into the `chunkformer-large-vie` folder inside your `chunkformer` directory.

3. **Run the model**
```bash
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --max_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
`--max_duration` is specified in seconds; the default is 1800.
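Since `--max_duration` caps the amount of audio (in seconds) decoded in one pass, it should cover the length of the input file. A small helper using Python's standard `wave` module can read that length from a PCM WAV header (the file name below is illustrative):

```python
import math
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

# e.g. pass math.ceil(wav_duration_seconds("audio.wav")) as --max_duration
```

For compressed or non-WAV formats, the file would first need conversion (e.g. with ffmpeg) before this header-based check applies.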
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
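Each output line pairs a start and end timestamp with a text segment. To post-process the transcript programmatically, a small parser along these lines could be used (a sketch based on the example output above; the exact format may differ between versions of the repository):

```python
import re

# Matches lines like "[00:00:01.200] - [00:00:02.400]: some text"
LINE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]\s*-\s*"
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]:\s*(.*)"
)

def parse_segment(line: str):
    """Return (start_seconds, end_seconds, text) for one output line."""
    m = LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")

    def to_seconds(h, mnt, s, ms):
        return int(h) * 3600 + int(mnt) * 60 + int(s) + int(ms) / 1000

    g = m.groups()
    return to_seconds(*g[0:4]), to_seconds(*g[4:8]), g[8]

print(parse_segment("[00:00:01.200] - [00:00:02.400]: this is a transcription example"))
# → (1.2, 2.4, 'this is a transcription example')
```

The segment boundaries come back as plain floats, so they can be fed directly into audio slicing or subtitle-generation tools.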
**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage).

---
<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

---
<a name = "contact"></a>
## Contact
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)