---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
license: cc-by-4.0
language:
- fr
- en
library_name: hibiki
tags:
- speech
- translation
- streaming
metrics:
- bleu
---

# Model Card for Hibiki

[Hibiki](https://github.com/kyutai-labs/hibiki) is a model for streaming speech translation (also known as *simultaneous* translation). Unlike offline translation, where one waits for the end of the source utterance before starting to translate, Hibiki adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. As the user speaks, Hibiki generates natural speech in the target language, optionally with voice transfer, along with a text translation.
Hibiki currently only supports French-to-English translation.

## Model Details

This is the model referred to as *Hibiki-M* (for *Mobile*) in our [paper](https://arxiv.org/abs/2502.03382): a 1.7B-parameter hierarchical Transformer producing speech and text tokens at a framerate of 12.5Hz, with audio generated at a 1.1kbps bitrate.
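
For intuition on where the 1.1kbps figure comes from, here is a quick back-of-the-envelope check. It assumes 8 residual codebooks with 2048 entries each (the values reported for Moshi's Mimi codec); these numbers are illustrative assumptions rather than values read from this checkpoint.

```python
# Rough bitrate check with assumed codec settings (8 codebooks of 2048 entries,
# as reported for Mimi in the Moshi paper); not read from this checkpoint.
frame_rate_hz = 12.5         # token frames per second
num_codebooks = 8            # residual codebooks per frame (assumption)
bits_per_codebook = 11       # log2(2048) bits for a 2048-entry codebook (assumption)

bitrate_bps = frame_rate_hz * num_codebooks * bits_per_codebook
print(bitrate_bps)           # 1100.0 bits/s, i.e. about 1.1 kbps
```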

### Model Description

Hibiki is a decoder-only model for simultaneous speech translation. Hibiki leverages the multistream architecture of [Moshi](https://arxiv.org/abs/2410.00037) to model source and target speech jointly. This allows Hibiki to continuously process the input stream while generating the target speech. Hibiki produces text and audio tokens at a constant framerate of 12.5Hz, which allows for a continuous output audio stream along with a timestamped text translation. Since Hibiki relies on simple temperature sampling, it is compatible with batching, unlike models that rely on complex inference policies. Moreover, the fidelity of Hibiki's voice transfer can be controlled by changing the coefficient of the classifier-free guidance: a larger coefficient increases voice similarity, but excessive coefficients can lead to worse translations.
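
As a rough illustration of the classifier-free guidance knob mentioned above, the sketch below blends conditional and unconditional logits with a coefficient before temperature sampling. The function and tensor names are hypothetical; the actual inference code lives in the repository linked below.

```python
import torch

def cfg_sample(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               cfg_coef: float = 3.0,
               temperature: float = 0.8) -> torch.Tensor:
    """Hypothetical sketch: blend conditional and unconditional logits with a
    classifier-free guidance coefficient, then sample with temperature.
    A larger cfg_coef pushes sampling toward the speaker-conditioned
    distribution (better voice similarity); too large a value can hurt the
    translation itself."""
    logits = uncond_logits + cfg_coef * (cond_logits - uncond_logits)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # one token id per batch item

# Toy usage over a 2048-entry codebook with a batch of 4 sequences: plain
# temperature sampling keeps everything batchable.
cond = torch.randn(4, 2048)
uncond = torch.randn(4, 2048)
tokens = cfg_sample(cond, uncond)  # shape (4, 1)
```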

- **Developed by:** Kyutai
- **Model type:** Simultaneous speech-to-speech and speech-to-text translation.
- **Language(s) (NLP):** French-to-English
- **License:** CC-BY 4.0

### Model Sources

- **Repository:** [repo](https://github.com/kyutai-labs/hibiki)
- **Paper:** [paper](https://arxiv.org/abs/2502.03382)
- **Examples:** [demo](https://hf.co/spaces/kyutai/hibiki-samples)

## Uses

### Direct Use

The model can be used for streaming translation from French to English in real-time settings, or for batched simultaneous translation of many input sequences. It is robust to noisy conditions and was trained on sequences of up to 120 seconds.
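
To make the real-time setting concrete: at a 12.5Hz framerate and a 24kHz sample rate (Mimi's operating rate, taken here as an assumption), each step consumes an 80ms chunk of source audio (1920 samples) and emits the next chunk of translated audio. The loop below is only a schematic, with a stub in place of the real model call; see the repository README for the actual inference entry points.

```python
import numpy as np

SAMPLE_RATE = 24_000                         # assumed codec sample rate for this sketch
FRAME_RATE = 12.5                            # Hibiki's token framerate
FRAME_SIZE = int(SAMPLE_RATE / FRAME_RATE)   # 1920 samples, i.e. 80 ms per step

def translate_step(frame: np.ndarray) -> np.ndarray:
    """Stub standing in for one decoding step: 80 ms of French audio in,
    80 ms of English audio out. Hypothetical, not the real API."""
    return np.zeros_like(frame)

def stream_translate(source: np.ndarray):
    """Feed the source frame by frame and yield output frames as they are
    produced, the way a streaming system would."""
    for start in range(0, len(source) - FRAME_SIZE + 1, FRAME_SIZE):
        yield translate_step(source[start:start + FRAME_SIZE])

# Toy run: 10 seconds of silence in, 125 output frames out.
out_frames = list(stream_translate(np.zeros(10 * SAMPLE_RATE, dtype=np.float32)))
print(len(out_frames))  # 125
```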

### Downstream Use

Some components of the model can be used independently or repurposed relatively easily.
For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at a framerate of 12.5Hz and a bitrate of 1.1kbps, which makes it particularly well suited for training speech language models or text-to-speech systems. As for the main Hibiki architecture, supporting other language pairs would require finetuning.
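
As a sketch of why such a token grid is convenient for speech language modeling: each frame is a small stack of discrete codes, and a Moshi-style model sums one embedding per codebook to obtain a single vector per frame. The snippet below shows that pattern on dummy tokens; the dimensions are illustrative assumptions, not this model's configuration.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not taken from the Hibiki checkpoint.
num_codebooks, codebook_size, d_model = 8, 2048, 512

# One embedding table per codebook; the per-frame input vector is their sum,
# the pattern used by Moshi-style multistream models.
embeddings = nn.ModuleList(
    [nn.Embedding(codebook_size, d_model) for _ in range(num_codebooks)]
)

# Dummy codec-like tokens: batch of 2, 125 frames (10 s at 12.5 Hz), 8 codebooks.
tokens = torch.randint(codebook_size, (2, 125, num_codebooks))

frame_vectors = sum(embeddings[k](tokens[..., k]) for k in range(num_codebooks))
print(frame_vectors.shape)  # torch.Size([2, 125, 512]): one vector per audio frame
```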

### Out-of-Scope Use

The model is not intended to be used to impersonate other people or for any other malicious use.

## How to Get Started with the Model

See the main [README](https://github.com/kyutai-labs/hibiki) file.

## Training Details

### Training Data

- Textual data: The underlying [Helium](https://huggingface.co/kyutai/helium-1-preview-2b) model is trained on a mix of data including Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.

- Audio data

  - **Unsupervised audio dataset:** used for pre-training, this is a collection of 7M hours of readily available audio content in English and 450k hours in French, following the preprocessing and recipe of [Moshi](https://arxiv.org/abs/2410.00037).
  - **Synthetic translation dataset:** around 40k hours of parallel French-English data synthesized with *contextual alignment* (see [Section 3.2](https://arxiv.org/pdf/2502.03382)) with various levels of speaker similarity.
  - **Translation finetuning:** a 900-hour mixture of a resynthesized version of [CVSS-T](https://github.com/google-research-datasets/cvss) and synthetic long-form utterances.

### Training procedure and hyper-parameters

The different stages of the training procedure are detailed in the paper, along with the hyper-parameters.

### Compute Infrastructure

The final model was trained on 48 Nvidia H100 GPUs.

## Citation

```bibtex
@misc{labiausse2025hibiki,
      title={High-Fidelity Simultaneous Speech-To-Speech Translation},
      author={Tom Labiausse and Laurent Mazaré and Edouard Grave and Patrick Pérez and Alexandre Défossez and Neil Zeghidour},
      year={2025},
      eprint={2502.03382},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.03382},
}
```

## Model Card Authors

Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, Neil Zeghidour