lmz committed on
Commit b34416f · verified · 1 Parent(s): 1708070

Upload 4 files

Files changed (4)
  1. README.md +114 -3
  2. config.toml +60 -0
  3. [email protected] +3 -0
  4. tokenizer_spm_48k_multi6_2.model +3 -0
README.md CHANGED
@@ -1,3 +1,114 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
+ license: cc-by-4.0
+ language:
+ - fr
+ - en
+ library_name: hibiki
+ tags:
+ - speech
+ - translation
+ - streaming
+ metrics:
+ - bleu
+ ---
+
+ # Model Card for Hibiki
+
+ [Hibiki](https://github.com/kyutai-labs/hibiki) is a model for streaming speech translation (also known as *simultaneous* translation). Unlike offline translation, where one waits for the end of the source utterance before starting to translate, Hibiki adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. As the user speaks, Hibiki generates natural speech in the target language, optionally with voice transfer, along with a text translation.
+ Hibiki currently only supports French-to-English translation.
+
+ ## Model Details
+
+ This is the model referred to as *Hibiki-M* (for *Mobile*) in our [paper](https://arxiv.org/abs/2502.03382): a 1.7B-parameter
+ hierarchical Transformer producing speech and text tokens at a framerate of 12.5Hz, with audio generated at a
+ 1.1kbps bitrate.
+
+ ### Model Description
+
+ Hibiki is a decoder-only model for simultaneous speech translation. Hibiki leverages the multistream architecture of [Moshi](https://arxiv.org/abs/2410.00037)
+ to model source and target speech jointly. This allows Hibiki to continuously process the input stream while generating
+ the target speech. Hibiki produces text and audio tokens at a constant framerate of 12.5Hz, which allows for a continuous
+ output audio stream along with timestamped text translation. Since Hibiki relies on simple temperature sampling,
+ it is compatible with batching, unlike models that rely on complex inference policies. Moreover, the fidelity of Hibiki's
+ voice transfer can be controlled by changing the classifier-free guidance coefficient: a larger coefficient
+ increases voice similarity, but an excessive coefficient can lead to worse translations.
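+
+ As a rough, generic sketch of how such a guidance coefficient is typically applied at sampling time (this is not Hibiki's actual inference code; the function and variable names below are illustrative only):
+
+ ```python
+ import torch
+
+ def cfg_logits(logits_cond: torch.Tensor, logits_uncond: torch.Tensor, cfg_coef: float) -> torch.Tensor:
+     # Blend conditional and unconditional logits: cfg_coef = 1.0 disables guidance,
+     # larger values push generation towards the conditioning (here, the source voice).
+     return logits_uncond + cfg_coef * (logits_cond - logits_uncond)
+
+ def sample_next_token(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
+     # Plain temperature sampling over the vocabulary dimension.
+     probs = torch.softmax(logits / temperature, dim=-1)
+     return torch.multinomial(probs, num_samples=1)
+ ```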
+
+ - **Developed by:** Kyutai
+ - **Model type:** Simultaneous speech-to-speech and speech-to-text translation
+ - **Language(s) (NLP):** French-to-English
+ - **License:** CC-BY 4.0
+
+ ### Model Sources
+
+ - **Repository:** [repo](https://github.com/kyutai-labs/hibiki)
+ - **Paper:** [paper](https://arxiv.org/abs/2502.03382)
+ - **Examples:** [demo](https://hf.co/spaces/kyutai/hibiki-samples)
+
+ ## Uses
+
+ ### Direct Use
+
+ The model can be used for streaming translation from French to English in real-time settings, or for batched
+ simultaneous translation of many input sequences. It is robust to noisy conditions and is trained on sequences up
+ to 120 seconds.
+
+ ### Downstream Use
+
+ Some components of the model can be used independently or repurposed relatively easily.
+ For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12.5Hz and a bitrate of 1.1kbps, which makes it particularly well suited to training speech language models or text-to-speech systems. Regarding the main Hibiki architecture,
+ supporting other language pairs would require finetuning.
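+
+ As a quick sanity check on that bitrate, here is a back-of-the-envelope computation; the codebook count and size are assumptions carried over from the Moshi paper rather than stated in this card:
+
+ ```python
+ # Assumed: 8 codebooks of 2048 entries each (11 bits per codebook), 12.5 frames per second.
+ frame_rate_hz = 12.5
+ num_codebooks = 8
+ bits_per_codebook = 11  # log2(2048)
+ print(frame_rate_hz * num_codebooks * bits_per_codebook)  # 1100.0 bits/s, i.e. about 1.1 kbps
+ ```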
+
+ ### Out-of-Scope Use
+
+ The model is not intended to be used to impersonate other people, or for malicious use of any kind.
+
+ ## How to Get Started with the Model
+
+ See the main [README](https://github.com/kyutai-labs/hibiki) file.
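+
+ As a minimal, hypothetical sketch of fetching the files uploaded in this commit with `huggingface_hub` (the repo id is a placeholder; running inference itself is covered by the upstream README):
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ repo_id = "kyutai/<this-repo>"  # placeholder: replace with this repository's id on the Hub
+
+ config_path = hf_hub_download(repo_id, "config.toml")
+ weights_path = hf_hub_download(repo_id, "[email protected]")
+ tokenizer_path = hf_hub_download(repo_id, "tokenizer_spm_48k_multi6_2.model")
+ print(config_path, weights_path, tokenizer_path)
+ ```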
+
+ ## Training Details
+
+ ### Training Data
+
+ - Textual data: The underlying [Helium](https://huggingface.co/kyutai/helium-1-preview-2b) model is trained on a mix of
+ data including Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.
+
+ - Audio data
+
+   - **Unsupervised audio dataset:** used for pre-training, this is a collection of 7M hours of readily available audio content in English and 450k hours in French, following the preprocessing and recipe of [Moshi](https://arxiv.org/abs/2410.00037).
+   - **Synthetic translation dataset:** around 40k hours of parallel French-English data synthesized with *contextual alignment* (see [Section 3.2](https://arxiv.org/pdf/2502.03382)) with various levels of speaker similarity.
+   - **Translation finetuning:** a 900-hour mixture of a resynthesized version of [CVSS-T](https://github.com/google-research-datasets/cvss) and synthetic long-form utterances.
+
+ ### Training procedure and hyper-parameters
+
+ The different stages of the training procedure are detailed in the paper, along with the hyper-parameters.
+
+ ### Compute Infrastructure
+
+ The final model was trained on 48 Nvidia H100 GPUs.
+
+ ## Citation
+
+ ```
+ @misc{labiausse2025hibiki,
+   title={High-Fidelity Simultaneous Speech-To-Speech Translation},
+   author={Tom Labiausse and Laurent Mazaré and Edouard Grave and Patrick Pérez and Alexandre Défossez and Neil Zeghidour},
+   year={2025},
+   eprint={2502.03382},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2502.03382},
+ }
+ ```
+
+ ## Model Card Authors
+
+ Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, Neil Zeghidour
config.toml ADDED
@@ -0,0 +1,60 @@
+ mimi_name = "[email protected]"
+ moshi_name = "[email protected]"
+ tokenizer_name = "tokenizer_spm_48k_multi6_2.model"
+
+ [model]
+ text_in_vocab_size = 48001
+ text_out_vocab_size = 48000
+ audio_vocab_size = 2049
+ audio_codebooks = 16
+
+ [model.transformer]
+ d_model = 2048
+ num_heads = 16
+ num_layers = 16
+ dim_feedforward = 8192
+ causal = true
+ norm_first = true
+ bias_ff = false
+ bias_attn = false
+ context = 500
+ max_period = 100000
+ use_conv_block = false
+ use_conv_bias = true
+ gating = "silu"
+ norm = "RmsNorm"
+ positional_embedding = "Rope"
+ conv_layout = false
+ conv_kernel_size = 3
+ kv_repeat = 1
+ max_seq_len = 4096
+
+ [model.depformer]
+ num_slices = 8
+
+ [model.depformer.transformer]
+ d_model = 1024
+ num_heads = 16
+ num_layers = 6
+ dim_feedforward = 4096
+ causal = true
+ norm_first = true
+ bias_ff = false
+ bias_attn = false
+ context = 32
+ max_period = 10000
+ use_conv_block = false
+ use_conv_bias = true
+ gating = "silu"
+ norm = "RmsNorm"
+ positional_embedding = "None"
+ conv_layout = false
+ conv_kernel_size = 3
+ kv_repeat = 1
+ max_seq_len = 4096
+
+ [model.conditioners.description]
+ type = "Lut"
+ n_bins = 31
+ dim = 16
+ possible_values = ["very_bad", "bad", "neutral", "good", "very_good"]
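
A minimal sketch of reading the configuration above with Python's standard `tomllib` module (assumes Python 3.11+; the keys simply mirror the file as committed):

```python
import tomllib

with open("config.toml", "rb") as f:  # tomllib requires binary mode
    cfg = tomllib.load(f)

print(cfg["model"]["transformer"]["d_model"])   # 2048
print(cfg["model"]["depformer"]["num_slices"])  # 8
print(cfg["tokenizer_name"])                    # tokenizer_spm_48k_multi6_2.model
```
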
[email protected] ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:31c14cf365353131094e8248150c6fe58e8642cf91899c50d9e450f861630e55
+ size 384644900
tokenizer_spm_48k_multi6_2.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c22110fb855aa049e17346ea2e88355bdd664f06cbfd09948380ab5e85b39697
+ size 857314