Update README.md
README.md
CHANGED
@@ -1,31 +1,34 @@
|
|
1 |
-
|
2 |
-
|
3 |
-
|
4 |
-
-
|
5 |
-
|
|
|
|
|
6 |
[Update]
|
7 |
-
8.29
|
8 |
-
8.31
|
9 |
-
9.
|
10 |
-
9.12
|
11 |
|
12 |
|
13 |
-
##
|
14 |
-
|
15 |
-
Additionally, this project draws inspiration from OpenAI's GPT-4o and the capabilities shown in its demo videos of educational scenarios.
|
16 |
|
|
|
17 |
|
18 |
-
## Team
|
19 |
-
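Zhejiang Jingzhunxue is a company backed by Alibaba investment that focuses on education-related hardware and software products (AI-assisted learning devices). The Jingzhunxue AI team is dedicated to using AI to deliver proactive learning that approaches or even surpasses the human teaching experience, while driving down technical costs so that it is affordable for everyone.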
|
20 |
|
|
|
|
|
21 |
|
22 |
-
|
23 |
-
|
24 |
-
-
|
25 |
-
-
|
26 |
-
-
|
27 |
-
-
|
28 |
-
-
|
|
|
29 |
|
30 |
[1]: https://arxiv.org/abs/2402.05755 "SpiRit-LM: Interleaved Spoken and Written Language Model"
|
31 |
[2]: https://arxiv.org/abs/2102.01192 "Generative Spoken Language Modeling from Raw Audio"
|
@@ -34,127 +37,119 @@ language:
|
|
34 |
[5]: https://arxiv.org/abs/2310.16338 "Generative Pre-training for Speech with Flow Matching"
|
35 |
|
36 |
|
37 |
-
##
|
38 |
-
|
39 |
-
|
40 |
-
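Because self-supervised pre-trained speech encoders for Chinese are scarce, especially ones that cover educational vocabulary, we developed a self-supervised speech encoder focused on semantic information, following the method of the Meta HuBERT paper. Drawing on the RVQVAE approach, we also trained an audio codec focused on acoustic information (9 codebook layers) from scratch on a large amount of Chinese speech data.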
|
41 |
-

|
42 |
|
43 |
-
|
44 |
-

|
60 |
-
[Output](assets/answer_example_1_MP3.mp3)
|
61 |
-
|
62 |
-
**Dialogue Example 2:** "The growth of the herbs here looks promising."
|
63 |
-
[Input](assets/question_example_4_MP3.mp3)
|
64 |
-
[Output](assets/answer_example_4_MP3.mp3)
|
65 |
-
|
66 |
-
### Demo Site
|
67 |
-
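The corresponding demo is deployed at https://voice-playground.91jzx.cn for hands-on trial. Because resources are limited, it supports fewer than 10 concurrent users. The deployed checkpoint is 心流知镜-s v0.2-240822-checkpoint; it will later be updated to the latest v0.2 and v0.3 versions.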
|
68 |
-
|
69 |
-
### Multi-task Evaluation
|
70 |
-
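Here, the ASR sub-task is treated as an evaluation of how well the learnable semantic information contained in speech has been captured in the representation learned during pre-training. With the current checkpoint, in the first stage of pre-training, the ASR sub-task is observed to perform roughly on par with Whisper-small. Regarding the evaluation data, the publicly available online speech data was not used in training, and none of the Wenet data took part in end-to-end training. 1,024 utterances were randomly sampled from these two sources for evaluation.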
|
71 |
-
| Dataset Source | Quantity | Chinese CER/WER |
|
72 |
-
|-------------------|---------|---------|
|
73 |
-
| Public domain random sample - test | 1024 (sampled) | 12.55% |
|
74 |
-
| WenetSpeech - test | 1024 (sampled) | 24.23% |
|
75 |
|
76 |
-
|
|
|
|
|
|
|
77 |
|
78 |
-
|
79 |
-
|
80 |
-
|
81 |
|
82 |
-
|
83 |
-
|
84 |
-
|
85 |
-
|
86 |
-
|
87 |
-
|
88 |
-
|
89 |
-
|
90 |
-
|
91 |
-
|
92 |
-
|
93 |
-
|
94 |
-
|
95 |
-
|
96 |
-
|
97 |
-
|
98 |
-
|
99 |
-
|
100 |
-
|
101 |
-
-
|
102 |
-
-
|
103 |
-
|
104 |
-
|
105 |
|
106 |
⠀
|
107 |
-
### 2024
|
108 |
-
**心流知镜-s v0.2**
|
109 |
-
- [x]
|
110 |
-
- [ ]
|
111 |
-
- [ ]
|
112 |
-
- [ ]
|
113 |
|
114 |
⠀
|
115 |
-
### 2024
|
116 |
-
**心流知镜-s v0.3**
|
117 |
-
- [ ]
|
118 |
-
- [ ]
|
119 |
-
- [ ]
|
120 |
-
- [ ]
|
121 |
|
122 |
⠀
|
123 |
-
### 2024
|
124 |
-
**心流知镜-s v0.3
|
125 |
-
- [ ]
|
126 |
-
- [ ]
|
127 |
|
128 |
⠀
|
129 |
-
### 2024
|
130 |
-
**心流知镜-s v0.4**
|
131 |
-
- [ ]
|
132 |
-
- [ ]
|
133 |
|
134 |
⠀
|
135 |
-
### 2025
|
136 |
-
**心流知镜-s v0.5**
|
137 |
-
- [ ]
|
138 |
|
139 |
⠀
|
140 |
-
### 2025
|
141 |
-
**心流知镜-s1**
|
142 |
-
- [ ]
|
143 |
-
- [ ]
|
144 |
-
|
145 |
-
|
146 |
-
|
147 |
-
|
148 |
-
-
|
149 |
-
-
|
150 |
-
-
|
151 |
-
-
|
152 |
-
|
153 |
-
|
154 |
-
|
155 |
-
|
156 |
-
DingTalk Group: 90720015617
|
157 |
-
<img src="assets/dingding_qrcode.png" alt="DingTalk Technical Group QR Code" width="200"/>
|
158 |
-
---
|
159 |
-
license: apache-2.0
|
160 |
-
---
|
|
|
1 |
+
([简体中文](./README_zh.md)|English)
|
2 |
+
|
3 |
+
[](https://huggingface.co/jzx-ai-lab/flow_mirror)
|
4 |
+
[](https://www.modelscope.cn/models/jzx-ai-lab/Flow_mirror)
|
5 |
+
[](https://github.com/jingzhunxue/flow_mirror)
|
6 |
+
[](./LICENSE)
|
7 |
+
|
8 |
[Update]
|
9 |
+
8.29: Created repository, published README & Roadmap
|
10 |
+
8.31: Released Demo Site (https://voice-playground.91jzx.cn)
|
11 |
+
9.02: Released Inference Code
|
12 |
+
9.12: Released FlowMirror-s-v0.2-checkpoint-20240828
|
13 |
|
14 |
|
15 |
+
## Motivation
|
16 |
+
While text remains the dominant form of language on the internet, many scenarios, such as teaching and medical consultations, still rely on direct verbal communication. Moreover, young children and individuals without literacy skills can engage in extensive communication and expression through listening and speaking, demonstrating that pure voice-based communication can provide sufficient intelligence for interaction. Spoken (textless) communication inherently contains rich expressive information, making it more valuable than purely ASR-converted text in scenarios like education and training.
|
|
|
17 |
|
18 |
+
Additionally, this project draws inspiration from OpenAI's GPT-4o and the capabilities shown in its demo videos of educational scenarios.
|
19 |
|
|
|
|
|
20 |
|
21 |
+
## Team
|
22 |
+
Zhejiang Jingzhunxue is a company funded by Alibaba, focusing on providing education-related hardware and software products (AI-assisted learning devices). The AI team at Jingzhunxue is dedicated to achieving proactive learning experiences comparable to or surpassing human education using AI technologies, while striving to reduce technical costs to make these solutions affordable for everyone.
|
23 |
|
24 |
+
|
25 |
+
## Background
|
26 |
+
To the best of our knowledge, the earliest end-to-end voice models originated from Meta's Speechbot GSLM series. Several relevant research papers have provided valuable references and experimental experience for our work:
|
27 |
+
- SpiritLM: Nguyen et al. (2024) explored the interleaving of spoken and written language models. [More Info][1]
|
28 |
+
- GSLM: Lakhotia et al. (2021) developed a generative spoken language model from raw audio. [More Info][2]
|
29 |
+
- AudioLM: Borsos et al. (2023) proposed a language modeling approach to audio generation. [More Info][3]
|
30 |
+
- SpeechGPT: Zhang et al. (2023) enhanced the cross-modal conversational capabilities of large language models. [More Info][4]
|
31 |
+
- SpeechFlow: Liu et al. (2024) introduced a speech generation pretraining method using flow matching. [More Info][5]
|
32 |
|
33 |
[1]: https://arxiv.org/abs/2402.05755 "SpiRit-LM: Interleaved Spoken and Written Language Model"
|
34 |
[2]: https://arxiv.org/abs/2102.01192 "Generative Spoken Language Modeling from Raw Audio"
|
|
|
37 |
[5]: https://arxiv.org/abs/2310.16338 "Generative Pre-training for Speech with Flow Matching"
|
38 |
|
39 |
|
40 |
+
## Methodology
|
41 |
+
Overall, we view the pre-training of end-to-end voice models as a process of learning representations that capture both semantic and acoustic information inherent in speech. Initializing with a text-based LLM brings the possibility of learning unified Text & Audio Representations and significantly reduces engineering complexity. Thus, we designed the overall training process in two stages as outlined below.
|
|
|
|
|
|
|
42 |
|
43 |
+
Due to the lack of self-supervised pre-trained speech encoders supporting Chinese, particularly for educational vocabulary, we developed a self-supervised speech encoder focusing on semantic information, based on the Meta HuBERT paper. Drawing inspiration from RVQVAE, we trained an audio codec focusing on acoustic information (9 layers of codebooks) from scratch using extensive Chinese speech data.
|
44 |
+

|
45 |
|
46 |
+
Based on these self-supervised pre-trained codecs, we used the Qwen2 series LLMs as initialization parameters. As shown in the figure, we adopted an asymmetric structure: the input consists primarily of semantic units, while the output includes both acoustic units and text.
|
47 |
+

|
48 |
|
49 |
+
FlowMirror-s v0.1 and v0.2 were pre-trained with 20,000 hours and 50,000 hours of speech data, respectively, and support tasks such as ASR, TTS, speech continuation, and voice dialogue. These experimental results preliminarily verify the feasibility of end-to-end voice models and demonstrate the scalability of the network design, suggesting that the model will achieve even stronger capabilities in future versions.
|
50 |
|
51 |
+
## Evaluation
|
52 |
+
Qualitative audio examples can be referenced through the following dialogues:
|
53 |
```text
|
54 |
example_1 = "人在没有目标的时候才应该有压力"
|
55 |
example_2 = "这个阶段需要学习什么知识?"
|
56 |
example_3 = "怎么把事情做对要花时间去培养"
|
57 |
example_4 = "这里的药材长势不错"
|
58 |
```
|
59 |
|
60 |
+
### Dialogue Voice Examples
|
61 |
+
**Example 1:** "People should only feel pressure when they lack a goal."
|
62 |
+
[Input](assets/question_example_1_MP3.mp3)
|
63 |
+
[Output](assets/answer_example_1_MP3.mp3)
|
64 |
|
65 |
+
**Example 2:** "The growth of the herbs here looks promising."
|
66 |
+
[Input](assets/question_example_4_MP3.mp3)
|
67 |
+
[Output](assets/answer_example_4_MP3.mp3)
|
68 |
|
69 |
+
### Demo Site
|
70 |
+
The demo is deployed at https://voice-playground.91jzx.cn. Because resources are limited, it supports fewer than 10 concurrent users. The checkpoint currently deployed is 心流知镜-s v0.2-240822-checkpoint; the deployment will later be updated to the latest v0.2 and v0.3 checkpoints.
|
71 |
+
|
72 |
+
### Multi-task Evaluation
|
73 |
+
In this project, the ASR sub-task serves as an evaluation of how well the learnable semantic information in speech is captured during pre-training. In the first stage of pre-training, the current checkpoint achieves ASR performance roughly equivalent to Whisper-small. The evaluation data consists of publicly available online speech data that was not used in training and WenetSpeech data that did not participate in end-to-end training. A random sample of 1,024 utterances from each dataset was evaluated.
|
74 |
+
| Dataset Source | Quantity | Chinese CER/WER |
|
75 |
+
|--------------------------|-----------|-------------------|
|
76 |
+
| Public Dataset - Test | 1,024 | 12.55% |
|
77 |
+
| WenetSpeech - Test | 1,024 | 24.23% |
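For readers unfamiliar with the metric, the following is a minimal sketch of how a corpus-level character error rate (CER) is commonly computed: total character-level edit distance divided by total reference characters. It is not the project's evaluation script; the function names and the sample pair are assumptions, and any text normalization applied in the actual evaluation (punctuation, numerals, and so on) is not specified in this README.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_cer(pairs):
    """Corpus-level CER over (reference, hypothesis) pairs, ignoring whitespace."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in pairs:
        ref_chars = [c for c in ref if not c.isspace()]
        hyp_chars = [c for c in hyp if not c.isspace()]
        total_edits += edit_distance(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    return total_edits / total_chars if total_chars else 0.0


# Hypothetical transcript pair: one substituted character out of nine.
pairs = [("这里的药材长势不错", "这里的药材长是不错")]
print(f"CER = {corpus_cer(pairs):.2%}")
```

Summing edits and reference lengths over the whole sample, rather than averaging per-utterance rates, keeps short utterances from dominating the score.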
|
78 |
+
|
79 |
+
Since this checkpoint is from an early epoch, it is expected that with increased training data and time, the alignment between speech semantics and text will significantly improve, even without increasing the model size.
|
80 |
+
|
81 |
+
**[TODO]**
|
82 |
+
Evaluation data from AudioBench will be added.
|
83 |
+
Note: There is an urgent need to construct a Chinese version of AudioBench for more comprehensive evaluations.
|
84 |
+
|
85 |
+
## Limitations and Drawbacks
|
86 |
+
* During the three-stage training process, we did not use conventional text LLM pre-training data. Compared to the original Qwen2 model, this may lead to decreased performance in MMLU evaluations. Future versions will aim to mitigate this.
|
87 |
+
* The current version only controls the speaker's voice timbre. Other speech characteristics such as emotion, prosody, speaking rate, pauses, non-verbal sounds, and pitch have not been fine-tuned.
|
88 |
+
* Dialogue responses are sometimes irrelevant or address the wrong topic (e.g., misinterpretations caused by homophones in speech). At this stage, this is attributable to the limited parameter size (1.5B), the skewed distribution of the pre-training speech data (not evenly distributed across conversation topics), and bottlenecks in data preprocessing. We anticipate significant improvements in this area with more, and more targeted, data.
|
89 |
+
* Multi-turn conversations are not yet supported in the current version.
|
90 |
+
* There is substantial room for improving inference speed. The current TTFB on an L20 GPU is around 670ms. We expect that with TensorRT optimization and the application of other popular techniques, overall throughput can be improved by an order of magnitude, even without quantization.
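The TTFB (time to first byte) figure quoted in the last item can be measured along the lines of the sketch below, which times the arrival of the first chunk of a streamed response. This is an illustrative assumption rather than the project's benchmarking code: `measure_ttfb`, `fake_stream`, and the sleep durations are hypothetical stand-ins for a real streaming inference call.

```python
import time


def measure_ttfb(stream_factory, clock=time.perf_counter):
    """Return (ttfb_seconds, total_seconds, n_chunks) for one streamed response.

    stream_factory is a zero-argument callable returning an iterable of chunks.
    ttfb_seconds is None when the stream yields nothing.
    """
    start = clock()
    first = None
    n_chunks = 0
    for _chunk in stream_factory():
        if first is None:
            first = clock()  # first audio chunk has arrived
        n_chunks += 1
    end = clock()
    ttfb = first - start if first is not None else None
    return ttfb, end - start, n_chunks


def fake_stream():
    """Hypothetical stand-in for a model that streams audio chunks."""
    time.sleep(0.05)  # simulated latency before the first chunk
    for _ in range(3):
        yield b"\x00" * 320
        time.sleep(0.01)  # simulated gap between chunks


ttfb, total, n_chunks = measure_ttfb(fake_stream)
print(f"TTFB {ttfb * 1000:.0f} ms, total {total * 1000:.0f} ms, {n_chunks} chunks")
```

Reporting TTFB separately from total time matters for voice dialogue, because perceived responsiveness depends on when audio starts playing rather than when generation finishes.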
|
91 |
+
|
92 |
+
## License
|
93 |
+
Since WenetSpeech data was used in the self-supervised encoder for v0.1-v0.3, the self-supervised pre-trained speech encoder and end-to-end checkpoint weight files are limited to academic use. The code is licensed under Apache 2.0.
|
94 |
+
To further promote the exploration of speech models for Chinese and Asian languages, we plan to release a new version trained on publicly collected data (excluding Wenet), providing a self-supervised encoder and decoder that is more freely usable.
|
95 |
+
|
96 |
+
## Roadmap
|
97 |
+
The project is planned as follows:
|
98 |
+
|
99 |
+
### August 2024
|
100 |
+
**心流知镜-s v0.1 & v0.2 (500M-1.5B parameters)**
|
101 |
+
- [x] Chinese self-supervised audio codec
|
102 |
+
- [x] 心流知镜-s v0.1 & v0.2 (500M-1.5B parameters)
|
103 |
+
- [x] Experience website based on WebRTC
|
104 |
+
- [x] Dual output: Speech & Text
|
105 |
|
106 |
⠀
|
107 |
+
### September 2024
|
108 |
+
**心流知镜-s v0.2**
|
109 |
+
- [x] Open-source [checkpoint](https://huggingface.co/jzx-ai-lab/flow_mirror) and inference code
|
110 |
+
- [ ] Accelerated inference version
|
111 |
+
- [ ] Support for on-device deployment
|
112 |
+
- [ ] Release self-supervised speech encoder and audio codec weights for academic use
|
113 |
|
114 |
⠀
|
115 |
+
### October 2024
|
116 |
+
**心流知镜-s v0.3**
|
117 |
+
- [ ] Enhanced for primary and secondary school subject teaching
|
118 |
+
- [ ] Support for speaker voice selection in dialogues
|
119 |
+
- [ ] Expressive speech (emotion, volume, pitch, speech rate, etc.)
|
120 |
+
- [ ] Construction of a Chinese-focused AudioBench evaluation dataset
|
121 |
|
122 |
⠀
|
123 |
+
### November 2024
|
124 |
+
**心流知镜-s v0.3 - Multilingual Version**
|
125 |
+
- [ ] Support for major languages in East Asia and globally
|
126 |
+
- [ ] Support for multilingual interactive dialogues
|
127 |
|
128 |
⠀
|
129 |
+
### December 2024
|
130 |
+
**心流知镜-s v0.4**
|
131 |
+
- [ ] Support for high-quality, full-duplex dialogues in educational scenarios
|
132 |
+
- [ ] Larger model sizes
|
133 |
|
134 |
⠀
|
135 |
+
### January 2025
|
136 |
+
**心流知镜-s v0.5**
|
137 |
+
- [ ] Support for various Chinese dialects and accents
|
138 |
|
139 |
⠀
|
140 |
+
### March 2025
|
141 |
+
**心流知镜-s1**
|
142 |
+
- [ ] Release of larger model sizes
|
143 |
+
- [ ] Expansion to visual capabilities
|
144 |
+
|
145 |
+
## Recruitment
|
146 |
+
We are hiring for the following areas, including group leader roles. Interested candidates are welcome to apply:
|
147 |
+
- Speech ASR/TTS/Dialog SLLM
|
148 |
+
- Role-playing LLM model
|
149 |
+
- Multimodal model inference acceleration
|
150 |
+
- Visual understanding and document intelligence
|
151 |
+
- General framework for character video generation
|
152 |
+
|
153 |
+
## Community
|
154 |
+
DingTalk Group: 90720015617
|
155 |
+
<img src="assets/dingding_qrcode.png" alt="DingTalk Technical Group QR Code" width="200"/>
|