BigDong funmaker committed
Commit 2aaa97c (verified) · Parent(s): ebf6ddf

Update README.md (#6)

- Update README.md (34388732f82c2593af1e603e8f6acbf69d856370)

Co-authored-by: funcy <[email protected]>

Files changed (1): README.md (+226, -192)

README.md (updated):

---
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-generation
library_name: transformers
---
<div align="center">
<img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
</div>

<p align="center">
<a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
<a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a>
</p>
<p align="center">
👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
</p>

## What's New
- [2025.06.06] The **MiniCPM4** series is released! These models deliver major efficiency improvements while maintaining strong performance at their scale, achieving over 5x generation speedup on typical end-side chips. The technical report is available [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf). 🔥🔥🔥

## MiniCPM4 Series
The MiniCPM4 series consists of highly efficient large language models (LLMs) designed explicitly for end-side devices. This efficiency comes from systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
- [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens.
- [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens. (**<-- you are here**)
- [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
- [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head trained with QAT for FRSpec, which efficiently integrates speculation and quantization to achieve ultra acceleration for MiniCPM4-8B.
- [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization applied to MiniCPM4-0.5B, compressing model parameters into ternary values and achieving a 90% reduction in bit width.
- [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization applied to MiniCPM3-1B, compressing model parameters into ternary values and achieving a 90% reduction in bit width.
- [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, accepts users' queries as input and autonomously generates trustworthy, long-form survey papers.
- [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, accepts users' queries and available MCP tools as input and autonomously calls relevant MCP tools to satisfy users' requirements.

## Introduction
MiniCPM 4 is a highly efficient edge-side large model, optimized across four dimensions: model architecture, learning algorithms, training data, and inference systems.

- 🏗️ **Efficient Model Architecture:**
  - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention architecture in which, when processing 128K-long texts, each token only needs to compute relevance against fewer than 5% of the other tokens, significantly reducing the computational overhead of long texts

- 🧠 **Efficient Learning Algorithms:**
  - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for downstream task performance, enabling more precise search over model training configurations
  - BitCPM -- Ultimate Ternary Quantization: Compresses model parameters to three values (ternary weights), achieving an extreme 90% reduction in bit width (a toy illustration of ternary quantization follows this list)
  - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing combined with a Multi-token Prediction training strategy

- 📚 **High-Quality Training Data:**
  - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data cleaning strategies based on efficient data verification, and open-sources the high-quality Chinese and English pre-training dataset [UltraFineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
  - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale, high-quality supervised fine-tuning datasets covering multiple dimensions, including knowledge-intensive, reasoning-intensive, instruction-following, long-text-understanding, and tool-calling data

- ⚡ **Efficient Inference System:**
  - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding
  - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities

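To make the BitCPM bullet concrete, here is a minimal, self-contained sketch of ternary quantization: absmean-style rounding of a weight tensor to {-1, 0, +1} with a single scale. The function name and rounding rule are illustrative choices of this sketch, not the BitCPM training recipe, which relies on quantization-aware training rather than post-hoc rounding.

```python
import torch

def ternary_quantize(weight: torch.Tensor):
    """Toy ternary quantization: map each weight to {-1, 0, +1} times one scale.

    Illustrative only: BitCPM learns ternary weights during training
    (quantization-aware training), not by rounding a trained model like this.
    """
    scale = weight.abs().mean()               # per-tensor scale (absmean)
    q = torch.round(weight / (scale + 1e-8))  # round to the nearest integer
    q = q.clamp_(-1, 1)                       # keep only -1, 0, +1
    return q, scale                           # dequantize as q * scale

w = torch.randn(4, 8)
q, s = ternary_quantize(w)
print(q.unique())                       # subset of {-1., 0., 1.}
print((q * s - w).abs().mean().item())  # reconstruction error of the toy scheme
```

Storing each weight in roughly 1.58 bits (log2 of 3) instead of 16 is where the 90% bit-width reduction quoted above comes from.
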
## Usage
### Inference with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

path = 'openbmb/MiniCPM4-0.5B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# Users can directly use the chat interface
responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
print(responds)

# Users can also use the generate interface
# messages = [
#     {"role": "user", "content": "Write an article about Artificial Intelligence."},
# ]
# prompt_text = tokenizer.apply_chat_template(
#     messages,
#     tokenize=False,
#     add_generation_prompt=True,
# )
# model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

# model_outputs = model.generate(
#     **model_inputs,
#     max_new_tokens=1024,
#     top_p=0.7,
#     temperature=0.7
# )
# output_token_ids = [
#     model_outputs[i][len(model_inputs['input_ids'][i]):] for i in range(len(model_inputs['input_ids']))
# ]

# responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
# print(responses)
```

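If you prefer tokens to be printed as they are generated rather than returned all at once, the `generate` path above can be combined with the standard Transformers `TextStreamer`. This is a generic Transformers pattern rather than anything MiniCPM4-specific; the sketch below assumes the `model`, `tokenizer`, and `device` objects from the snippet above.

```python
from transformers import TextStreamer

# Reuses `model`, `tokenizer`, and `device` from the snippet above.
messages = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]
prompt_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

# TextStreamer prints decoded tokens to stdout as soon as they are produced.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7,
    streamer=streamer,
)
```
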
### Inference with [SGLang](https://github.com/sgl-project/sglang)

For now, you need to install our forked version of SGLang.
```bash
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"
```

You can start the inference server by running the following command:
```bash
python -m sglang.launch_server --model openbmb/MiniCPM4-0.5B --trust-remote-code --port 30000 --chat-template chatml
```

Then you can use the chat interface by running the following code:
```python
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4-0.5B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
)

print(response.choices[0].message.content)
```

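The SGLang server exposes an OpenAI-compatible API, so streaming only requires passing `stream=True` to the same client. The sketch below reuses the `client` created above and is a generic OpenAI-client pattern, not a MiniCPM4-specific interface.

```python
# Stream the reply chunk by chunk instead of waiting for the full response.
stream = client.chat.completions.create(
    model="openbmb/MiniCPM4-0.5B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
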
### Inference with [vLLM](https://github.com/vllm-project/vllm)
For now, you need to install the latest version of vLLM.
```bash
pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly
```

Then you can run inference on MiniCPM4-0.5B with vLLM:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "openbmb/MiniCPM4-0.5B"
prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_batched_tokens=32768,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)

outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
```

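Because the prompts are formatted explicitly with the chat template, the same `llm` object can also serve several requests in one batched `generate` call, which is where vLLM's throughput advantage shows up. The sketch below reuses `tokenizer`, `llm`, and `sampling_params` from the snippet above; the question list is only illustrative.

```python
# Batch several chat prompts through a single llm.generate() call.
questions = [
    "Please recommend 5 tourist attractions in Beijing. ",
    "Write an article about Artificial Intelligence.",
]
batched_inputs = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in questions
]

batched_outputs = llm.generate(prompts=batched_inputs, sampling_params=sampling_params)
for question, output in zip(questions, batched_outputs):
    print(f"### {question}\n{output.outputs[0].text}\n")
```
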
Alternatively, you can start an OpenAI-compatible inference server by running the following command:
> **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens, such as the beginning-of-sequence (BOS) token, will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.

```bash
vllm serve openbmb/MiniCPM4-0.5B
```

Then you can use the chat interface by running the following code:

```python
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4-0.5B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
    extra_body=dict(add_special_tokens=True),  # Ensures special tokens (e.g., BOS) are added to the prompt
)

print(response.choices[0].message.content)
```

## Evaluation Results
On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed than similar-size models on long-text tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, MiniCPM4 achieves approximately a 7x decoding speedup over Qwen3-8B.

![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/efficiency.png?raw=true)

#### Comprehensive Evaluation
MiniCPM4 launches end-side versions at 8B and 0.5B parameter scales, both achieving best-in-class performance in their respective categories.

![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/benchmark.png?raw=true)

#### Long Text Evaluation
MiniCPM4 is pre-trained on 32K-long texts and extends its context length with YaRN. In the 128K needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.

![long-niah](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/128k-niah.png?raw=true)

## Statement
- As a language model, MiniCPM generates content by learning from a vast amount of text.
- However, it does not possess the ability to comprehend or express personal opinions or value judgments.
- Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
- Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.

## LICENSE
- This repository and the MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.

## Citation
- Please cite our [paper](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.

```bibtex
@article{minicpm4,
  title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}
```