update README
Browse files
README.md
CHANGED
@@ -29,7 +29,7 @@ license: apache-2.0
|
|
29 |
<a href="https://www.minimax.io" target="_blank" style="margin: 2px;">
|
30 |
<img alt="Homepage" src="https://img.shields.io/badge/_Homepage-MiniMax-FF4040?style=flat-square&labelColor=2C3E50&logo=&logoWidth=20" style="display: inline-block; vertical-align: middle;"/>
|
31 |
</a>
|
32 |
-
<a href="" target="_blank" style="margin: 2px;">
|
33 |
<img alt="Paper" src="https://img.shields.io/badge/📖_Paper-MiniMax--M1-FF4040?style=flat-square&labelColor=2C3E50" style="display: inline-block; vertical-align: middle;"/>
|
34 |
</a>
|
35 |
<a href="https://chat.minimax.io/" target="_blank" style="margin: 2px;">
|
@@ -95,64 +95,59 @@ foundation for next-generation language model agents to reason and tackle real-w
|
|
95 |
|
96 |
## 2. Evaluation
|
97 |
|
98 |
-
<!-- **Performance of MiniMax-M1 on core benchmarks.**
|
99 |
-
|
100 |
-
| **Tasks** | **OpenAI-o3** | **Gemini 2.5<br>Pro (06-05)** | **Claude<br>4 Opus** | **Seed-<br>Thinking-<br>v1.5** | **DeepSeek-<br>R1** | **DeepSeek-<br>R1-0528** | **Qwen3-<br>235B-A22B** | **MiniMax-<br>M1-40K** | **MiniMax-<br>M1-80K** |
|
101 |
-
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|
102 |
-
| *Extended<br>Thinking* | *100k* | *64k* | *64k* | *32k* | *32k* | *64k* | *32k* | *40K* | *80K* |
|
103 |
-
| ***Mathematics*** |
|
104 |
-
| AIME 2024 | 91.6 | 92.0 | 76.0 | 86.7 | 79.8 | 91.4 | 85.7 | 83.3 | 86.0 |
|
105 |
-
| AIME 2025 | 88.9 | 88.0 | 75.5 | 74.0 | 70.0 | 87.5 | 81.5 | 74.6 | 76.9 |
|
106 |
-
| MATH-500 | 98.1 | 98.8 | 98.2 | 96.7 | 97.3 | 98.0 | 96.2 | 96.0 | 96.8 |
|
107 |
-
| ***General Coding*** |
|
108 |
-
| LiveCodeBench<br>*(24/8~25/5)* | 75.8 | 77.1 | 56.6 | 67.5 | 55.9 | 73.1 | 65.9 | 62.3 | 65.0 |
|
109 |
-
| FullStackBench | 69.3 | -- | 70.3 | 69.9 | 70.1 | 69.4 | 62.9 | 67.6 | 68.3 |
|
110 |
-
| ***Reasoning & Knowledge*** |
|
111 |
-
| GPQA Diamond | 83.3 | 86.4 | 79.6 | 77.3 | 71.5 | 81.0 | 71.1 | 69.2 | 70.0 |
|
112 |
-
| HLE *(no tools)* | 20.3 | 21.6 | 10.7 | 8.2 | 8.6\* | 17.7\* | 7.6\* | 7.2\* | 8.4\* |
|
113 |
-
| ZebraLogic | 95.8 | 91.6 | 95.1 | 84.4 | 78.7 | 95.1 | 80.3 | 80.1 | 86.8 |
|
114 |
-
| MMLU-Pro | 85.0 | 86.0 | 85.0 | 87.0 | 84.0 | 85.0 | 83.0 | 80.6 | 81.1 |
|
115 |
-
| ***Software Engineering*** |
|
116 |
-
| SWE-bench Verified| 69.1 | 67.2 | 72.5 | 47.0 | 49.2 | 57.6 | 34.4 | 55.6 | 56.0 |
|
117 |
-
| ***Long Context*** |
|
118 |
-
| OpenAI-MRCR *(128k)* | 56.5 | 76.8 | 48.9 | 54.3 | 35.8 | 51.5 | 27.7 | 76.1 | 73.4 |
|
119 |
-
| OpenAI-MRCR *(1M)* | -- | 58.8 | -- | -- | -- | -- | -- | 58.6 | 56.2 |
|
120 |
-
| LongBench-v2 | 58.8 | 65.0 | 55.6 | 52.5 | 58.3 | 52.1 | 50.1 | 61.0 | 61.5 |
|
121 |
-
| ***Agentic Tool Use*** |
|
122 |
-
| TAU-bench *(airline)* | 52.0 | 50.0 | 59.6 | 44.0 | -- | 53.5 | 34.7 | 60.0 | 62.0 |
|
123 |
-
| TAU-bench *(retail)* | 73.9 | 67.0 | 81.4 | 55.7 | -- | 63.9 | 58.6 | 67.8 | 63.5 |
|
124 |
-
| ***Factuality*** |
|
125 |
-
| SimpleQA | 49.4 | 54.0 | -- | 12.9 | 30.1 | 27.8 | 11.0 | 17.9 | 18.5 |
|
126 |
-
| ***General Assistant*** |
|
127 |
-
| MultiChallenge | 56.5 | 51.8 | 45.8 | 43.0 | 40.7 | 45.0 | 40.0 | 44.7 | 44.7 |
|
128 |
-
|
129 |
-
\* conducted on the text-only HLE subset. -->
|
130 |
-
|
131 |
**Performance of MiniMax-M1 on core benchmarks.**
|
132 |
|
133 |
-
|
|
|
134 |
|:---|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|
135 |
-
| | *Extended Thinking* | *
|
136 |
-
| ***Mathematics*** | AIME 2024 |
|
137 |
-
| | AIME 2025 |
|
138 |
-
| | MATH-500 |
|
139 |
-
| ***General Coding*** | LiveCodeBench *(24/8~25/5)* |
|
140 |
-
| | FullStackBench |
|
141 |
-
| ***Reasoning & Knowledge***| GPQA Diamond |
|
142 |
-
| | HLE *(no tools)* |
|
143 |
-
| | ZebraLogic |
|
144 |
-
| | MMLU-Pro |
|
145 |
-
| ***Software Engineering***| SWE-bench Verified|
|
146 |
-
| ***Long Context*** | OpenAI-MRCR *(128k)* |
|
147 |
-
| | OpenAI-MRCR *(1M)* |
|
148 |
-
| | LongBench-v2 |
|
149 |
-
| ***Agentic Tool Use***| TAU-bench *(airline)* |
|
150 |
-
| | TAU-bench *(retail)* |
|
151 |
-
| ***Factuality*** | SimpleQA |
|
152 |
-
| ***General Assistant***| MultiChallenge |
|
153 |
|
154 |
\* conducted on the text-only HLE subset.
|
155 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
156 |
## 3. Deployment Guide
|
157 |
|
158 |
Download the model from HuggingFace repository:
|
|
|
29 |
<a href="https://www.minimax.io" target="_blank" style="margin: 2px;">
|
30 |
<img alt="Homepage" src="https://img.shields.io/badge/_Homepage-MiniMax-FF4040?style=flat-square&labelColor=2C3E50&logo=&logoWidth=20" style="display: inline-block; vertical-align: middle;"/>
|
31 |
</a>
|
32 |
+
<a href="./MiniMax_M1_tech_report.pdf" target="_blank" style="margin: 2px;">
|
33 |
<img alt="Paper" src="https://img.shields.io/badge/📖_Paper-MiniMax--M1-FF4040?style=flat-square&labelColor=2C3E50" style="display: inline-block; vertical-align: middle;"/>
|
34 |
</a>
|
35 |
<a href="https://chat.minimax.io/" target="_blank" style="margin: 2px;">
|
|
|
95 |
|
96 |
## 2. Evaluation
|
97 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
98 |
**Performance of MiniMax-M1 on core benchmarks.**
|
99 |
|
100 |
+
|
101 |
+
| **Category** | **Task** | **MiniMax-M1-80K** | **MiniMax-M1-40K** | **Qwen3-235B-A22B** | **DeepSeek-R1-0528** | **DeepSeek-R1** | **Seed-Thinking-v1.5** | **Claude 4 Opus** | **Gemini 2.5 Pro (06-05)** | **OpenAI-o3** |
|
102 |
|:---|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|
103 |
+
| | *Extended Thinking* | *80K* | *40K* | *32k* | *64k* | *32k* | *32k* | *64k* | *64k* | *100k* |
|
104 |
+
| ***Mathematics*** | AIME 2024 | 86.0 | 83.3 | 85.7 | 91.4 | 79.8 | 86.7 | 76.0 | 92.0 | 91.6 |
|
105 |
+
| | AIME 2025 | 76.9 | 74.6 | 81.5 | 87.5 | 70.0 | 74.0 | 75.5 | 88.0 | 88.9 |
|
106 |
+
| | MATH-500 | 96.8 | 96.0 | 96.2 | 98.0 | 97.3 | 96.7 | 98.2 | 98.8 | 98.1 |
|
107 |
+
| ***General Coding*** | LiveCodeBench *(24/8~25/5)* | 65.0 | 62.3 | 65.9 | 73.1 | 55.9 | 67.5 | 56.6 | 77.1 | 75.8 |
|
108 |
+
| | FullStackBench | 68.3 | 67.6 | 62.9 | 69.4 | 70.1 | 69.9 | 70.3 | -- | 69.3 |
|
109 |
+
| ***Reasoning & Knowledge***| GPQA Diamond | 70.0 | 69.2 | 71.1 | 81.0 | 71.5 | 77.3 | 79.6 | 86.4 | 83.3 |
|
110 |
+
| | HLE *(no tools)* | 8.4\* | 7.2\* | 7.6\* | 17.7\* | 8.6\* | 8.2 | 10.7 | 21.6 | 20.3 |
|
111 |
+
| | ZebraLogic | 86.8 | 80.1 | 80.3 | 95.1 | 78.7 | 84.4 | 95.1 | 91.6 | 95.8 |
|
112 |
+
| | MMLU-Pro | 81.1 | 80.6 | 83.0 | 85.0 | 84.0 | 87.0 | 85.0 | 86.0 | 85.0 |
|
113 |
+
| ***Software Engineering***| SWE-bench Verified| 56.0 | 55.6 | 34.4 | 57.6 | 49.2 | 47.0 | 72.5 | 67.2 | 69.1 |
|
114 |
+
| ***Long Context*** | OpenAI-MRCR *(128k)* | 73.4 | 76.1 | 27.7 | 51.5 | 35.8 | 54.3 | 48.9 | 76.8 | 56.5 |
|
115 |
+
| | OpenAI-MRCR *(1M)* | 56.2 | 58.6 | -- | -- | -- | -- | -- | 58.8 | -- |
|
116 |
+
| | LongBench-v2 | 61.5 | 61.0 | 50.1 | 52.1 | 58.3 | 52.5 | 55.6 | 65.0 | 58.8 |
|
117 |
+
| ***Agentic Tool Use***| TAU-bench *(airline)* | 62.0 | 60.0 | 34.7 | 53.5 | -- | 44.0 | 59.6 | 50.0 | 52.0 |
|
118 |
+
| | TAU-bench *(retail)* | 63.5 | 67.8 | 58.6 | 63.9 | -- | 55.7 | 81.4 | 67.0 | 73.9 |
|
119 |
+
| ***Factuality*** | SimpleQA | 18.5 | 17.9 | 11.0 | 27.8 | 30.1 | 12.9 | -- | 54.0 | 49.4 |
|
120 |
+
| ***General Assistant***| MultiChallenge | 44.7 | 44.7 | 40.0 | 45.0 | 40.7 | 43.0 | 45.8 | 51.8 | 56.5 |
|
121 |
|
122 |
\* conducted on the text-only HLE subset.
|
123 |
|
124 |
+
Our models are evaluated with temperature=1.0, top_p=0.95.
|
125 |
+
|
126 |
+
### SWE-bench methodology
|
127 |
+
We report results derived from the Agentless scaffold. Departing from the original pipeline, our methodology employs a two-stage localization process (without any embedding-based retrieval mechanisms): initial coarse-grained file localization followed by fine-grained localization to specific files and code elements. The values for our models are calculated on the subset of n=486 verified tasks which work on our infrastructure. The excluded 14 test cases that were incompatible with our internal infrastructure are:
|
128 |
+
"astropy__astropy-7606",
|
129 |
+
"astropy__astropy-8707",
|
130 |
+
"astropy__astropy-8872",
|
131 |
+
"django__django-10097",
|
132 |
+
"matplotlib__matplotlib-20488",
|
133 |
+
"psf__requests-2317",
|
134 |
+
"psf__requests-2931",
|
135 |
+
"psf__requests-5414",
|
136 |
+
"pylint-dev__pylint-6528",
|
137 |
+
"pylint-dev__pylint-7277",
|
138 |
+
"sphinx-doc__sphinx-10435",
|
139 |
+
"sphinx-doc__sphinx-7985",
|
140 |
+
"sphinx-doc__sphinx-8269",
|
141 |
+
"sphinx-doc__sphinx-8475"
|
142 |
+
|
143 |
+
### TAU-bench methodology
|
144 |
+
We evaluate TAU-Bench with GPT-4.1 as user model and without any custom tools. The maximum number of interaction steps is 40.
|
145 |
+
Our general system prompt is:
|
146 |
+
```
|
147 |
+
- In each round, you need to carefully examine the tools provided to you to determine if any can be used.
|
148 |
+
- You must adhere to all of the policies. Pay attention to the details in the terms. Solutions for most situations can be found within these policies.
|
149 |
+
```
|
150 |
+
|
151 |
## 3. Deployment Guide
|
152 |
|
153 |
Download the model from HuggingFace repository:
|