nm-research committed
Commit 05512fd · verified · 1 Parent(s): c6145eb

Update README.md

Files changed (1)
  1. README.md +143 -36
README.md CHANGED
@@ -137,39 +137,146 @@ lm_eval \

  ### Accuracy

- #### OpenLLM Leaderboard V1 evaluation scores
-
- | Metric | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | neuralmagic-ent/DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | ARC-Challenge (Acc-Norm, 25-shot) | 45.05 | 44.88 |
- | GSM8K (Strict-Match, 5-shot) | 62.77 | 61.49 |
- | HellaSwag (Acc-Norm, 10-shot) | 76.78 | 76.68 |
- | MMLU (Acc, 5-shot) | 55.65 | 55.82 |
- | TruthfulQA (MC2, 0-shot) | 50.55 | 49.92 |
- | Winogrande (Acc, 5-shot) | 68.51 | 67.72 |
- | **Average Score** | **59.88** | **59.42** |
- | **Recovery (%)** | **100.00** | **99.22** |
-
- #### OpenLLM Leaderboard V2 evaluation scores
-
-
- | Metric | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | neuralmagic-ent/DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic |
- |---------------------------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | IFEval (Inst-and-Prompt Level Strict Acc, 0-shot) | 38.34 | 38.22 |
- | BBH (Acc-Norm, 3-shot) | 38.19 | 38.32 |
- | GPQA (Acc-Norm, 0-shot) | 28.87 | 27.56 |
- | MUSR (Acc-Norm, 0-shot) | 33.31 | 33.71 |
- | MMLU-Pro (Acc, 5-shot) | 20.10 | 21.39 |
- | **Average Score** | **26.47** | **26.53** |
- | **Recovery (%)** | **100.00** | **100.24** |
-
- #### Coding evaluation scores
-
- | Metric | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | neuralmagic-ent/DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic |
- |---------------------------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | HumanEval pass@1 | 49.90 | 51.20 |
- | HumanEval pass@10 | 68.90 | 68.20 |
- | HumanEval+ pass@1 | 44.10 | 46.60 |
- | HumanEval+ pass@10 | 62.90 | 62.70 |
- | **Average Score** | **56.45** | **57.17** |
- | **Recovery (%)** | **100.00** | **101.27** |
+ <table>
+ <thead>
+ <tr>
+ <th>Category</th>
+ <th>Metric</th>
+ <th>deepseek-ai/DeepSeek-R1-Distill-Llama-8B</th>
+ <th>neuralmagic-ent/DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic</th>
+ <th>Recovery</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td rowspan="3"><b>Reasoning</b></td>
+ <td>AIME 2024 (pass@1)</td>
+ <td>50.00</td>
+ <td>30.00</td>
+ <td>60.0%</td>
+ </tr>
+ <tr>
+ <td>MATH-500 (pass@1)</td>
+ <td>87.60</td>
+ <td>84.60</td>
+ <td>96.6%</td>
+ </tr>
+ <tr>
+ <td>GPQA Diamond (pass@1)</td>
+ <td>44.95</td>
+ <td>43.94</td>
+ <td>97.8%</td>
+ </tr>
+ <tr>
+ <td rowspan="7"><b>OpenLLM V1</b></td>
+ <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+ <td>45.05</td>
+ <td>44.88</td>
+ <td>99.6%</td>
+ </tr>
+ <tr>
+ <td>GSM8K (Strict-Match, 5-shot)</td>
+ <td>62.77</td>
+ <td>61.49</td>
+ <td>98.0%</td>
+ </tr>
+ <tr>
+ <td>HellaSwag (Acc-Norm, 10-shot)</td>
+ <td>76.78</td>
+ <td>76.68</td>
+ <td>99.9%</td>
+ </tr>
+ <tr>
+ <td>MMLU (Acc, 5-shot)</td>
+ <td>55.65</td>
+ <td>55.82</td>
+ <td>100.3%</td>
+ </tr>
+ <tr>
+ <td>TruthfulQA (MC2, 0-shot)</td>
+ <td>50.55</td>
+ <td>49.92</td>
+ <td>98.8%</td>
+ </tr>
+ <tr>
+ <td>Winogrande (Acc, 5-shot)</td>
+ <td>68.51</td>
+ <td>67.72</td>
+ <td>98.8%</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>59.88</b></td>
+ <td><b>59.42</b></td>
+ <td><b>99.2%</b></td>
+ </tr>
+ <tr>
+ <td rowspan="7"><b>OpenLLM V2</b></td>
+ <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
+ <td>38.34</td>
+ <td>38.22</td>
+ <td>99.7%</td>
+ </tr>
+ <tr>
+ <td>BBH (Acc-Norm, 3-shot)</td>
+ <td>38.19</td>
+ <td>38.32</td>
+ <td>100.3%</td>
+ </tr>
+ <tr>
+ <td>Math-Hard (Exact-Match, 4-shot)</td>
+ <td>0.00</td>
+ <td>0.00</td>
+ <td>---</td>
+ </tr>
+ <tr>
+ <td>GPQA (Acc-Norm, 0-shot)</td>
+ <td>28.87</td>
+ <td>27.56</td>
+ <td>95.5%</td>
+ </tr>
+ <tr>
+ <td>MUSR (Acc-Norm, 0-shot)</td>
+ <td>33.31</td>
+ <td>33.71</td>
+ <td>101.2%</td>
+ </tr>
+ <tr>
+ <td>MMLU-Pro (Acc, 5-shot)</td>
+ <td>20.10</td>
+ <td>21.39</td>
+ <td>106.4%</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>26.47</b></td>
+ <td><b>26.53</b></td>
+ <td><b>100.2%</b></td>
+ </tr>
+ <tr>
+ <td rowspan="4"><b>Coding</b></td>
+ <td>HumanEval (pass@1)</td>
+ <td>49.90</td>
+ <td>51.20</td>
+ <td>102.6%</td>
+ </tr>
+ <tr>
+ <td>HumanEval (pass@10)</td>
+ <td>68.90</td>
+ <td>68.20</td>
+ <td>99.0%</td>
+ </tr>
+ <tr>
+ <td>HumanEval+ (pass@1)</td>
+ <td>44.10</td>
+ <td>46.60</td>
+ <td>105.7%</td>
+ </tr>
+ <tr>
+ <td>HumanEval+ (pass@10)</td>
+ <td>62.90</td>
+ <td>62.70</td>
+ <td>99.7%</td>
+ </tr>
+ </tbody>
+ </table>
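
A note on reading the new table: the Recovery column is the FP8-Dynamic score expressed as a percentage of the unquantized baseline score. A minimal sketch of that arithmetic (the `recovery` helper is illustrative only, not part of the model card or its eval harness):

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the unquantized baseline."""
    return quantized / baseline * 100.0

# Spot-check a few rows from the table above.
print(f"GSM8K:          {recovery(62.77, 61.49):.1f}%")  # 98.0%
print(f"OpenLLM V1 avg: {recovery(59.88, 59.42):.1f}%")  # 99.2%
print(f"AIME 2024:      {recovery(50.00, 30.00):.1f}%")  # 60.0%
```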