nm-research committed on
Commit 671f6fb · verified · 1 Parent(s): c61000a

Add reasoning evals
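
The percentage column in the rows this commit adds can be sanity-checked with a few lines of Python. This is only a sketch under stated assumptions: the column headers are not visible in this hunk, so it assumes the first numeric column is the baseline model's score, the second is the quantized model's score, and the percentage is their ratio ("recovery"). The names `scores`, `recovery`, and `mean` are illustrative and not taken from the repository.

```python
# Scores copied from the rows added in this commit: (first column, second column).
# Assumption: first = baseline model, second = quantized model.
scores = {
    "AIME 2024": (67.83, 65.61),
    "MATH-500": (95.29, 95.19),
    "GPQA Diamond": (65.57, 64.04),
}


def recovery(baseline: float, quantized: float) -> float:
    """Return the quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


def mean(values) -> float:
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Per-benchmark recovery, rounded to two decimals like the table.
per_task = {name: round(recovery(b, q), 2) for name, (b, q) in scores.items()}

# Column averages, then the recovery computed from those averages.
avg_baseline = round(mean(b for b, _ in scores.values()), 2)
avg_quantized = round(mean(q for _, q in scores.values()), 2)
avg_recovery = round(recovery(avg_baseline, avg_quantized), 2)

for name, value in per_task.items():
    print(f"{name}: {value}%")
print(f"Average: {avg_baseline} vs {avg_quantized}, recovery {avg_recovery}%")
```

Under these assumptions the three per-benchmark percentages and both column averages match the table. The ratio of the two averages comes out near 98.32 rather than the listed 98.23%, so that cell may have been computed differently or may be worth re-checking.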

Files changed (1)
  1. README.md +25 -0
README.md CHANGED
@@ -172,6 +172,31 @@ lm_eval \
  </thead>
  <tbody>
  <tr>
+ <td rowspan="4"><b>Reasoning</b></td>
+ <td>AIME 2024 (pass@1)</td>
+ <td>67.83</td>
+ <td>65.61</td>
+ <td>96.73%</td>
+ </tr>
+ <tr>
+ <td>MATH-500 (pass@1)</td>
+ <td>95.29</td>
+ <td>95.19</td>
+ <td>99.9%</td>
+ </tr>
+ <tr>
+ <td>GPQA Diamond (pass@1)</td>
+ <td>65.57</td>
+ <td>64.04</td>
+ <td>97.67%</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>76.23</b></td>
+ <td><b>74.95</b></td>
+ <td><b>98.23%</b></td>
+ </tr>
+ <tr>
  <td rowspan="7"><b>OpenLLM V1</b></td>
  <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
  <td>63.65</td>