nm-research committed
Commit caa5c25 · verified · 1 Parent(s): 1834dc7

Add reasoning evals

Files changed (1)
  1. README.md +25 -0
README.md CHANGED
@@ -164,6 +164,31 @@ lm_eval \
  </thead>
  <tbody>
  <tr>
  <td rowspan="7"><b>OpenLLM V1</b></td>
  <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
  <td>58.79</td>
 
  </thead>
  <tbody>
  <tr>
+ <td rowspan="4"><b>Reasoning</b></td>
+ <td>AIME 2024 (pass@1)</td>
+ <td>66.67</td>
+ <td>66.04</td>
+ <td>99.06%</td>
+ </tr>
+ <tr>
+ <td>MATH-500 (pass@1)</td>
+ <td>94.66</td>
+ <td>94.95</td>
+ <td>100.31%</td>
+ </tr>
+ <tr>
+ <td>GPQA Diamond (pass@1)</td>
+ <td>59.35</td>
+ <td>57.48</td>
+ <td>96.85%</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>73.56</b></td>
+ <td><b>72.82</b></td>
+ <td><b>98.99%</b></td>
+ </tr>
+ <tr>
  <td rowspan="7"><b>OpenLLM V1</b></td>
  <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
  <td>58.79</td>
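
Review note: the percentage column in the added rows appears to be the second score divided by the first. The sketch below checks that arithmetic. It assumes the first score column is the baseline model and the second is this model; the table header sits outside this hunk, so that reading is an assumption, and the names `scores` and `recovery` are hypothetical, not part of the README.

```python
# Scores copied from the added "Reasoning" rows as (first column, second column).
# Assumption: first = baseline model, second = this model (header not in this hunk).
scores = {
    "AIME 2024 (pass@1)": (66.67, 66.04),
    "MATH-500 (pass@1)": (94.66, 94.95),
    "GPQA Diamond (pass@1)": (59.35, 57.48),
}


def recovery(baseline: float, candidate: float) -> float:
    """Return candidate/baseline as a percentage, rounded to two decimals."""
    return round(100.0 * candidate / baseline, 2)


# Per-task percentage, to compare against the third column of each row.
per_task = {name: recovery(b, c) for name, (b, c) in scores.items()}

# Unweighted means of each score column, rounded as they appear in the table.
avg_baseline = round(sum(b for b, _ in scores.values()) / len(scores), 2)
avg_candidate = round(sum(c for _, c in scores.values()) / len(scores), 2)

# Percentage for the "Average Score" row, taken from the rounded averages.
avg_recovery = recovery(avg_baseline, avg_candidate)

for name, pct in per_task.items():
    print(f"{name}: {pct:.2f}%")
print(f"Average Score: {avg_baseline:.2f} vs {avg_candidate:.2f} ({avg_recovery:.2f}%)")
```

Under these assumptions the computed values agree with the figures in the added rows, so the table is internally consistent. The average percentage here is derived from the rounded averages, which is one plausible way the 98.99% was obtained.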