Committed by rpand002 and ranarag
Commit eb67582 · verified · 1 Parent(s): 2b5e645

added evaluation results (#3)

- added evaluation results (16becfcd7acb6466cb65c09012ef31999933a256)


Co-authored-by: Anurag Roy <[email protected]>

Files changed (1)
  1. README.md +118 -4
README.md CHANGED
@@ -163,10 +163,124 @@ Developing a skill involves a combination of learning, practice, and often, feed
 
  Remember, everyone learns at their own pace, so don't compare your progress with others. The most important thing is that you're consistently moving forward.
  ```
- <!-- **Evaluation Results:**
- <TODO>
- Add the figures.
- </TODO> -->
+
+ **Evaluation Results:**
+ <table>
+
+ <thead>
+ <tr>
+ <th style="text-align:left; background-color: #001d6c; color: white;">Models</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">ArenaHard</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">Alpaca-Eval-2</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">MMLU</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">PopQA</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">TruthfulQA</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">BigBenchHard</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">DROP</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">GSM8K</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">HumanEval</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">HumanEval+</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">IFEval</th>
+ <th style="text-align:center; background-color: #001d6c; color: white;">AttaQ</th>
+ </tr></thead>
+ <tbody>
+ <tr>
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">Llama-3.1-8B-Instruct</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">36.43</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">27.22</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">69.15</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">28.79</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">52.79</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">72.66</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">61.48</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">83.24</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">85.32</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">80.15</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">79.10</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">83.43</td>
+ </tr>
+
+ <tr>
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">DeepSeek-R1-Distill-Llama-8B</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">17.17</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">21.85</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">45.80</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">13.25</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">47.43</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">65.71</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">44.46</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">72.18</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">67.54</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">62.91</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">66.50</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">42.87</td>
+ </tr>
+
+ <tr>
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">Qwen-2.5-7B-Instruct</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">25.44</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">30.34</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">74.30</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">18.12</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">63.06</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">70.40</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">54.71</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">84.46</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">93.35</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">89.91</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">74.90</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">81.90</td>
+ </tr>
+
+ <tr>
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">DeepSeek-R1-Distill-Qwen-7B</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">10.36</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">15.35</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">50.72</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">9.94</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">47.14</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">65.04</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">42.76</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">78.47</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">79.89</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">78.43</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">59.10</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">42.45</td>
+ </tr>
+
+ <tr>
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.1-8B-Instruct</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">37.58</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">27.87</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">66.84</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">28.84</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">65.92</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">68.10</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">50.78</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">79.08</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">88.82</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">84.62</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">71.20</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">85.73</td>
+ </tr>
+
+ <tr>
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.2-8B-Instruct-Preview</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">55.23</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">61.16</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">66.93</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">28.08</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">66.37</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">65.60</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">50.73</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">83.09</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">89.47</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">86.88</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">73.57</td>
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">85.99</td>
+ </tr>
+
+ </tbody></table>
 
  **Training Data:**
  Overall, our training data is largely comprised of two key sources: (1) publicly available datasets with permissive licenses, and (2) internal synthetically generated data targeted to enhance reasoning capabilities.