added evaluation results

#3
by ranarag - opened
Files changed (1)
  1. README.md +118 -4
README.md CHANGED
@@ -163,10 +163,124 @@ Developing a skill involves a combination of learning, practice, and often, feed
163
 
164
  Remember, everyone learns at their own pace, so don't compare your progress with others. The most important thing is that you're consistently moving forward.
165
  ```
166
- <!-- **Evaluation Results:**
167
- <TODO>
168
- Add the figures.
169
- </TODO> -->
170
 
171
  **Training Data:**
172
  Overall, our training data is largely comprised of two key sources: (1) publicly available datasets with permissive licenses, and (2) internal synthetically generated data targeted to enhance reasoning capabilities.
 
163
 
164
  Remember, everyone learns at their own pace, so don't compare your progress with others. The most important thing is that you're consistently moving forward.
165
  ```
166
+
167
+ **Evaluation Results:**
168
+ <table>
169
+
170
+ <thead>
171
+ <tr>
172
+ <th style="text-align:left; background-color: #001d6c; color: white;">Models</th>
173
+ <th style="text-align:center; background-color: #001d6c; color: white;">ArenaHard</th>
174
+ <th style="text-align:center; background-color: #001d6c; color: white;">Alpaca-Eval-2</th>
175
+ <th style="text-align:center; background-color: #001d6c; color: white;">MMLU</th>
176
+ <th style="text-align:center; background-color: #001d6c; color: white;">PopQA</th>
177
+ <th style="text-align:center; background-color: #001d6c; color: white;">TruthfulQA</th>
178
+ <th style="text-align:center; background-color: #001d6c; color: white;">BigBenchHard</th>
179
+ <th style="text-align:center; background-color: #001d6c; color: white;">DROP</th>
180
+ <th style="text-align:center; background-color: #001d6c; color: white;">GSM8K</th>
181
+ <th style="text-align:center; background-color: #001d6c; color: white;">HumanEval</th>
182
+ <th style="text-align:center; background-color: #001d6c; color: white;">HumanEval+</th>
183
+ <th style="text-align:center; background-color: #001d6c; color: white;">IFEval</th>
184
+ <th style="text-align:center; background-color: #001d6c; color: white;">AttaQ</th>
185
+ </tr></thead>
186
+ <tbody>
187
+ <tr>
188
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">Llama-3.1-8B-Instruct</td>
189
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">36.43</td>
190
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">27.22</td>
191
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">69.15</td>
192
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">28.79</td>
193
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">52.79</td>
194
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">72.66</td>
195
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">61.48</td>
196
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">83.24</td>
197
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">85.32</td>
198
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">80.15</td>
199
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">79.10</td>
200
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">83.43</td>
201
+ </tr>
202
+
203
+ <tr>
204
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">DeepSeek-R1-Distill-Llama-8B</td>
205
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">17.17</td>
206
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">21.85</td>
207
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">45.80</td>
208
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">13.25</td>
209
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">47.43</td>
210
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">65.71</td>
211
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">44.46</td>
212
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">72.18</td>
213
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">67.54</td>
214
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">62.91</td>
215
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">66.50</td>
216
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">42.87</td>
217
+ </tr>
218
+
219
+ <tr>
220
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">Qwen-2.5-7B-Instruct</td>
221
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">25.44</td>
222
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">30.34</td>
223
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">74.30</td>
224
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">18.12</td>
225
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">63.06</td>
226
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">70.40</td>
227
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">54.71</td>
228
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">84.46</td>
229
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">93.35</td>
230
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">89.91</td>
231
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">74.90</td>
232
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">81.90</td>
233
+ </tr>
234
+
235
+ <tr>
236
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">DeepSeek-R1-Distill-Qwen-7B</td>
237
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">10.36</td>
238
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">15.35</td>
239
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">50.72</td>
240
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">9.94</td>
241
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">47.14</td>
242
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">65.04</td>
243
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">42.76</td>
244
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">78.47</td>
245
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">79.89</td>
246
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">78.43</td>
247
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">59.10</td>
248
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">42.45</td>
249
+ </tr>
250
+
251
+ <tr>
252
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.1-8B-Instruct</td>
253
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">37.58</td>
254
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">27.87</td>
255
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">66.84</td>
256
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">28.84</td>
257
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">65.92</td>
258
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">68.10</td>
259
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">50.78</td>
260
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">79.08</td>
261
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">88.82</td>
262
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">84.62</td>
263
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">71.20</td>
264
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">85.73</td>
265
+ </tr>
266
+
267
+ <tr>
268
+ <td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.2-8B-Instruct-Preview</td>
269
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">55.23</td>
270
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">61.16</td>
271
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">66.93</td>
272
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">28.08</td>
273
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">66.37</td>
274
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">65.60</td>
275
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">50.73</td>
276
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">83.09</td>
277
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">89.47</td>
278
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">86.88</td>
279
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">73.57</td>
280
+ <td style="text-align:center; background-color: #DAE8FF; color: black;">85.99</td>
281
+ </tr>
282
+
283
+ </tbody></table>
284
 
285
  **Training Data:**
286
  Overall, our training data is largely comprised of two key sources: (1) publicly available datasets with permissive licenses, and (2) internal synthetically generated data targeted to enhance reasoning capabilities.