added evaluation results (#3)
- added evaluation results (16becfcd7acb6466cb65c09012ef31999933a256)
Co-authored-by: Anurag Roy <[email protected]>
README.md
CHANGED
@@ -163,10 +163,124 @@ Developing a skill involves a combination of learning, practice, and often, feed
|
|
163 |
|
164 |
Remember, everyone learns at their own pace, so don't compare your progress with others. The most important thing is that you're consistently moving forward.
|
165 |
```
|
166 |
-
|
167 |
-
|
168 |
-
|
169 |
-
|
170 |
|
171 |
**Training Data:**
|
172 |
Overall, our training data comes largely from two key sources: (1) publicly available datasets with permissive licenses, and (2) internal synthetically generated data targeted to enhance reasoning capabilities.
|
|
|
163 |
|
164 |
Remember, everyone learns at their own pace, so don't compare your progress with others. The most important thing is that you're consistently moving forward.
|
165 |
```
|
166 |
+
|
167 |
+
**Evaluation Results:**
|
168 |
+
<table>
|
169 |
+
|
170 |
+
<thead>
|
171 |
+
<tr>
|
172 |
+
<th style="text-align:left; background-color: #001d6c; color: white;">Models</th>
|
173 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">ArenaHard</th>
|
174 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">Alpaca-Eval-2</th>
|
175 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">MMLU</th>
|
176 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">PopQA</th>
|
177 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">TruthfulQA</th>
|
178 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">BigBenchHard</th>
|
179 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">DROP</th>
|
180 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">GSM8K</th>
|
181 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">HumanEval</th>
|
182 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">HumanEval+</th>
|
183 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">IFEval</th>
|
184 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">AttaQ</th>
|
185 |
+
</tr></thead>
|
186 |
+
<tbody>
|
187 |
+
<tr>
|
188 |
+
<td style="text-align:left; background-color: #DAE8FF; color: black;">Llama-3.1-8B-Instruct</td>
|
189 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">36.43</td>
|
190 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">27.22</td>
|
191 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">69.15</td>
|
192 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">28.79</td>
|
193 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">52.79</td>
|
194 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">72.66</td>
|
195 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">61.48</td>
|
196 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">83.24</td>
|
197 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">85.32</td>
|
198 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">80.15</td>
|
199 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">79.10</td>
|
200 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">83.43</td>
|
201 |
+
</tr>
|
202 |
+
|
203 |
+
<tr>
|
204 |
+
<td style="text-align:left; background-color: #DAE8FF; color: black;">DeepSeek-R1-Distill-Llama-8B</td>
|
205 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">17.17</td>
|
206 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">21.85</td>
|
207 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">45.80</td>
|
208 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">13.25</td>
|
209 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">47.43</td>
|
210 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">65.71</td>
|
211 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">44.46</td>
|
212 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">72.18</td>
|
213 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">67.54</td>
|
214 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">62.91</td>
|
215 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">66.50</td>
|
216 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">42.87</td>
|
217 |
+
</tr>
|
218 |
+
|
219 |
+
<tr>
|
220 |
+
<td style="text-align:left; background-color: #DAE8FF; color: black;">Qwen-2.5-7B-Instruct</td>
|
221 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">25.44</td>
|
222 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">30.34</td>
|
223 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">74.30</td>
|
224 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">18.12</td>
|
225 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">63.06</td>
|
226 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">70.40</td>
|
227 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">54.71</td>
|
228 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">84.46</td>
|
229 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">93.35</td>
|
230 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">89.91</td>
|
231 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">74.90</td>
|
232 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">81.90</td>
|
233 |
+
</tr>
|
234 |
+
|
235 |
+
<tr>
|
236 |
+
<td style="text-align:left; background-color: #DAE8FF; color: black;">DeepSeek-R1-Distill-Qwen-7B</td>
|
237 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">10.36</td>
|
238 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">15.35</td>
|
239 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">50.72</td>
|
240 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">9.94</td>
|
241 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">47.14</td>
|
242 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">65.04</td>
|
243 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">42.76</td>
|
244 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">78.47</td>
|
245 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">79.89</td>
|
246 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">78.43</td>
|
247 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">59.10</td>
|
248 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">42.45</td>
|
249 |
+
</tr>
|
250 |
+
|
251 |
+
<tr>
|
252 |
+
<td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.1-8B-Instruct</td>
|
253 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">37.58</td>
|
254 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">27.87</td>
|
255 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">66.84</td>
|
256 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">28.84</td>
|
257 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">65.92</td>
|
258 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">68.10</td>
|
259 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">50.78</td>
|
260 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">79.08</td>
|
261 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">88.82</td>
|
262 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">84.62</td>
|
263 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">71.20</td>
|
264 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">85.73</td>
|
265 |
+
</tr>
|
266 |
+
|
267 |
+
<tr>
|
268 |
+
<td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.2-8B-Instruct-Preview</td>
|
269 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">55.23</td>
|
270 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">61.16</td>
|
271 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">66.93</td>
|
272 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">28.08</td>
|
273 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">66.37</td>
|
274 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">65.60</td>
|
275 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">50.73</td>
|
276 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">83.09</td>
|
277 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">89.47</td>
|
278 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">86.88</td>
|
279 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">73.57</td>
|
280 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">85.99</td>
|
281 |
+
</tr>
|
282 |
+
|
283 |
+
</tbody></table>
|
284 |
|
285 |
**Training Data:**
|
286 |
Overall, our training data comes largely from two key sources: (1) publicly available datasets with permissive licenses, and (2) internal synthetically generated data targeted to enhance reasoning capabilities.
|