Update index.html
index.html (+124 -18) CHANGED
@@ -32,15 +32,15 @@
 <h1 class="title is-1 publication-title">Atla Selene Mini:<br>A General Purpose Evaluation Model</h1>
 <div class="is-size-5 publication-authors">
 <span class="author-block">
-Andrei Alexandru
+<b>Andrei Alexandru</b><sup>1</sup>,</span>
 <span class="author-block">
-Antonia Calvi
+<b>Antonia Calvi</b><sup>1</sup>,</span>
 <span class="author-block">
-Henry Broomfield
+<b>Henry Broomfield</b><sup>1</sup>,</span>
 <span class="author-block">
-Jackson Golden
+<b>Jackson Golden</b><sup>1</sup>,</span>
 <span class="author-block">
-Kyle Dai
+<b>Kyle Dai</b><sup>1</sup>,</span>
 </div>
 <div class="is-size-5 publication-authors">
 <span class="author-block">
@@ -111,11 +111,11 @@
 <div class="container is-max-desktop">
 <!-- Logo -->
 <div class="columns is-centered has-text-centered">
-<div class="column is-
-<img src="
+<div class="column is-2">
+<img src="atla-logo.png" alt="Atla Logo" style="width: 50%">
 </div>
 </div>
-
+
 <!-- Abstract -->
 <div class="columns is-centered has-text-centered">
 <div class="column is-four-fifths">
@@ -146,7 +146,7 @@
 Human evaluation is time-consuming and expensive, and scales poorly with volume and complexity – hence the need for scalable, automated techniques. As generative models have become more capable, the field has addressed this need by using LLMs themselves to evaluate other LLMs' responses, producing judgments and natural language critiques without humans in the loop – an approach also known as "LLM-as-a-judge" (LLMJ).
 </p>
 <figure class="image">
-<img src="
+<img src="Fig1.png" alt="Performance comparison">
 <figcaption>
 <b>Figure 1:</b> Atla Selene Mini outperforms current state-of-the-art SLMJs: a) Overall task-average performance, comparing Atla Selene Mini (black) with the best and most widely used SLMJs. b) Breakdown of performance by task type and benchmark.
 </figcaption>
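The LLMJ setup described in the paragraph above takes a rubric, the user input, and the response under test, and asks a judge model for a critique plus a score. A minimal sketch of that loop, assuming an OpenAI-compatible client; the prompt template and judge model name are illustrative assumptions, not Selene Mini's actual prompt format:

# Minimal LLM-as-a-judge (LLMJ) loop. The prompt template and model
# name below are illustrative assumptions, not Selene Mini's format.
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works here

JUDGE_PROMPT = """You are an evaluator. Judge the response against the
criteria, give a short critique, then a score from 1 to 5.

Criteria: {criteria}
User input: {user_input}
Response to evaluate: {response}

Reply as:
Critique: <text>
Score: <1-5>"""

def judge(user_input: str, response: str, criteria: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                criteria=criteria, user_input=user_input, response=response
            ),
        }],
    )
    return completion.choices[0].message.content

print(judge("What is 2 + 2?", "5", "Is the answer factually correct?"))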
@@ -165,7 +165,7 @@
 </p>
 
 <figure class="image">
-<img src="
+<img src="Fig2.png" alt="Data curation strategy">
 <figcaption>
 <b>Figure 2:</b> Data curation strategy: the process of transforming a candidate dataset (left) into the final training mix (right). Yellow boxes indicate filtering steps; purple represents synthetic generation of chosen and rejected pairs for preference optimization.
 </figcaption>
@@ -188,7 +188,18 @@
 
 <h3 class="title is-4">Training</h3>
 <p>
-We fine-tuned a Llama 3.1 8B Instruct model using the variant of DPO introduced in [citation], and refer readers to that paper for the full derivation.
+We fine-tuned a Llama 3.1 8B Instruct model using the variant of DPO introduced in [citation], and refer readers to that paper for the full derivation. The distinction between this loss and the "vanilla" DPO loss is that it incorporates a negative log-likelihood term:
+</p>
+<div class="content has-text-centered">
+<p>
+\[\mathcal{L}_{\mathrm{DPO}+\mathrm{NLL}}=\mathcal{L}_{\mathrm{DPO}}\left((q_i^c, j_i^c), (q_i^r, j_i^r) \mid x'_i\right)+\alpha \mathcal{L}_{\mathrm{NLL}}\left(q_i^c, j_i^c \mid x'_i\right)\]
+</p>
+</div>
+<p>
+Here, \(q_i\) and \(j_i\) correspond to the chain-of-thought critique and judgment for data point \(i\), while \(x'_i\) is the prompt to the judge. The superscripts denote the chosen (\(c\)) and rejected (\(r\)) responses. Note that the NLL term is applied only to the chosen responses, as we did not want to increase the likelihood of poor-quality responses. \(\alpha\) is a hyperparameter that trades off the pairwise DPO loss against the ground-truth NLL loss.
+</p>
+<p>
+We performed hyperparameter tuning over the following: learning rate \(\eta \in\) {5.5 × 10\(^{-8}\), 1 × 10\(^{-7}\), 7 × 10\(^{-7}\)}, RPO \(\alpha \in\) {0.5, 1}, and weight decay \(\in\) {0.01, 0.1}. The final values were a learning rate of 1 × 10\(^{-7}\), \(\alpha = 1\), and weight decay of 0.1. Training was conducted with a batch size of 32 for one epoch on 8 NVIDIA H100 80GB GPUs, taking 16 hours.
 </p>
 </div>
 </div>
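The DPO + NLL objective added in this hunk is straightforward to implement once sequence log-likelihoods are available. A minimal sketch, assuming chosen and rejected log-probabilities have been pre-computed under the policy and a frozen reference model; function and argument names are illustrative, and the DPO temperature \(\beta\) is left at a common default since the page does not report it:

# Sketch of the DPO + NLL objective shown above. Names are
# illustrative; this is not the authors' training code.
import torch
import torch.nn.functional as F

def dpo_nll_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # log p_ref(chosen | prompt)
    ref_rejected_logps: torch.Tensor,     # log p_ref(rejected | prompt)
    beta: float = 0.1,   # usual DPO temperature; not reported on the page
    alpha: float = 1.0,  # the final value reported above
) -> torch.Tensor:
    # Standard DPO term: widen the policy-vs-reference margin between
    # the chosen and rejected (critique, judgment) pairs.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # NLL term on the chosen responses only, so the model is never
    # pushed toward higher likelihood of poor-quality responses.
    # (In practice this term is often length-normalized.)
    nll = -policy_chosen_logps

    return (dpo + alpha * nll).mean()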
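The sweep described above amounts to a 3 × 2 × 2 grid over learning rate, RPO \(\alpha\), and weight decay. A sketch of enumerating it, where only the grid values, batch size, epoch count, and GPU count come from the text and every config key is hypothetical:

# Hyperparameter grid reported above; the config keys and the
# launch stand-in are hypothetical, not the authors' tooling.
from itertools import product

learning_rates = [5.5e-8, 1e-7, 7e-7]
rpo_alphas = [0.5, 1.0]
weight_decays = [0.01, 0.1]

for lr, alpha, wd in product(learning_rates, rpo_alphas, weight_decays):
    config = {
        "learning_rate": lr,
        "rpo_alpha": alpha,
        "weight_decay": wd,
        "global_batch_size": 32,  # from the text
        "num_epochs": 1,          # from the text
        "num_gpus": 8,            # NVIDIA H100 80GB, from the text
    }
    print(config)  # stand-in for launching a training run

# Reported best: learning_rate=1e-7, rpo_alpha=1.0, weight_decay=0.1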
@@ -235,21 +246,116 @@
 <tbody>
 <tr>
 <td>Atla-Selene-Mini</td>
-<td>0.756</td>
-<td>0.753</td>
-<td>0.746</td>
+<td><b>0.756</b></td>
+<td><b>0.753</b></td>
+<td><b>0.746</b></td>
 <td>0.613</td>
 <td>0.584</td>
-<td>0.891</td>
+<td><b>0.891</b></td>
 <td>0.688</td>
 <td>0.900</td>
-<td>0.863</td>
+<td><b>0.863</b></td>
 <td>0.732</td>
 <td>0.576</td>
 <td>0.915</td>
 <td>0.778</td>
 </tr>
-
+<tr>
+<td>SFR-LLaMA-3.1-8B-Judge</td>
+<td>0.749</td>
+<td>0.750</td>
+<td>0.710</td>
+<td>0.520</td>
+<td>0.590</td>
+<td>0.887</td>
+<td>0.689</td>
+<td><b>0.941</b></td>
+<td>0.850</td>
+<td><b>0.749</b></td>
+<td>0.603</td>
+<td><b>0.928</b></td>
+<td>0.780</td>
+</tr>
+<tr>
+<td>GPT-4o-mini</td>
+<td>0.743</td>
+<td>0.735</td>
+<td>0.700</td>
+<td><b>0.615</b></td>
+<td><b>0.605</b></td>
+<td>0.801</td>
+<td><b>0.731</b></td>
+<td>0.896</td>
+<td>0.725</td>
+<td>0.701</td>
+<td><b>0.625</b></td>
+<td>0.906</td>
+<td><b>0.781</b></td>
+</tr>
+<tr>
+<td>Llama-3.1-8B-Instruct</td>
+<td>0.660</td>
+<td>0.653</td>
+<td>0.505</td>
+<td>0.448</td>
+<td>0.452</td>
+<td>0.750</td>
+<td>0.730</td>
+<td>0.882</td>
+<td>0.650</td>
+<td>0.608</td>
+<td>0.506</td>
+<td>0.894</td>
+<td>0.756</td>
+</tr>
+<tr>
+<td>Prometheus-2-7B</td>
+<td>0.520</td>
+<td>0.562</td>
+<td>0.460</td>
+<td>0.470</td>
+<td>0.500</td>
+<td>0.720</td>
+<td>0.723</td>
+<td>0.796</td>
+<td>0.400</td>
+<td>0.676</td>
+<td>0.560</td>
+<td>0.486</td>
+<td>0.386</td>
+</tr>
+<tr>
+<td>Patronus-GLIDER-3.8B</td>
+<td>-</td>
+<td>-</td>
+<td>-</td>
+<td><b>0.615</b></td>
+<td>0.604</td>
+<td>0.784</td>
+<td>-</td>
+<td>0.851</td>
+<td>-</td>
+<td>-</td>
+<td>-</td>
+<td>-</td>
+<td>-</td>
+</tr>
+<tr>
+<td>FlowAI-Judge-3.8B</td>
+<td>-</td>
+<td>-</td>
+<td>-</td>
+<td>0.400</td>
+<td>0.460</td>
+<td>0.728</td>
+<td>-</td>
+<td>0.803</td>
+<td>-</td>
+<td>-</td>
+<td>-</td>
+<td>-</td>
+<td>-</td>
+</tr>
 </tbody>
 </table>
 </div>
@@ -260,7 +366,7 @@
 </p>
 
 <figure class="image">
-<img src="
+<img src="Fig3.png" alt="Real-world evaluation">
 <figcaption>
 <b>Figure 3:</b> Real-world evaluation: a) Performance on domain-specific industry benchmarks. b) Performance on RewardBench with different prompt formats. c) Performance measured by Elo scores in Judge Arena.
 </figcaption>