spisupat committed
Commit 40eda1d · verified · 1 Parent(s): 0e12c29

Update index.html

Files changed (1)
  1. index.html +124 -18
index.html CHANGED
@@ -32,15 +32,15 @@
  <h1 class="title is-1 publication-title">Atla Selene Mini:<br>A General Purpose Evaluation Model</h1>
  <div class="is-size-5 publication-authors">
  <span class="author-block">
- Andrei Alexandru<sup>1</sup>,</span>
+ <b>Andrei Alexandru</b><sup>1</sup>,</span>
  <span class="author-block">
- Antonia Calvi<sup>1</sup>,</span>
+ <b>Antonia Calvi</b><sup>1</sup>,</span>
  <span class="author-block">
- Henry Broomfield<sup>1</sup>,</span>
+ <b>Henry Broomfield</b><sup>1</sup>,</span>
  <span class="author-block">
- Jackson Golden<sup>1</sup>,</span>
+ <b>Jackson Golden</b><sup>1</sup>,</span>
  <span class="author-block">
- Kyle Dai<sup>1</sup>,</span>
+ <b>Kyle Dai</b><sup>1</sup>,</span>
  </div>
  <div class="is-size-5 publication-authors">
  <span class="author-block">
@@ -111,11 +111,11 @@
  <div class="container is-max-desktop">
  <!-- Logo -->
  <div class="columns is-centered has-text-centered">
- <div class="column is-4">
- <img src="figs/atla-logo.png" alt="Atla Logo">
+ <div class="column is-2">
+ <img src="atla-logo.png" alt="Atla Logo" style="width: 50%">
  </div>
  </div>
-
+
  <!-- Abstract -->
  <div class="columns is-centered has-text-centered">
  <div class="column is-four-fifths">
@@ -146,7 +146,7 @@
  Human evaluation is time-consuming and expensive, and scales poorly with volume and complexity – hence the need for scalable, automated techniques. As generative models have become more capable, the field has addressed this need by using LLMs themselves to evaluate other LLMs' responses, producing judgments and natural language critiques without humans in the loop – an approach also known as "LLM-as-a-judge" (LLMJ).
  </p>
  <figure class="image">
- <img src="figs/Fig1.png" alt="Performance comparison">
+ <img src="Fig1.png" alt="Performance comparison">
  <figcaption>
  <b>Figure 1:</b> Atla Selene Mini outperforms current state-of-the-art SLMJs: a) Overall task-average performance, comparing Atla Selene Mini (black) with the best and most widely used SLMJs. b) Breakdown of performance by task type and benchmark.
  </figcaption>
@@ -165,7 +165,7 @@
  </p>

  <figure class="image">
- <img src="figs/Fig2.png" alt="Data curation strategy">
+ <img src="Fig2.png" alt="Data curation strategy">
  <figcaption>
  <b>Figure 2:</b> Data curation strategy: The process of transforming a candidate dataset (left) into the final training mix (right). Yellow boxes indicate filtering steps, purple represents synthetic generation of chosen and rejected pairs for preference optimization.
  </figcaption>
@@ -188,7 +188,18 @@

  <h3 class="title is-4">Training</h3>
  <p>
- We fine-tuned a Llama 3.1 8B Instruct model using the variant of DPO introduced in [citation], and refer readers to that paper for the full derivation.
+ We fine-tuned a Llama 3.1 8B Instruct model using the variant of DPO introduced in [citation], and refer readers to that paper for the full derivation. The distinction between this loss and the "vanilla" DPO loss is that it incorporates a negative log-likelihood term:
+ </p>
+ <div class="content has-text-centered">
+ <p>
+ \[\mathcal{L}_{\mathrm{DPO}+\mathrm{NLL}}=\mathcal{L}_{\mathrm{DPO}}\left((q_i^c, j_i^c), (q_i^r, j_i^r) \mid x'_i\right)+\alpha \mathcal{L}_{\mathrm{NLL}}\left(q_i^c, j_i^c \mid x'_i\right)\]
+ </p>
+ </div>
+ <p>
+ Here, \(q_i\) and \(j_i\) correspond to the chain-of-thought critique and judgment for data point \(i\), while \(x'_i\) is the prompt to the judge. The superscript refers to the chosen (\(c\)) or rejected (\(r\)) responses. Note how NLL is only applied on the chosen responses, as we did not want to increase the likelihood of poor-quality responses. \(\alpha\) is a hyperparameter that traded off the pairwise DPO loss against the ground-truth NLL loss.
+ </p>
+ <p>
+ We performed hyperparameter tuning on the following parameters: learning rate \(\eta \in\) {5.5 × 10\(^{-8}\), 1 × 10\(^{-7}\), 7 × 10\(^{-7}\) }, RPO \(\alpha \in\) {0.5, 1} and weight decay \(\in\) {0.01, 0.1}. The final values were a learning rate of 1 × 10\(^{-7}\), \(\alpha = 1\), and weight decay of 0.1. Training was conducted with a batch size of 32 for one epoch on 8 NVIDIA H100 80GB GPUs, taking 16 hours.
  </p>
  </div>
  </div>
@@ -235,21 +246,116 @@
  <tbody>
  <tr>
  <td>Atla-Selene-Mini</td>
- <td>0.756</td>
- <td>0.753</td>
- <td>0.746</td>
+ <td><b>0.756</b></td>
+ <td><b>0.753</b></td>
+ <td><b>0.746</b></td>
  <td>0.613</td>
  <td>0.584</td>
- <td>0.891</td>
+ <td><b>0.891</b></td>
  <td>0.688</td>
  <td>0.900</td>
- <td>0.863</td>
+ <td><b>0.863</b></td>
  <td>0.732</td>
  <td>0.576</td>
  <td>0.915</td>
  <td>0.778</td>
  </tr>
- <!-- Add other rows from the LaTeX table -->
+ <tr>
+ <td>SFR-LLaMA-3.1-8B-Judge</td>
+ <td>0.749</td>
+ <td>0.750</td>
+ <td>0.710</td>
+ <td>0.520</td>
+ <td>0.590</td>
+ <td>0.887</td>
+ <td>0.689</td>
+ <td><b>0.941</b></td>
+ <td>0.850</td>
+ <td><b>0.749</b></td>
+ <td>0.603</td>
+ <td><b>0.928</b></td>
+ <td>0.780</td>
+ </tr>
+ <tr>
+ <td>GPT-4o-mini</td>
+ <td>0.743</td>
+ <td>0.735</td>
+ <td>0.700</td>
+ <td><b>0.615</b></td>
+ <td><b>0.605</b></td>
+ <td>0.801</td>
+ <td><b>0.731</b></td>
+ <td>0.896</td>
+ <td>0.725</td>
+ <td>0.701</td>
+ <td><b>0.625</b></td>
+ <td>0.906</td>
+ <td><b>0.781</b></td>
+ </tr>
+ <tr>
+ <td>Llama-3.1-8B-Instruct</td>
+ <td>0.660</td>
+ <td>0.653</td>
+ <td>0.505</td>
+ <td>0.448</td>
+ <td>0.452</td>
+ <td>0.750</td>
+ <td>0.730</td>
+ <td>0.882</td>
+ <td>0.650</td>
+ <td>0.608</td>
+ <td>0.506</td>
+ <td>0.894</td>
+ <td>0.756</td>
+ </tr>
+ <tr>
+ <td>Prometheus-2-7B</td>
+ <td>0.520</td>
+ <td>0.562</td>
+ <td>0.460</td>
+ <td>0.470</td>
+ <td>0.500</td>
+ <td>0.720</td>
+ <td>0.723</td>
+ <td>0.796</td>
+ <td>0.400</td>
+ <td>0.676</td>
+ <td>0.560</td>
+ <td>0.486</td>
+ <td>0.386</td>
+ </tr>
+ <tr>
+ <td>Patronus-GLIDER-3.8B</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td><b>0.615</b></td>
+ <td>0.604</td>
+ <td>0.784</td>
+ <td>-</td>
+ <td>0.851</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td>FlowAI-Judge-3.8B</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>0.400</td>
+ <td>0.460</td>
+ <td>0.728</td>
+ <td>-</td>
+ <td>0.803</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
  </tbody>
  </table>
  </div>
@@ -260,7 +366,7 @@
260
  </p>
261
 
262
  <figure class="image">
263
- <img src="figs/Fig3.png" alt="Real-world evaluation">
264
  <figcaption>
265
  <b>Figure 3:</b> Real-world evaluation: a) Performance on domain-specific industry benchmarks b) Performance on RewardBench with different prompt formats c) Performance measured by ELO scores in Judge Arena.
266
  </figcaption>
 