spisupat committed on
Commit 0e12c29 · verified · 1 Parent(s): 67c2c33

Update index.html

Files changed (1)
  1. index.html +193 -16
index.html CHANGED
@@ -32,15 +32,15 @@
  <h1 class="title is-1 publication-title">Atla Selene Mini:<br>A General Purpose Evaluation Model</h1>
  <div class="is-size-5 publication-authors">
  <span class="author-block">
- <b>Andrei Alexandru</b><sup>1</sup>,</span>
+ Andrei Alexandru<sup>1</sup>,</span>
  <span class="author-block">
- <b>Antonia Calvi</b><sup>1</sup>,</span>
+ Antonia Calvi<sup>1</sup>,</span>
  <span class="author-block">
- <b>Henry Broomfield</b><sup>1</sup>,</span>
+ Henry Broomfield<sup>1</sup>,</span>
  <span class="author-block">
- <b>Jackson Golden</b><sup>1</sup>,</span>
+ Jackson Golden<sup>1</sup>,</span>
  <span class="author-block">
- <b>Kyle Dai</b><sup>1</sup>,</span>
+ Kyle Dai<sup>1</sup>,</span>
  </div>
  <div class="is-size-5 publication-authors">
  <span class="author-block">
@@ -69,10 +69,23 @@
 
  <div class="column has-text-centered">
  <div class="publication-links">
- <!-- Model Link -->
+ <!-- PDF Link -->
+ <span class="link-block">
+ <a href="arxiv_submitted.pdf" target="_blank"
+ class="external-link button is-normal is-rounded is-dark">
+ <span class="icon">
+ <i class="fas fa-file-pdf"></i>
+ </span>
+ <span>Paper</span>
+ </a>
+ </span>
+ <!-- HuggingFace Link -->
  <span class="link-block">
  <a href="https://hf.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B" target="_blank"
  class="external-link button is-normal is-rounded is-dark">
+ <span class="icon">
+ <i class="fab fa-github"></i>
+ </span>
  <span>HuggingFace</span>
  </a>
  </span>
@@ -80,6 +93,9 @@
  <span class="link-block">
  <a href="https://ollama.com/atla/selene-mini" target="_blank"
  class="external-link button is-normal is-rounded is-dark">
+ <span class="icon">
+ <i class="fas fa-code"></i>
+ </span>
  <span>Ollama</span>
  </a>
  </span>
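The hunks above replace the placeholder model link with Paper, HuggingFace, and Ollama buttons. For readers who want to try the linked checkpoint, a minimal usage sketch with the transformers library follows; the judging prompt is an illustrative assumption, not the model's documented evaluation template, so consult the model card for the recommended format.

# Minimal sketch: load the linked AtlaAI/Selene-1-Mini-Llama-3.1-8B checkpoint
# and ask it for a critique plus a score. The prompt wording below is assumed
# for illustration only; the model card defines the recommended prompt format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Evaluate the response to the instruction on a 1-5 scale. "
    "Give a brief critique, then a final line 'Score: <n>'.\n\n"
    "Instruction: Summarise the main causes of the 2008 financial crisis.\n"
    "Response: The crisis stemmed from subprime mortgage lending, securitisation "
    "of risky loans, and excessive leverage at major banks."
)
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))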
@@ -93,6 +109,13 @@
 
  <section class="section">
  <div class="container is-max-desktop">
+ <!-- Logo -->
+ <div class="columns is-centered has-text-centered">
+ <div class="column is-4">
+ <img src="figs/atla-logo.png" alt="Atla Logo">
+ </div>
+ </div>
+
  <!-- Abstract -->
  <div class="columns is-centered has-text-centered">
  <div class="column is-four-fifths">
@@ -111,12 +134,19 @@
  </div>
  </div>
 
- <!-- Figure 1 -->
- <div class="columns is-centered has-text-centered">
+ <!-- Introduction -->
+ <div class="columns is-centered">
  <div class="column is-four-fifths">
- <div class="content">
- <figure>
- <img src="/api/placeholder/800/400" alt="Performance comparison">
+ <h2 class="title is-3">Introduction</h2>
+ <div class="content has-text-justified">
+ <p>
+ Automated evaluation of large language models (LLMs) is an increasingly pertinent task as LLMs demonstrate their value across a growing array of real-world use cases. Reliable evaluation is critical to ensure that LLMs are aligned with human objectives, i.e. that these models do what they are intended to do.
+ </p>
+ <p>
+ Human evaluation is time-consuming and expensive, and scales poorly with volume and complexity – hence the need for scalable, automated techniques. As generative models have become more capable, the field has addressed this need by using LLMs themselves to evaluate other LLMs' responses, producing judgments and natural language critiques without humans in the loop – an approach also known as "LLM-as-a-judge" (LLMJ).
+ </p>
+ <figure class="image">
+ <img src="figs/Fig1.png" alt="Performance comparison">
  <figcaption>
  <b>Figure 1:</b> Atla Selene Mini outperforms current state-of-the-art SLMJs: a) Overall task-average performance, comparing Atla Selene Mini (black) with the best and most widely used SLMJs. b) Breakdown of performance by task type and benchmark.
  </figcaption>
@@ -133,12 +163,33 @@
  <p>
  Selene Mini is optimized for fast inference, high performance, and promptability. It is a general-purpose evaluator, and is trained to respond with both critiques and judgments in order to deliver actionable insights. To achieve this, we fine-tuned a Llama 3.1 8B Instruct model on a curated mixture of 16 publicly available datasets, totaling 577k data points.
  </p>
- <figure>
- <img src="/api/placeholder/800/400" alt="Data curation strategy">
+
+ <figure class="image">
+ <img src="figs/Fig2.png" alt="Data curation strategy">
  <figcaption>
  <b>Figure 2:</b> Data curation strategy: The process of transforming a candidate dataset (left) into the final training mix (right). Yellow boxes indicate filtering steps, purple represents synthetic generation of chosen and rejected pairs for preference optimization.
  </figcaption>
  </figure>
+
+ <h3 class="title is-4">Datasets</h3>
+ <p>
+ We took inspiration from the datasets used to train Foundational Large Autorater Models (FLAMe), which spanned a mix of pairwise, absolute scoring, and classification tasks. Each data point in these three task types was structured slightly differently.
+ </p>
+
+ <h3 class="title is-4">Synthetic augmentation</h3>
+ <p>
+ To construct pairs of contrasting evaluations, we generated rejected judgments that differed from the chosen ground-truth judgments in the data. For each judgment, we synthetically generated chosen and rejected chain-of-thought critiques by prompting a generation model to argue for the respective judgments.
+ </p>
+
+ <h3 class="title is-4">Filtering for quality</h3>
+ <p>
+ We used filtering strategies on both raw and synthetic data to ensure high quality. For raw data, we used ArmoRM, an off-the-shelf reward model, to score and filter four of our largest datasets that we hypothesized to contain high-variance in data quality.
+ </p>
+
+ <h3 class="title is-4">Training</h3>
+ <p>
+ We fine-tuned a Llama 3.1 8B Instruct model using the variant of DPO introduced in [citation], and refer readers to that paper for the full derivation.
+ </p>
  </div>
  </div>
  </div>
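The "Synthetic augmentation" and "Filtering for quality" paragraphs added in this hunk describe building chosen/rejected critique pairs and screening raw data with the ArmoRM reward model. A minimal Python sketch of that pipeline shape follows; the helper callables, prompt handling, and threshold are assumptions for illustration, not the paper's implementation.

# Sketch of the curation steps described above, under stated assumptions:
# `critic` stands in for a generation model prompted to argue for a judgment,
# and `reward_score` for an ArmoRM-style scorer; the 0.5 threshold is arbitrary.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # critique arguing for the ground-truth judgment
    rejected: str  # critique arguing for a contrasting judgment

def build_pair(prompt: str, gt_judgment: str, wrong_judgment: str,
               critic: Callable[[str, str], str]) -> PreferencePair:
    # Synthetic augmentation: generate chain-of-thought critiques for both the
    # chosen (ground-truth) and rejected judgments.
    chosen = critic(prompt, gt_judgment) + f"\nJudgment: {gt_judgment}"
    rejected = critic(prompt, wrong_judgment) + f"\nJudgment: {wrong_judgment}"
    return PreferencePair(prompt, chosen, rejected)

def filter_raw(examples: list[dict], reward_score: Callable[[str, str], float],
               threshold: float = 0.5) -> list[dict]:
    # Quality filtering: keep only raw examples whose reward-model score
    # clears the threshold.
    return [ex for ex in examples if reward_score(ex["prompt"], ex["response"]) >= threshold]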
@@ -152,12 +203,107 @@
  <p>
  We assess the performance of Selene Mini on 11 out-of-distribution benchmarks, spanning three different types of evaluation tasks: absolute scoring, classification, and pairwise preference.
  </p>
- <figure>
- <img src="/api/placeholder/800/400" alt="Real-world evaluation">
+
+ <div class="table-container">
+ <table class="table is-bordered is-striped is-narrow is-hoverable is-fullwidth">
+ <caption>Table 1: Detailed performance breakdown across model sizes</caption>
+ <thead>
+ <tr>
+ <th>Model</th>
+ <th colspan="2">Overall (average)</th>
+ <th colspan="3">Absolute scoring tasks</th>
+ <th colspan="6">Pairwise preference tasks</th>
+ <th colspan="2">Classification tasks</th>
+ </tr>
+ <tr>
+ <th></th>
+ <th>Tasks</th>
+ <th>Benchmarks</th>
+ <th>MT-Bench</th>
+ <th>FLASK</th>
+ <th>BiGGen</th>
+ <th>RewardB</th>
+ <th>LFQA</th>
+ <th>HHH</th>
+ <th>EvalBias</th>
+ <th>InstruSum</th>
+ <th>Auto-J</th>
+ <th>InfoBench</th>
+ <th>AggreFact</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Atla-Selene-Mini</td>
+ <td>0.756</td>
+ <td>0.753</td>
+ <td>0.746</td>
+ <td>0.613</td>
+ <td>0.584</td>
+ <td>0.891</td>
+ <td>0.688</td>
+ <td>0.900</td>
+ <td>0.863</td>
+ <td>0.732</td>
+ <td>0.576</td>
+ <td>0.915</td>
+ <td>0.778</td>
+ </tr>
+ <!-- Add other rows from the LaTeX table -->
+ </tbody>
+ </table>
+ </div>
+
+ <h3 class="title is-4">Real-world evaluation</h3>
+ <p>
+ While the performance of our SLMJ across a wide range of benchmarks offers an indication of its strong general-purpose evaluation capabilities, such benchmarks are often not entirely representative of realistic evaluation use cases.
+ </p>
+
+ <figure class="image">
+ <img src="figs/Fig3.png" alt="Real-world evaluation">
  <figcaption>
  <b>Figure 3:</b> Real-world evaluation: a) Performance on domain-specific industry benchmarks b) Performance on RewardBench with different prompt formats c) Performance measured by ELO scores in Judge Arena.
  </figcaption>
  </figure>
+
+ <div class="table-container">
+ <table class="table is-bordered is-striped is-narrow is-hoverable is-fullwidth">
+ <caption>Table 2: Industry benchmarks</caption>
+ <thead>
+ <tr>
+ <th>Model</th>
+ <th colspan="4">CRAFT-MD</th>
+ <th>Finance</th>
+ </tr>
+ <tr>
+ <th></th>
+ <th>Medical terminology</th>
+ <th>Most likely diagnosis</th>
+ <th>Relevant med. hist.</th>
+ <th>Overall</th>
+ <th>Bench</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Atla-Selene-Mini</td>
+ <td>0.92</td>
+ <td>0.62</td>
+ <td>0.68</td>
+ <td>0.74</td>
+ <td>0.717</td>
+ </tr>
+ <tr>
+ <td>LLama-3.1-8B-Instruct</td>
+ <td>0.79</td>
+ <td>0.51</td>
+ <td>0.62</td>
+ <td>0.64</td>
+ <td>0.664</td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
  </div>
  </div>
  </div>
@@ -171,7 +317,22 @@
  In this work, we introduce Atla Selene Mini, demonstrating that effective general-purpose evaluation can be achieved in smaller model architectures through principled data curation and a hybrid training objective (DPO + SFT). The model's strong performance across benchmarks, particularly on absolute scoring tasks – which represent the most common and useful form of evaluation in practice – suggests that careful attention to training data quality can be as impactful as increased model size for evaluation capabilities.
  </p>
  <p>
- Looking ahead, we anticipate two emerging frontiers that will shape the future of AI evaluation. First is the rise of agent-based systems that combine language models with external tools and APIs, creating more powerful and versatile AI systems. Second is the increasing use of inference-time compute – systems that perform additional reasoning steps during inference to generate higher-quality outputs.
+ Looking ahead, we anticipate two emerging frontiers that will shape the future of AI evaluation. First is the rise of agent-based systems that combine language models with external tools and APIs, creating more powerful and versatile AI systems. Second is the increasing use of inference-time compute – systems that perform additional reasoning steps during inference to generate higher-quality outputs. These developments will require new evaluation frameworks and capabilities. Future research could explore how evaluator models can assess not just language outputs, but entire chains of reasoning, tool usage, and multi-step processes.
+ </p>
+ <p>
+ In conclusion, Atla Selene Mini represents a significant step forward in making reliable, general-purpose LLM evaluation more accessible to the broader community. Its combination of strong performance, domain generalization, and practical usability in an open-weights model provides a valuable tool for researchers and practitioners working to improve language model capabilities and safety.
+ </p>
+ </div>
+ </div>
+ </div>
+
+ <!-- Acknowledgments -->
+ <div class="columns is-centered">
+ <div class="column is-four-fifths">
+ <h2 class="title is-3">Acknowledgments</h2>
+ <div class="content has-text-justified">
+ <p>
+ We thank Clémentine Fourrier and the HuggingFace team for their help in setting up Judge Arena. We are grateful to Juan Felipe Cerón Uribe, Seungone Kim, Shreya Shankar, Eugene Yan, Yifan Mai, Austin Xu, Peifeng Wang and the team at SalesForce for helpful discussions around evaluations. We thank Zongheng Yang, Romil Bhardwaj and the Skypilot team for their assistance with our training infrastructure.
  </p>
  </div>
  </div>
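The "Training" paragraph added earlier mentions fine-tuning with a DPO variant, and the conclusion above describes a hybrid DPO + SFT objective. A minimal sketch of such a combined loss is given below, assuming per-sequence log-probabilities from the policy and a frozen reference model; the beta and SFT weight are illustrative placeholders, and the exact variant is the one defined in the paper cited in the commit.

# Sketch of a hybrid DPO + SFT loss, assuming summed log-probabilities of the
# chosen and rejected completions under the policy and a frozen reference model.
# Hyperparameters are illustrative, not the values used by the authors.
import torch
import torch.nn.functional as F

def dpo_plus_sft_loss(policy_chosen_logps: torch.Tensor,
                      policy_rejected_logps: torch.Tensor,
                      ref_chosen_logps: torch.Tensor,
                      ref_rejected_logps: torch.Tensor,
                      beta: float = 0.1,
                      sft_weight: float = 1.0) -> torch.Tensor:
    # DPO term: push the policy to prefer the chosen completion over the
    # rejected one, measured relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_term = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # SFT term: standard negative log-likelihood on the chosen completion.
    sft_term = -policy_chosen_logps

    return (dpo_term + sft_weight * sft_term).mean()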
@@ -179,6 +340,22 @@
  </div>
  </section>
 
+ <footer class="footer">
+ <div class="container">
+ <div class="columns is-centered">
+ <div class="column is-8">
+ <div class="content">
+ <p class="has-text-centered">
+ © 2025 Atla AI
+ </p>
+ </div>
+ </div>
+ </div>
+ </div>
+ </footer>
+ </div>
+ </section>
+
  <footer class="footer">
  <div class="container">
  <div class="content has-text-centered">
 