Update index.html
index.html CHANGED (+182 -138)
@@ -1,153 +1,197 @@
|
|
1 |
-
|
2 |
|
3 |
-
|
4 |
-
|
5 |
-
|
6 |
-
|
7 |
-
|
8 |
-
|
9 |
-
|
10 |
-
|
11 |
-
|
12 |
-
|
13 |
-
|
14 |
-
|
15 |
-
|
16 |
-
|
17 |
-
|
18 |
-
|
19 |
-
|
20 |
-
|
21 |
-
|
22 |
-
|
23 |
-
|
24 |
-
|
25 |
-
|
26 |
-
|
27 |
-
|
28 |
-
|
29 |
-
|
30 |
-
|
31 |
-
|
32 |
-
|
33 |
-
|
34 |
-
|
35 |
-
|
36 |
-
<span> <sup>2</sup>University College London</span> •
|
37 |
-
<span> <sup>3</sup>Cohere</span>
|
38 |
-
</p>
|
39 |
-
|
40 |
-
<a href="https://atla-ai.com" className="text-blue-600 hover:underline">
|
41 |
-
atla-ai.com
|
42 |
-
</a>
|
43 |
-
</div>
|
44 |
|
45 |
-
|
46 |
-
|
47 |
-
|
48 |
-
|
49 |
-
|
50 |
-
|
51 |
-
|
52 |
-
|
53 |
-
|
54 |
-
</
|
55 |
</div>
|
56 |
-
|
57 |
-
|
58 |
-
|
59 |
-
|
60 |
-
|
61 |
-
{/* Abstract */}
|
62 |
-
<section className="mb-16">
|
63 |
-
<h2 className="text-2xl font-bold mb-4">Abstract</h2>
|
64 |
-
<div className="prose max-w-none">
|
65 |
-
<p className="mb-4">
|
66 |
-
We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges.
|
67 |
-
</p>
|
68 |
-
<p className="mb-4">
|
69 |
-
To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios.
|
70 |
-
</p>
|
71 |
-
<p>
|
72 |
-
Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace and Ollama to encourage widespread community adoption.
|
73 |
-
</p>
|
74 |
</div>
|
75 |
-
|
76 |
-
|
77 |
-
|
78 |
-
|
79 |
-
<h2 className="text-2xl font-bold mb-4">Introduction</h2>
|
80 |
-
<div className="prose max-w-none">
|
81 |
-
<p className="mb-4">
|
82 |
-
Automated evaluation of large language models (LLMs) is an increasingly pertinent task as LLMs demonstrate their value across a growing array of real-world use cases. Reliable evaluation is critical to ensure that LLMs are aligned with human objectives, i.e. that these models do what they are intended to do.
|
83 |
-
</p>
|
84 |
-
<p className="mb-4">
|
85 |
-
Human evaluation is time-consuming and expensive, and scales poorly with volume and complexity – hence the need for scalable, automated techniques. As generative models have become more capable, the field has addressed this need by using LLMs themselves to evaluate other LLMs' responses, producing judgments and natural language critiques without humans in the loop – an approach also known as "LLM-as-a-judge" (LLMJ).
|
86 |
-
</p>
|
87 |
-
<div className="my-8">
|
88 |
-
<img src="/api/placeholder/800/400" alt="Figure 1: Performance comparison" className="w-full rounded-lg shadow-lg"/>
|
89 |
-
<p className="text-sm text-gray-600 mt-2">
|
90 |
-
Figure 1: Atla Selene Mini outperforms current state-of-the-art SLMJs: a) Overall task-average performance, comparing Atla Selene Mini (black) with the best and most widely used SLMJs. b) Breakdown of performance by task type and benchmark.
|
91 |
-
</p>
|
92 |
-
</div>
|
93 |
</div>
|
94 |
-
</section>
|
95 |
|
96 |
-
|
97 |
-
|
98 |
-
|
99 |
-
|
100 |
-
<p className="mb-4">
|
101 |
-
Selene Mini is optimized for fast inference, high performance, and promptability. It is a general-purpose evaluator, and is trained to respond with both critiques and judgments in order to deliver actionable insights.
|
102 |
-
</p>
|
103 |
-
<div className="my-8">
|
104 |
-
<img src="/api/placeholder/800/400" alt="Figure 2: Data curation strategy" className="w-full rounded-lg shadow-lg"/>
|
105 |
-
<p className="text-sm text-gray-600 mt-2">
|
106 |
-
Figure 2: Data curation strategy: The process of transforming a candidate dataset (left) into the final training mix (right). Yellow boxes indicate filtering steps, purple represents synthetic generation of chosen and rejected pairs for preference optimization.
|
107 |
-
</p>
|
108 |
-
</div>
|
109 |
</div>
|
110 |
-
</section>
|
111 |
|
112 |
-
|
113 |
-
|
114 |
-
|
115 |
-
|
116 |
-
|
117 |
-
|
118 |
-
|
119 |
-
|
120 |
-
|
121 |
-
|
122 |
-
<
|
123 |
-
|
124 |
-
|
125 |
</div>
|
126 |
</div>
|
127 |
-
</
|
128 |
|
129 |
-
|
130 |
-
|
131 |
-
|
132 |
-
|
133 |
-
|
134 |
-
|
135 |
-
|
136 |
-
|
137 |
-
|
138 |
-
|
139 |
-
|
140 |
-
|
141 |
-
|
142 |
|
143 |
-
|
144 |
-
|
145 |
-
|
146 |
-
|
147 |
</div>
|
148 |
-
</
|
149 |
</div>
|
150 |
-
|
151 |
-
|
152 |
|
153 |
-
|
1 |
+
<!DOCTYPE html>
|
2 |
+
<html lang="en">
|
3 |
+
<head>
|
4 |
+
<meta charset="utf-8">
|
5 |
+
<meta name="description" content="Atla Selene Mini: A General Purpose Evaluation Model">
|
6 |
+
<meta name="viewport" content="width=device-width, initial-scale=1">
|
7 |
+
<title>Atla Selene Mini: A General Purpose Evaluation Model</title>
|
8 |
|
9 |
+
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
|
10 |
+
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/bulma/0.9.4/css/bulma.min.css">
|
11 |
+
|
12 |
+
<style>
|
13 |
+
body {
|
14 |
+
font-family: 'Noto Sans', sans-serif;
|
15 |
+
}
|
16 |
+
.publication-title {
|
17 |
+
font-family: 'Castoro', serif;
|
18 |
+
}
|
19 |
+
.author-block {
|
20 |
+
display: inline-block;
|
21 |
+
margin-right: 10px;
|
22 |
+
}
|
23 |
+
.publication-links {
|
24 |
+
margin-top: 20px;
|
25 |
+
}
|
26 |
+
.link-block {
|
27 |
+
margin-right: 8px;
|
28 |
+
}
|
29 |
+
.content figure {
|
30 |
+
margin: 20px 0;
|
31 |
+
}
|
32 |
+
.content figure img {
|
33 |
+
max-width: 100%;
|
34 |
+
height: auto;
|
35 |
+
}
|
36 |
+
.hero.is-light {
|
37 |
+
background-color: #f5f5f5;
|
38 |
+
}
|
39 |
+
</style>
|
40 |
+
</head>
|
41 |
+
<body>
|
42 |
|
43 |
+
<section class="hero">
|
44 |
+
<div class="hero-body">
|
45 |
+
<div class="container is-max-desktop">
|
46 |
+
<div class="columns is-centered">
|
47 |
+
<div class="column has-text-centered">
|
48 |
+
<h1 class="title is-1 publication-title">Atla Selene Mini:<br>A General Purpose Evaluation Model</h1>
|
49 |
+
<div class="is-size-5 publication-authors">
|
50 |
+
<span class="author-block">Andrei Alexandru<sup>1</sup>,</span>
|
51 |
+
<span class="author-block">Antonia Calvi<sup>1</sup>,</span>
|
52 |
+
<span class="author-block">Henry Broomfield<sup>1</sup>,</span>
|
53 |
+
<span class="author-block">Jackson Golden<sup>1</sup>,</span>
|
54 |
+
<span class="author-block">Kyle Dai<sup>1</sup>,</span>
|
55 |
</div>
|
56 |
+
<div class="is-size-5 publication-authors">
|
57 |
+
<span class="author-block">Mathias Leys<sup>1</sup>,</span>
|
58 |
+
<span class="author-block">Maurice Burger<sup>1</sup>,</span>
|
59 |
+
<span class="author-block">Max Bartolo<sup>2,3</sup>,</span>
|
60 |
+
<span class="author-block">Roman Engeler<sup>1</sup>,</span>
|
61 |
</div>
|
62 |
+
<div class="is-size-5 publication-authors">
|
63 |
+
<span class="author-block">Sashank Pisupati<sup>1</sup>,</span>
|
64 |
+
<span class="author-block">Toby Drane<sup>1</sup>,</span>
|
65 |
+
<span class="author-block">Young Sun Park<sup>1</sup></span>
|
66 |
</div>
|
|
|
67 |
|
68 |
+
<div class="is-size-5 publication-authors">
|
69 |
+
<span class="author-block"><sup>1</sup>atla,</span>
|
70 |
+
<span class="author-block"><sup>2</sup>University College London,</span>
|
71 |
+
<span class="author-block"><sup>3</sup>Cohere</span>
|
72 |
</div>
|
|
|
73 |
|
74 |
+
<div class="column has-text-centered">
|
75 |
+
<div class="publication-links">
|
76 |
+
<!-- Model Link -->
|
77 |
+
<span class="link-block">
|
78 |
+
<a href="https://hf.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B" target="_blank"
|
79 |
+
class="external-link button is-normal is-rounded is-dark">
|
80 |
+
<span>HuggingFace</span>
|
81 |
+
</a>
|
82 |
+
</span>
|
83 |
+
<!-- Ollama Link -->
|
84 |
+
<span class="link-block">
|
85 |
+
<a href="https://ollama.com/atla/selene-mini" target="_blank"
|
86 |
+
class="external-link button is-normal is-rounded is-dark">
|
87 |
+
<span>Ollama</span>
|
88 |
+
</a>
|
89 |
+
</span>
|
90 |
</div>
|
91 |
</div>
|
92 |
+
</div>
|
93 |
+
</div>
|
94 |
+
</div>
|
95 |
+
</div>
|
96 |
+
</section>
|
97 |
|
98 |
+
<section class="section">
|
99 |
+
<div class="container is-max-desktop">
|
100 |
+
<!-- Abstract -->
|
101 |
+
<div class="columns is-centered has-text-centered">
|
102 |
+
<div class="column is-four-fifths">
|
103 |
+
<h2 class="title is-3">Abstract</h2>
|
104 |
+
<div class="content has-text-justified">
|
105 |
+
<p>
|
106 |
+
We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges.
|
107 |
+
</p>
|
108 |
+
<p>
|
109 |
+
To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios.
|
110 |
+
</p>
|
111 |
+
<p>
|
112 |
+
Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on Hugging Face and Ollama to encourage widespread community adoption.
|
113 |
+
</p>
|
114 |
+
</div>
|
115 |
+
</div>
|
116 |
+
</div>
|
117 |
|
118 |
+
<!-- Figure 1 -->
|
119 |
+
<div class="columns is-centered has-text-centered">
|
120 |
+
<div class="column is-four-fifths">
|
121 |
+
<div class="content">
|
122 |
+
<figure>
|
123 |
+
<img src="/api/placeholder/800/400" alt="Performance comparison">
|
124 |
+
<figcaption>
|
125 |
+
<b>Figure 1:</b> Atla Selene Mini outperforms current state-of-the-art SLMJs: a) Overall task-average performance, comparing Atla Selene Mini (black) with the best and most widely used SLMJs. b) Breakdown of performance by task type and benchmark.
|
126 |
+
</figcaption>
|
127 |
+
</figure>
|
128 |
</div>
|
129 |
+
</div>
|
130 |
+
</div>
|
131 |
+
|
132 |
+
<!-- Methods Section -->
|
133 |
+
<div class="columns is-centered">
|
134 |
+
<div class="column is-four-fifths">
|
135 |
+
<h2 class="title is-3">Methods</h2>
|
136 |
+
<div class="content has-text-justified">
|
137 |
+
<p>
|
138 |
+
Selene Mini is optimized for fast inference, high performance, and promptability. It is a general-purpose evaluator, and is trained to respond with both critiques and judgments in order to deliver actionable insights. To achieve this, we fine-tuned a Llama 3.1 8B Instruct model on a curated mixture of 16 publicly available datasets, totaling 577k data points.
|
139 |
+
</p>
|
140 |
+
<figure>
|
141 |
+
<img src="/api/placeholder/800/400" alt="Data curation strategy">
|
142 |
+
<figcaption>
|
143 |
+
<b>Figure 2:</b> Data curation strategy: The process of transforming a candidate dataset (left) into the final training mix (right). Yellow boxes indicate filtering steps; purple boxes represent synthetic generation of chosen and rejected pairs for preference optimization.
|
144 |
+
</figcaption>
|
145 |
+
</figure>
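<p>
As a rough illustration of what a combined DPO and SFT objective can look like, the sketch below adds a standard DPO term to a length-normalized negative log-likelihood on the chosen response. This is a minimal sketch under stated assumptions, not the training code used for Selene Mini: the additive form, the weight <code>alpha</code>, the temperature <code>beta</code>, the length normalization, and all function names are hypothetical choices made for this example.
</p>

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy
    being trained and under a frozen reference model.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def sft_loss(pol_chosen: float, num_tokens: int) -> float:
    """Per-token negative log-likelihood of the chosen response."""
    return -pol_chosen / num_tokens


def combined_loss(pol_chosen: float, pol_rejected: float,
                  ref_chosen: float, ref_rejected: float,
                  num_chosen_tokens: int,
                  alpha: float = 1.0, beta: float = 0.1) -> float:
    """Assumed combination: DPO term plus alpha times the SFT term."""
    return (dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta)
            + alpha * sft_loss(pol_chosen, num_chosen_tokens))


if __name__ == "__main__":
    # Toy log-probabilities for a single preference pair.
    print(combined_loss(-10.0, -20.0, -12.0, -15.0, num_chosen_tokens=6))
```

<p>
When the policy and the reference assign identical log-probabilities, the DPO term equals log 2 regardless of <code>beta</code>, so early in training the SFT term dominates the gradient signal on the chosen critique and judgment.
</p>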
|
146 |
+
</div>
|
147 |
+
</div>
|
148 |
+
</div>
|
149 |
+
|
150 |
+
<!-- Results Section -->
|
151 |
+
<div class="columns is-centered">
|
152 |
+
<div class="column is-four-fifths">
|
153 |
+
<h2 class="title is-3">Results</h2>
|
154 |
+
<div class="content has-text-justified">
|
155 |
+
<h3 class="title is-4">Benchmark Performance</h3>
|
156 |
+
<p>
|
157 |
+
We assess the performance of Selene Mini on 11 out-of-distribution benchmarks, spanning three different types of evaluation tasks: absolute scoring, classification, and pairwise preference.
|
158 |
+
</p>
|
159 |
+
<figure>
|
160 |
+
<img src="/api/placeholder/800/400" alt="Real-world evaluation">
|
161 |
+
<figcaption>
|
162 |
+
<b>Figure 3:</b> Real-world evaluation: a) Performance on domain-specific industry benchmarks. b) Performance on RewardBench with different prompt formats. c) Performance measured by Elo scores in Judge Arena.
|
163 |
+
</figcaption>
|
164 |
+
</figure>
|
165 |
+
</div>
|
166 |
+
</div>
|
167 |
+
</div>
|
168 |
+
|
169 |
+
<!-- Discussion Section -->
|
170 |
+
<div class="columns is-centered">
|
171 |
+
<div class="column is-four-fifths">
|
172 |
+
<h2 class="title is-3">Discussion</h2>
|
173 |
+
<div class="content has-text-justified">
|
174 |
+
<p>
|
175 |
+
In this work, we introduce Atla Selene Mini, demonstrating that effective general-purpose evaluation can be achieved in smaller model architectures through principled data curation and a hybrid training objective (DPO + SFT). The model's strong performance across benchmarks, particularly on absolute scoring tasks – which represent the most common and useful form of evaluation in practice – suggests that careful attention to training data quality can be as impactful as increased model size for evaluation capabilities.
|
176 |
+
</p>
|
177 |
+
<p>
|
178 |
+
Looking ahead, we anticipate two emerging frontiers that will shape the future of AI evaluation. First is the rise of agent-based systems that combine language models with external tools and APIs, creating more powerful and versatile AI systems. Second is the increasing use of inference-time compute – systems that perform additional reasoning steps during inference to generate higher-quality outputs.
|
179 |
+
</p>
|
180 |
+
</div>
|
181 |
+
</div>
|
182 |
+
</div>
|
183 |
+
</div>
|
184 |
+
</section>
|
185 |
+
|
186 |
+
<footer class="footer">
|
187 |
+
<div class="container">
|
188 |
+
<div class="content has-text-centered">
|
189 |
+
<p>
|
190 |
+
© 2025 Atla AI
|
191 |
+
</p>
|
192 |
</div>
|
193 |
+
</div>
|
194 |
+
</footer>
|
195 |
|
196 |
+
</body>
|
197 |
+
</html>
|