<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description" content="Atla Selene Mini: A General Purpose Evaluation Model">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Atla Selene Mini: A General Purpose Evaluation Model</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/favicon.svg">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">Atla Selene Mini:<br>A General Purpose Evaluation Model</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
<b>Andrei Alexandru</b><sup>1</sup>,</span>
<span class="author-block">
<b>Antonia Calvi</b><sup>1</sup>,</span>
<span class="author-block">
<b>Henry Broomfield</b><sup>1</sup>,</span>
<span class="author-block">
<b>Jackson Golden</b><sup>1</sup>,</span>
<span class="author-block">
<b>Kyle Dai</b><sup>1</sup>,</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block">
<b>Mathias Leys</b><sup>1</sup>,</span>
<span class="author-block">
<b>Maurice Burger</b><sup>1</sup>,</span>
<span class="author-block">
<b>Max Bartolo</b><sup>2,3</sup>,</span>
<span class="author-block">
<b>Roman Engeler</b><sup>1</sup>,</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block">
<b>Sashank Pisupati</b><sup>1</sup>,</span>
<span class="author-block">
<b>Toby Drane</b><sup>1</sup>,</span>
<span class="author-block">
<b>Young Sun Park</b><sup>1</sup></span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block"><sup>1</sup>atla,</span>
<span class="author-block"><sup>2</sup>University College London,</span>
<span class="author-block"><sup>3</sup>Cohere</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- Model Link -->
<span class="link-block">
<a href="https://hf.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span>Hugging Face</span>
</a>
</span>
<!-- Ollama Link -->
<span class="link-block">
<a href="https://ollama.com/atla/selene-mini" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span>Ollama</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Abstract -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges.
</p>
<p>
To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios.
</p>
<p>
Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on Hugging Face and Ollama to encourage widespread community adoption.
</p>
</div>
</div>
</div>
<!-- Figure 1 -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<div class="content">
<figure>
<img src="/api/placeholder/800/400" alt="Performance comparison">
<figcaption>
<b>Figure 1:</b> Atla Selene Mini outperforms current state-of-the-art SLMJs: a) Overall task-average performance, comparing Atla Selene Mini (black) with the best and most widely used SLMJs. b) Breakdown of performance by task type and benchmark.
</figcaption>
</figure>
</div>
</div>
</div>
<!-- Methods Section -->
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Methods</h2>
<div class="content has-text-justified">
<p>
Selene Mini is optimized for fast inference, high performance, and promptability. It is a general-purpose evaluator trained to respond with both a critique and a judgment, so that its output is actionable rather than a bare score. To achieve this, we fine-tuned a Llama 3.1 8B Instruct model on a curated mixture of 16 publicly available datasets, totaling 577k data points.
</p>
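<p>
Because the model returns a critique together with a judgment, downstream code usually needs to separate the two. The following is a minimal sketch, assuming a hypothetical output layout in which the critique comes first and the judgment follows a <code>Result:</code> marker; the marker, the function name <code>split_evaluation</code>, and the example text are illustrative assumptions, not a documented format.
</p>
<pre>
```python
def split_evaluation(response, marker="Result:"):
    """Split a judge response into (critique, judgment).

    Assumes the critique comes first and the judgment follows the last
    occurrence of the marker. This layout is hypothetical.
    """
    critique, found, judgment = response.rpartition(marker)
    if not found:
        # No marker present: treat everything as critique, no judgment.
        return response.strip(), None
    return critique.strip(), judgment.strip()


# Invented example response, for illustration only.
example = "Reasoning: The answer cites the wrong year.\nResult: 2"
critique, judgment = split_evaluation(example)
```
</pre>
<p>
Splitting on the last occurrence of the marker keeps the sketch safe when the critique itself happens to quote the marker text.
</p>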
<figure>
<img src="/api/placeholder/800/400" alt="Data curation strategy">
<figcaption>
<b>Figure 2:</b> Data curation strategy: the process of transforming a candidate dataset (left) into the final training mix (right). Yellow boxes indicate filtering steps; purple boxes indicate synthetic generation of chosen and rejected pairs for preference optimization.
</figcaption>
</figure>
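<p>
The chosen and rejected pairs feed a training objective that combines DPO with SFT. As a rough sketch of what such a combined loss can look like for a single pair, the code below adds the standard DPO term to a negative log-likelihood term on the chosen response. The additive form, the function names, and the values of <code>beta</code> and <code>sft_weight</code> are assumptions for illustration, not the settings used to train Selene Mini.
</p>
<pre>
```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_sft_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp,
                 beta=0.1, sft_weight=1.0):
    """Combined DPO + SFT loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the trained policy and a frozen reference model.
    beta and sft_weight are illustrative values, not reported settings.
    """
    # DPO term: reward the policy for widening the chosen-vs-rejected
    # log-ratio margin relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    dpo = -log_sigmoid(beta * margin)
    # SFT term: negative log-likelihood of the chosen response.
    sft = -policy_chosen_logp
    return dpo + sft_weight * sft
```
</pre>
<p>
In this sketch the DPO term teaches the model to prefer the chosen evaluation over the rejected one, while the SFT term keeps the likelihood of the chosen evaluation itself from drifting down.
</p>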
</div>
</div>
</div>
<!-- Results Section -->
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Results</h2>
<div class="content has-text-justified">
<h3 class="title is-4">Benchmark Performance</h3>
<p>
We assess the performance of Selene Mini on 11 out-of-distribution benchmarks, spanning three different types of evaluation tasks: absolute scoring, classification, and pairwise preference.
</p>
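<p>
The three task types call for different ways of comparing a judge with human labels. The sketch below assumes accuracy for classification and pairwise preference, and Pearson correlation for absolute scoring; these metric choices, the helper names, and the toy data are assumptions made for illustration and are not taken from the benchmark definitions.
</p>
<pre>
```python
def accuracy(predicted, gold):
    """Fraction of items where the judge label matches the human label."""
    matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return matches / len(gold)


def pearson(xs, ys):
    """Pearson correlation between judge scores and human scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy data, invented for illustration only.
# Absolute scoring: judge and human scores on a 1-5 scale.
absolute = pearson([1, 3, 4, 5], [2, 3, 4, 5])
# Classification: judge and human yes/no labels.
classification = accuracy(["yes", "no", "yes"], ["yes", "no", "no"])
# Pairwise preference: which of two responses (A or B) is better.
pairwise = accuracy(["A", "B", "B", "A"], ["A", "B", "A", "A"])
```
</pre>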
<figure>
<img src="/api/placeholder/800/400" alt="Real-world evaluation">
<figcaption>
<b>Figure 3:</b> Real-world evaluation: a) Performance on domain-specific industry benchmarks. b) Performance on RewardBench with different prompt formats. c) Performance measured by Elo scores in Judge Arena.
</figcaption>
</figure>
</div>
</div>
</div>
<!-- Discussion Section -->
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Discussion</h2>
<div class="content has-text-justified">
<p>
In this work, we introduce Atla Selene Mini, demonstrating that effective general-purpose evaluation can be achieved in smaller model architectures through principled data curation and a hybrid training objective (DPO + SFT). The model's strong performance across benchmarks, particularly on absolute scoring tasks – which represent the most common and useful form of evaluation in practice – suggests that careful attention to training data quality can be as impactful as increased model size for evaluation capabilities.
</p>
<p>
Looking ahead, we anticipate two emerging frontiers that will shape the future of AI evaluation. First is the rise of agent-based systems that combine language models with external tools and APIs, creating more powerful and versatile AI systems. Second is the increasing use of inference-time compute – systems that perform additional reasoning steps during inference to generate higher-quality outputs.
</p>
</div>
</div>
</div>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="content has-text-centered">
<p>
© 2025 Atla AI
</p>
</div>
</div>
</footer>
</body>
</html>