	<!DOCTYPE html>
	<html>
	<head>
	<meta charset="utf-8">
	<meta name="description" content="Atla Selene Mini: A General Purpose Evaluation Model">
	<meta name="viewport" content="width=device-width, initial-scale=1">
	<title>Atla Selene Mini: A General Purpose Evaluation Model</title>

	<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
	<link rel="stylesheet" href="./static/css/bulma.min.css">
	<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
	<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
	<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
	<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
	<link rel="stylesheet" href="./static/css/index.css">
	<link rel="icon" href="./static/images/favicon.svg">

	<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
	<script defer src="./static/js/fontawesome.all.min.js"></script>
	<script src="./static/js/bulma-carousel.min.js"></script>
	<script src="./static/js/bulma-slider.min.js"></script>
	<script src="./static/js/index.js"></script>

	</head>
	<body>

	<section class="hero">
	<div class="hero-body">
	<div class="container is-max-desktop">
	<div class="columns is-centered">
	<div class="column has-text-centered">
	<h1 class="title is-1 publication-title">Atla Selene Mini:<br>A General Purpose Evaluation Model</h1>
	<div class="is-size-5 publication-authors">
	<span class="author-block">
	<b>Andrei Alexandru</b><sup>1</sup>,</span>
	<span class="author-block">
	<b>Antonia Calvi</b><sup>1</sup>,</span>
	<span class="author-block">
	<b>Henry Broomfield</b><sup>1</sup>,</span>
	<span class="author-block">
	<b>Jackson Golden</b><sup>1</sup>,</span>
	<span class="author-block">
	<b>Kyle Dai</b><sup>1</sup>,</span>
	</div>
	<div class="is-size-5 publication-authors">
	<span class="author-block">
	<b>Mathias Leys</b><sup>1</sup>,</span>
	<span class="author-block">
	<b>Maurice Burger</b><sup>1</sup>,</span>
	<span class="author-block">
	<b>Max Bartolo</b><sup>2,3</sup>,</span>
	<span class="author-block">
	<b>Roman Engeler</b><sup>1</sup>,</span>
	</div>
	<div class="is-size-5 publication-authors">
	<span class="author-block">
	<b>Sashank Pisupati</b><sup>1</sup>,</span>
	<span class="author-block">
	<b>Toby Drane</b><sup>1</sup>,</span>
	<span class="author-block">
	<b>Young Sun Park</b><sup>1</sup></span>
	</div>

	<div class="is-size-5 publication-authors">
	<span class="author-block"><sup>1</sup>atla,</span>
	<span class="author-block"><sup>2</sup>University College London,</span>
	<span class="author-block"><sup>3</sup>Cohere</span>
	</div>

	<div class="column has-text-centered">
	<div class="publication-links">
	<!-- Model Link -->
	<span class="link-block">
	<a href="https://hf.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B" target="_blank"
	class="external-link button is-normal is-rounded is-dark">
	<span>HuggingFace</span>
	</a>
	</span>
	<!-- Ollama Link -->
	<span class="link-block">
	<a href="https://ollama.com/atla/selene-mini" target="_blank"
	class="external-link button is-normal is-rounded is-dark">
	<span>Ollama</span>
	</a>
	</span>
	</div>
	</div>
	</div>
	</div>
	</div>
	</div>
	</section>

	<section class="section">
	<div class="container is-max-desktop">
	<!-- Abstract -->
	<div class="columns is-centered has-text-centered">
	<div class="column is-four-fifths">
	<h2 class="title is-3">Abstract</h2>
	<div class="content has-text-justified">
	<p>
	We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges.
	</p>
	<p>
	To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios.
	</p>
	<p>
	Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace and Ollama to encourage widespread community adoption.
	</p>
	</div>
	</div>
	</div>

	<!-- Figure 1 -->
	<div class="columns is-centered has-text-centered">
	<div class="column is-four-fifths">
	<div class="content">
	<figure>
	<img src="/api/placeholder/800/400" alt="Performance comparison">
	<figcaption>
	<b>Figure 1:</b> Atla Selene Mini outperforms current state-of-the-art SLMJs: a) Overall task-average performance, comparing Atla Selene Mini (black) with the best and most widely used SLMJs. b) Breakdown of performance by task type and benchmark.
	</figcaption>
	</figure>
	</div>
	</div>
	</div>

	<!-- Methods Section -->
	<div class="columns is-centered">
	<div class="column is-four-fifths">
	<h2 class="title is-3">Methods</h2>
	<div class="content has-text-justified">
	<p>
	Selene Mini is optimized for fast inference, high performance, and promptability. It is a general-purpose evaluator, and is trained to respond with both critiques and judgments in order to deliver actionable insights. To achieve this, we fine-tuned a Llama 3.1 8B Instruct model on a curated mixture of 16 publicly available datasets, totaling 577k data points.
	</p>
	<figure>
	<img src="/api/placeholder/800/400" alt="Data curation strategy">
	<figcaption>
	<b>Figure 2:</b> Data curation strategy: The process of transforming a candidate dataset (left) into the final training mix (right). Yellow boxes indicate filtering steps; purple boxes indicate synthetic generation of chosen and rejected pairs for preference optimization.
	</figcaption>
	</figure>
	</div>
	</div>
	</div>

	<!-- Results Section -->
	<div class="columns is-centered">
	<div class="column is-four-fifths">
	<h2 class="title is-3">Results</h2>
	<div class="content has-text-justified">
	<h3 class="title is-4">Benchmark Performance</h3>
	<p>
	We assess the performance of Selene Mini on 11 out-of-distribution benchmarks, spanning three different types of evaluation tasks: absolute scoring, classification, and pairwise preference.
	</p>
	<figure>
	<img src="/api/placeholder/800/400" alt="Real-world evaluation">
	<figcaption>
	<b>Figure 3:</b> Real-world evaluation: a) Performance on domain-specific industry benchmarks. b) Performance on RewardBench with different prompt formats. c) Performance measured by Elo scores in Judge Arena.
	</figcaption>
	</figure>
	</div>
	</div>
	</div>

	<!-- Discussion Section -->
	<div class="columns is-centered">
	<div class="column is-four-fifths">
	<h2 class="title is-3">Discussion</h2>
	<div class="content has-text-justified">
	<p>
	In this work, we introduce Atla Selene Mini, demonstrating that effective general-purpose evaluation can be achieved in smaller model architectures through principled data curation and a hybrid training objective (DPO + SFT). The model's strong performance across benchmarks, particularly on absolute scoring tasks – which represent the most common and useful form of evaluation in practice – suggests that careful attention to training data quality can be as impactful as increased model size for evaluation capabilities.
	</p>
	<p>
	Looking ahead, we anticipate two emerging frontiers that will shape the future of AI evaluation. The first is the rise of agent-based systems that combine language models with external tools and APIs, creating more powerful and versatile AI systems. The second is the increasing use of inference-time compute, in which systems perform additional reasoning steps during inference to generate higher-quality outputs.
	</p>
	</div>
	</div>
	</div>
	</div>
	</section>

	<footer class="footer">
	<div class="container">
	<div class="content has-text-centered">
	<p>
	© 2025 Atla AI
	</p>
	</div>
	</div>
	</footer>

	</body>
	</html>