Atla Selene Mini:
A General Purpose Evaluation Model

Andrei Alexandru1, Antonia Calvi1, Henry Broomfield1, Jackson Golden1, Kyle Dai1,
Mathias Leys1, Maurice Burger1, Max Bartolo2,3, Roman Engeler1,
Sashank Pisupati1, Toby Drane1, Young Sun Park1
1atla, 2University College London, 3Cohere

Abstract

We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges.

To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques, and ensures high quality through filtering and dataset ablations. We train our model with a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, producing a highly promptable evaluator that excels in real-world scenarios.

Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace and Ollama to encourage widespread community adoption.

Figure 1: Atla Selene Mini outperforms current state-of-the-art SLMJs: a) Overall task-average performance, comparing Atla Selene Mini (black) with the best and most widely used SLMJs. b) Breakdown of performance by task type and benchmark.

Methods

Selene Mini is optimized for fast inference, high performance, and promptability. It is a general-purpose evaluator, trained to respond with both critiques and judgments in order to deliver actionable insights. To achieve this, we fine-tuned a Llama 3.1 8B Instruct model on a curated mixture of 16 publicly available datasets, totaling 577k data points.
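The combined DPO and SFT objective mentioned in the abstract is not written out in this report; a common formulation, and a reasonable assumption here, adds a weighted SFT negative log-likelihood term on the chosen response to the standard DPO loss, where pi_ref is the frozen reference policy, beta the DPO temperature, and lambda a weighting hyperparameter:

\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{DPO}}(\theta) + \lambda \, \mathcal{L}_{\mathrm{SFT}}(\theta)

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, y_w)} \left[ \log \pi_\theta(y_w \mid x) \right]

Here (x, y_w, y_l) denotes a prompt together with its chosen and rejected judgments.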

Figure 2: Data curation strategy: The process of transforming a candidate dataset (left) into the final training mix (right). Yellow boxes indicate filtering steps; purple boxes indicate synthetic generation of chosen and rejected pairs for preference optimization.
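As a schematic of the synthetic-generation step in Figure 2, the Python sketch below pairs a critique arguing for the ground-truth judgment (chosen) with one arguing for an incorrect judgment (rejected). The helper llm_generate and the prompt wording are hypothetical stand-ins, not the actual pipeline:

from dataclasses import dataclass

def llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; replace with a real client."""
    raise NotImplementedError

@dataclass
class PreferencePair:
    prompt: str    # evaluation instructions plus the response to be judged
    chosen: str    # critique that argues for the ground-truth judgment
    rejected: str  # critique that argues for an incorrect judgment

def build_pair(eval_prompt: str, correct: str, incorrect: str) -> PreferencePair:
    # Steer the generator toward each judgment to obtain contrasting critiques.
    chosen = llm_generate(
        f"{eval_prompt}\n\nWrite a critique that justifies the judgment '{correct}'."
    )
    rejected = llm_generate(
        f"{eval_prompt}\n\nWrite a critique that justifies the judgment '{incorrect}'."
    )
    return PreferencePair(prompt=eval_prompt, chosen=chosen, rejected=rejected)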

Results

Benchmark Performance

We assess the performance of Selene Mini on 11 out-of-distribution benchmarks, spanning three different types of evaluation tasks: absolute scoring, classification, and pairwise preference.
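To make the three task types concrete, the templates below sketch the shape of each evaluation prompt. The wording is illustrative only, not the benchmarks' actual prompts:

# Illustrative prompt skeletons for the three evaluation task types.
TASK_TEMPLATES = {
    "absolute_scoring": (
        "Score the response from 1 to 5 against the rubric.\n"
        "Rubric: {rubric}\nResponse: {response}\n"
        "Write a critique, then reply 'Score: <1-5>'."
    ),
    "classification": (
        "Decide whether the response meets the criterion.\n"
        "Criterion: {criterion}\nResponse: {response}\n"
        "Write a critique, then reply 'Answer: yes' or 'Answer: no'."
    ),
    "pairwise_preference": (
        "Judge which response better answers the question.\n"
        "Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
        "Write a critique, then reply 'Winner: A' or 'Winner: B'."
    ),
}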

Figure 3: Real-world evaluation: a) Performance on domain-specific industry benchmarks. b) Performance on RewardBench with different prompt formats. c) Performance measured by Elo scores in Judge Arena.
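Since the weights are released on HuggingFace, running the judge on one's own data reduces to a standard generate call. A minimal sketch with the transformers library follows; the repository name and prompt wording are assumptions for illustration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Ask for a critique followed by a judgment, matching the model's
# trained output style (critique plus score).
prompt = (
    "Evaluate the response below for factual accuracy on a 1-5 scale.\n\n"
    "Question: What causes tides?\n"
    "Response: Tides are driven mainly by the Moon's gravitational pull.\n\n"
    "Write a brief critique, then reply 'Score: <1-5>'."
)
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))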

Discussion

In this work, we introduce Atla Selene Mini, demonstrating that effective general-purpose evaluation can be achieved in smaller model architectures through principled data curation and a hybrid training objective (DPO + SFT). The model's strong performance across benchmarks, particularly on absolute scoring tasks – which represent the most common and useful form of evaluation in practice – suggests that careful attention to training data quality can be as impactful as increased model size for evaluation capabilities.

Looking ahead, we anticipate two emerging frontiers that will shape the future of AI evaluation. First is the rise of agent-based systems that combine language models with external tools and APIs, creating more powerful and versatile AI systems. Second is the increasing use of inference-time compute – systems that perform additional reasoning steps during inference to generate higher-quality outputs.