import React from 'react';
const TechReport = () => {
return (
<div className="min-h-screen bg-white">
{/* Header/Hero Section */}
<section className="py-16 bg-gray-50">
<div className="container mx-auto px-4 max-w-4xl">
<h1 className="text-4xl font-bold text-center mb-8">
Atla Selene Mini:<br/>A General Purpose Evaluation Model
</h1>
{/* Authors */}
<div className="text-center mb-8">
<p className="mb-4">
<span>Andrei Alexandru<sup>1</sup></span>
<span> Antonia Calvi<sup>1</sup></span>
<span> Henry Broomfield<sup>1</sup></span>
<span> Jackson Golden<sup>1</sup></span>
<span> Kyle Dai<sup>1</sup></span>
</p>
<p className="mb-4">
<span className="font-semibold">Mathias Leys<sup>1</sup></span>
<span className="font-semibold"> Maurice Burger<sup>1</sup></span>
<span className="font-semibold"> Max Bartolo<sup>2,3</sup></span>
<span className="font-semibold"> Roman Engeler<sup>1</sup></span>
</p>
<p className="mb-4">
<span className="font-semibold">Sashank Pisupati<sup>1</sup></span>
<span className="font-semibold"> Toby Drane<sup>1</sup></span>
<span className="font-semibold"> Young Sun Park<sup>1</sup></span>
</p>
<p className="text-sm">
<span><sup>1</sup>atla</span>
<span> <sup>2</sup>University College London</span>
<span> <sup>3</sup>Cohere</span>
</p>
<a href="https://atla-ai.com" className="text-blue-600 hover:underline">
atla-ai.com
</a>
</div>
{/* Links */}
<div className="flex justify-center gap-4">
<a href="https://hf.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B"
className="px-4 py-2 bg-gray-900 text-white rounded-full hover:bg-gray-800 transition">
HuggingFace
</a>
<a href="https://ollama.com/atla/selene-mini"
className="px-4 py-2 bg-gray-900 text-white rounded-full hover:bg-gray-800 transition">
Ollama
</a>
</div>
</div>
</section>
{/* Main Content */}
<main className="container mx-auto px-4 max-w-4xl py-12">
{/* Abstract */}
<section className="mb-16">
<h2 className="text-2xl font-bold mb-4">Abstract</h2>
<div className="prose max-w-none">
<p className="mb-4">
We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges.
</p>
<p className="mb-4">
To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios.
</p>
<p>
Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace and Ollama to encourage widespread community adoption.
</p>
</div>
</section>
{/* Introduction */}
<section className="mb-16">
<h2 className="text-2xl font-bold mb-4">Introduction</h2>
<div className="prose max-w-none">
<p className="mb-4">
Automated evaluation of large language models (LLMs) is an increasingly pertinent task as LLMs demonstrate their value across a growing array of real-world use cases. Reliable evaluation is critical to ensure that LLMs are aligned with human objectives, i.e., that these models do what they are intended to do.
</p>
<p className="mb-4">
Human evaluation is time-consuming and expensive, and scales poorly with volume and complexity – hence the need for scalable, automated techniques. As generative models have become more capable, the field has addressed this need by using LLMs themselves to evaluate other LLMs' responses, producing judgments and natural language critiques without humans in the loop – an approach also known as "LLM-as-a-judge" (LLMJ).
</p>
<div className="my-8">
<img src="/api/placeholder/800/400" alt="Figure 1: Performance comparison" className="w-full rounded-lg shadow-lg"/>
<p className="text-sm text-gray-600 mt-2">
Figure 1: Atla Selene Mini outperforms current state-of-the-art SLMJs: a) Overall task-average performance, comparing Atla Selene Mini (black) with the best and most widely used SLMJs. b) Breakdown of performance by task type and benchmark.
</p>
</div>
</div>
</section>
{/* Methods */}
<section className="mb-16">
<h2 className="text-2xl font-bold mb-4">Methods</h2>
<div className="prose max-w-none">
<p className="mb-4">
Selene Mini is optimized for fast inference, high performance, and promptability. It is a general-purpose evaluator trained to respond with both critiques and judgments in order to deliver actionable insights.
</p>
<div className="my-8">
<img src="/api/placeholder/800/400" alt="Figure 2: Data curation strategy" className="w-full rounded-lg shadow-lg"/>
<p className="text-sm text-gray-600 mt-2">
Figure 2: Data curation strategy: The process of transforming a candidate dataset (left) into the final training mix (right). Yellow boxes indicate filtering steps; purple boxes represent synthetic generation of chosen and rejected pairs for preference optimization.
</p>
</div>
</div>
</section>
{/* Results */}
<section className="mb-16">
<h2 className="text-2xl font-bold mb-4">Results</h2>
<div className="prose max-w-none">
<h3 className="text-xl font-semibold mb-3">Benchmark Performance</h3>
<p className="mb-4">
We assess the performance of Selene Mini on 11 out-of-distribution benchmarks, spanning three different types of evaluation tasks: absolute scoring, classification, and pairwise preference.
</p>
<div className="my-8">
<img src="/api/placeholder/800/400" alt="Figure 3: Real-world evaluation" className="w-full rounded-lg shadow-lg"/>
<p className="text-sm text-gray-600 mt-2">
Figure 3: Real-world evaluation: a) Performance on domain-specific industry benchmarks. b) Performance on RewardBench with different prompt formats. c) Performance measured by Elo scores in Judge Arena.
</p>
</div>
</div>
</section>
{/* Discussion */}
<section className="mb-16">
<h2 className="text-2xl font-bold mb-4">Discussion</h2>
<div className="prose max-w-none">
<p className="mb-4">
In this work, we introduce Atla Selene Mini, demonstrating that effective general-purpose evaluation can be achieved in smaller model architectures through principled data curation and a hybrid training objective (DPO + SFT).
</p>
<p className="mb-4">
Looking ahead, we anticipate two emerging frontiers that will shape the future of AI evaluation. First is the rise of agent-based systems that combine language models with external tools and APIs, creating more powerful and versatile AI systems. Second is the increasing use of inference-time compute – systems that perform additional reasoning steps during inference to generate higher-quality outputs.
</p>
</div>
</section>
</main>
{/* Footer */}
<footer className="bg-gray-50 py-8">
<div className="container mx-auto px-4 max-w-4xl text-center text-sm text-gray-600">
<p>© 2025 Atla AI</p>
</div>
</footer>
</div>
);
};
export default TechReport;