The model weights are freely available in FP8 to facilitate cost-effective research.
**Root Judge**'s performance surpasses the Llama-3.3-Instruct model and similarly sized open models on instruction following, and
achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

# 1. Intended Use Cases

**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts, such as:
- Detecting context-grounded hallucinations, e.g. in Retrieval-Augmented Generation (RAG) settings, in an explainable manner that provides a justification for the score
- Pairwise preference judgments, due to strong evaluation instruction-following capabilities
- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
- Privacy-focused settings that require local deployments
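The Best-of-N use case can be sketched with a generic selection helper. This is a minimal illustration, not SDK code: `judge_score` is a hypothetical stand-in for a call to Root Judge, and `toy_score` is a deliberately trivial scorer.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], judge_score: Callable[[str], float]) -> str:
    """Return the candidate response that the judge scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=judge_score)


# Toy stand-in scorer: reward answers that cite the retrieved context.
def toy_score(answer: str) -> float:
    return 1.0 if "[1]" in answer else 0.0


best = best_of_n(["The sky is green.", "The sky is blue [1]."], toy_score)
print(best)  # The sky is blue [1].
```

In practice the scorer would run the judge once per candidate, so N trades answer quality against N judge calls.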

# 2. Performance Summary

**Root Judge** outperforms leading closed models when detecting instruction-following failures in evaluations,
while providing detailed, structured justifications on long inputs of up to 32k tokens, on both internal benchmarks and the public HaluBench.

## 2.1 Hallucination Detection (in a RAG Setting)

📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench):

Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($)
--- | --- | --- | --- | ---

[🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)
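As a reading aid for the Pass@1 column (a sketch, not the benchmark harness): pass@1 here is simply the fraction of test samples the judge labels correctly on a single attempt.

```python
def pass_at_1(correct: list[bool]) -> float:
    """Fraction of samples judged correctly on the first (and only) attempt."""
    if not correct:
        raise ValueError("no samples")
    return sum(correct) / len(correct)


print(pass_at_1([True, True, False, True]))  # 0.75
```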

## 2.2 Instruction Following

📊 Instruction-following performance on diverse benchmarks, compared to other open-weights judge and reward models (higher is better):

Rank | Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---

[🔎 Detailed Performance Breakdown | Instruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)

## 2.3 Root Signals Internal Benchmarks

📊 Benchmark: Root Signals Internal Hallucination Detection Benchmark



*Image 1: Total pass@1 rates and consistency (delta), assessed via an ensemble of leading third-party models.*



*Image 2: Custom rubric instruction following, by high-level task.*

**Root Judge** was tested to support complex, user-defined scoring (rating) rubrics over large context sizes. It provides granular qualitative feedback and supports structured evaluation outputs as well as tool calling.

## 2.4 Other Benchmarks

📊 Benchmark: [RewardBench](https://huggingface.co/spaces/allenai/reward-bench)

| Test Name | Score | Total | Accuracy   |
|-----------|-------|-------|------------|
| Safety    |       |       | 0.83986486 |
| Reasoning |       |       | 0.88103618 |

Despite our main focus on nuanced and transparent judgment of candidate responses,
we test the judge model checkpoints extensively on public and private benchmarks
to avoid known issues with performance drops, such as catastrophic forgetting. The model
preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
while also slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

# 3. Getting Started

## 3.1 Via the Root Signals Python SDK

The model is available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

Install our Python library:

```bash
pip install root-signals
```

Create your custom judge with custom instructions and run an evaluation:
```python
from root import RootSignals

# Initialize the SDK client (expects your Root Signals API key to be configured)
client = RootSignals()

my_custom_judge = client.evaluators.create(
    name="Political Text Evaluator",
    intent="To measure the politics-relatedness of a given text",
    predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
    model="RootJudge",
)

result = my_custom_judge.run(
    response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score)          # normalized score in [0, 1]
print(result.justification)  # detailed reasoning for the score
```
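Because the score is normalized to [0, 1], it can be turned into a pass/fail gate for a pipeline. A minimal sketch; the 0.7 threshold is an arbitrary illustration, not a recommended value:

```python
def passes(score: float, threshold: float = 0.7) -> bool:
    """Gate a normalized judge score: True if it meets the threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    return score >= threshold


print(passes(0.82))  # True
print(passes(0.41))  # False
```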

## 3.2 Locally

We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use, together with *XML tags* marking the important sections in your prompt. At least 96 GB of VRAM is recommended:
while the model runs on 80 GB of VRAM, the effective context size (around 7k total tokens) will be too low for evaluating most RAG inputs.
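The XML-tag recommendation can be sketched as follows. The tag names (`context`, `response`, `rubric`) are illustrative choices, not a schema the model requires:

```python
def build_judge_prompt(context: str, response: str, rubric: str) -> str:
    """Wrap each important prompt section in its own XML tag."""
    return (
        f"<context>\n{context}\n</context>\n\n"
        f"<response>\n{response}\n</response>\n\n"
        f"<rubric>\n{rubric}\n</rubric>"
    )


prompt = build_judge_prompt(
    context="Paris is the capital of France.",
    response="The capital of France is Paris.",
    rubric="Score 1 if the response is grounded in the context, else 0.",
)
print(prompt.startswith("<context>"))  # True
```

Explicit tags make it unambiguous which part of a long RAG input is evidence and which part is being judged.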

```bash
docker run \
  ... \
  --block_size 16 \
```

# 4. Model Details

## 4.1 Overview

- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

## 4.2 Training Details

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland

# 5. Contact
205 |