The model weights are freely available in FP8 to facilitate cost-effective research.
**Root Judge**'s performance surpasses the Llama-3.3-Instruct model and similarly sized open models on instruction following, and
achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

# 1. Intended Use Cases

**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts, such as:
- Detecting context-grounded hallucinations, e.g. in Retrieval-Augmented Generation (RAG) settings, in an explainable manner that provides a justification for the score
- Pairwise preference judgments, due to strong evaluation instruction-following capabilities
- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
- Privacy-focused settings that require local deployments
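The Best-of-N use case can be sketched with a generic selection helper. This is a minimal illustration, not SDK code: `judge_score` is a hypothetical stand-in for a call to Root Judge, and `toy_score` is a deliberately trivial scorer.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], judge_score: Callable[[str], float]) -> str:
    """Return the candidate response that the judge scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=judge_score)


# Toy stand-in scorer: reward answers that cite the retrieved context.
def toy_score(answer: str) -> float:
    return 1.0 if "[1]" in answer else 0.0


best = best_of_n(["The sky is green.", "The sky is blue [1]."], toy_score)
print(best)  # The sky is blue [1].
```

In practice the scorer would run the judge once per candidate, so N trades answer quality against N judge calls.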

# 2. Performance Summary

**Root Judge** outperforms leading closed models when detecting instruction-following failures in evaluations,
while providing detailed, structured justifications on long inputs of up to 32k tokens, on both internal benchmarks and the public HaluBench.

## 2.1 Hallucination Detection (in a RAG Setting)

📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench):

Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($)
--- | --- | --- | --- | ---

[🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)
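As a reading aid for the Pass@1 column (a sketch, not the benchmark harness): pass@1 here is simply the fraction of test samples the judge labels correctly on a single attempt.

```python
def pass_at_1(correct: list[bool]) -> float:
    """Fraction of samples judged correctly on the first (and only) attempt."""
    if not correct:
        raise ValueError("no samples")
    return sum(correct) / len(correct)


print(pass_at_1([True, True, False, True]))  # 0.75
```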

## 2.2 Instruction Following

📊 Instruction-following performance on diverse benchmarks, compared to other open-weights judge and reward models (higher is better):

Rank | Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---

[🔎 Detailed Performance Breakdown | Instruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)

## 2.3 Root Signals Internal Benchmarks

📊 Benchmark: Root Signals Internal Hallucination Detection Benchmark



*Image 1: Total pass@1 rates and consistency (delta), assessed via an ensemble of leading third-party models.*



*Image 2: Custom rubric instruction following, by high-level task.*

**Root Judge** was tested to support complex, user-defined scoring (rating) rubrics over large context sizes. It provides granular qualitative feedback and supports structured evaluation outputs as well as tool calling.

## 2.4 Other Benchmarks

📊 Benchmark: [RewardBench](https://huggingface.co/spaces/allenai/reward-bench)

| Test Name | Score | Total | Accuracy   |
|-----------|-------|-------|------------|
| Safety    |       |       | 0.83986486 |
| Reasoning |       |       | 0.88103618 |

Despite our main focus on nuanced and transparent judgment of candidate responses,
we test the judge model checkpoints extensively on public and private benchmarks
to avoid known issues with performance drops, such as catastrophic forgetting. The model
preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
while also slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

# 3. Getting Started

## 3.1 Via the Root Signals Python SDK

The model is available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

Install our Python library:

```bash
pip install root-signals
```

Create your custom judge with custom instructions and run an evaluation:
```python
from root import RootSignals

# Initialize the SDK client (expects your Root Signals API key to be configured)
client = RootSignals()

my_custom_judge = client.evaluators.create(
    name="Political Text Evaluator",
    intent="To measure the politics-relatedness of a given text",
    predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
    model="RootJudge",
)

result = my_custom_judge.run(
    response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score)          # normalized score in [0, 1]
print(result.justification)  # detailed reasoning for the score
```
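Because the score is normalized to [0, 1], it can be turned into a pass/fail gate for a pipeline. A minimal sketch; the 0.7 threshold is an arbitrary illustration, not a recommended value:

```python
def passes(score: float, threshold: float = 0.7) -> bool:
    """Gate a normalized judge score: True if it meets the threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    return score >= threshold


print(passes(0.82))  # True
print(passes(0.41))  # False
```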

## 3.2 Locally

We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use, together with *XML tags* marking the important sections in your prompt. At least 96 GB of VRAM is recommended:
while the model runs on 80 GB of VRAM, the effective context size (around 7k total tokens) will be too low for evaluating most RAG inputs.
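The XML-tag recommendation can be sketched as follows. The tag names (`context`, `response`, `rubric`) are illustrative choices, not a schema the model requires:

```python
def build_judge_prompt(context: str, response: str, rubric: str) -> str:
    """Wrap each important prompt section in its own XML tag."""
    return (
        f"<context>\n{context}\n</context>\n\n"
        f"<response>\n{response}\n</response>\n\n"
        f"<rubric>\n{rubric}\n</rubric>"
    )


prompt = build_judge_prompt(
    context="Paris is the capital of France.",
    response="The capital of France is Paris.",
    rubric="Score 1 if the response is grounded in the context, else 0.",
)
print(prompt.startswith("<context>"))  # True
```

Explicit tags make it unambiguous which part of a long RAG input is evidence and which part is being judged.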

```bash
docker run \
  ... \
  --block_size 16 \
```

# 4. Model Details

## 4.1 Overview

- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

## 4.2 Training Details

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland

# 5. Contact
205 |