---
license: llama3.3
language:
- en
base_model:
- meta-llama/Llama-3.3-70B-Instruct
pipeline_tag: text-generation
tags:
- llm-as-judge
- evaluation
---
# Model Card for RootSignals-Judge-Llama-70B

**Root Judge** is a powerful mid-sized LLM that enables reliable and customizable LLM system evaluations. 
Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

**Root Judge** surpasses Llama-3.3-70B-Instruct and similarly sized open models on instruction following, and 
achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

# 1. Intended Use Cases
**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts such as:
- Detecting context-grounded hallucinations, e.g. for *Retrieval Augmented Generation* (RAG) settings in an explainable manner, providing a justification for the score
- Pairwise preference judgments due to strong evaluation instruction-following capabilities
- Serving as a custom evaluation metric powered by use case specific evaluation rubrics
- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
- Privacy-focused settings that require local deployments

# 2. Performance Summary

**Root Judge** outperforms leading closed models at detecting instruction-following failures in evaluations, 
while providing detailed, structured justifications on long inputs of up to 32k tokens, on both our internal benchmarks and the public HaluBench.

## 2.1 Hallucination Detection (in RAG setting)

📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench):

| Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($) |
| --- | --- | --- | --- | --- |
| **1** | **Root Judge** | 14900 | **86.3** | **3.98** |
| 2 | GPT-4o | 14900 | 86.1 | 33.12 |
| 3 | o1-preview | 14899 | 85.3 | 1062* |
| 4 | Claude Sonnet-3.5 | 14797 | 85.2 | 42.94 |
| 5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 27.43 |
| 6 | o1-mini | 14655 | 83.7 | 156 |
| 7 | Llama3.1-405b-Instruct | 14881 | 83.6 | 269.82 |

`*` = benchmarked as o1-preview; at current o1 prices, without reasoning tokens, the cost would start at $198.74 instead.  
Local costs are based on Lambda Labs instances at January 2025 prices.

[🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)

## 2.2 Instruction Following

📊 Instruction-following performance in various diverse benchmarks compared to other open-weights judge and reward models (higher is better):

| Rank | Model | VRAM (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **1** | **Root Judge** | 70 | **94.6 ± 0.6** | **93.9** | 52.8 ± 3.2 | 24.6 ± 2.7 | **56.8 ± 3.1** | **64.5** | 100 |
| 2 | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.4 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
| 3 | Patronus-70B | 140 | 91.7 ± 0.8 | 83.7 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
| 4 | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.0 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
| 5 | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.5 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
| 6 | Flow Judge | 16 | 78.7 ± 1.1 | 64.6 | **60.8 ± 3.1** | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
| 7 | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | **35.9 ± 3.0** | 43.2 ± 3.1 | 54.7 | 84.8 |

[🔎 Detailed Performance Breakdown | Instruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)

## 2.3 Root Signals Internal Benchmarks

📊 Benchmark: Root Signals Internal Hallucination Detection Benchmark

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/rHq5RakEPkOlnC69MOl1e.png)
*Image 1: Total pass@1 rates and consistency (delta) assessed via ensemble of leading 3rd party models.*


![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/zfsh6HTbYH1HpLItWgq8u.png)
*Image 2: Custom rubric instruction-following by high level task.*

**Root Judge** was tested to support complex, user-defined scoring (rating) rubrics over large context sizes. It provides granular qualitative feedback and supports structured evaluation outputs as well as tool calling (see the structured-output example in Section 3.2).

## 2.4 Other Benchmarks

<details>
  <summary>📊 RewardBench</summary>

[RewardBench](https://huggingface.co/spaces/allenai/reward-bench)
  
| Benchmark Task         | Score | Total | Accuracy  |
|------------------------|-------|-------|-----------|
| alpacaeval-easy        | 99.0  | 100   | 0.99      |
| alpacaeval-hard        | 93.0  | 95    | 0.97894737|
| alpacaeval-length      | 86.0  | 95    | 0.90526316|
| donotanswer            | 73.5  | 136   | 0.54044118|
| hep-cpp                | 159.0 | 164   | 0.96951220|
| hep-go                 | 159.0 | 164   | 0.96951220|
| hep-java               | 161.0 | 164   | 0.98170732|
| hep-js                 | 159.0 | 164   | 0.96951220|
| hep-python             | 158.0 | 164   | 0.96341463|
| hep-rust               | 152.0 | 164   | 0.92682927|
| llmbar-adver-GPTInst   | 69.0  | 92    | 0.75      |
| llmbar-adver-GPTOut    | 39.0  | 47    | 0.82978723|
| llmbar-adver-manual    | 32.0  | 46    | 0.69565217|
| llmbar-adver-neighbor  | 74.0  | 134   | 0.55223881|
| llmbar-natural         | 94.0  | 100   | 0.94      |
| math-prm               | 357.0 | 447   | 0.79865772|
| mt-bench-easy          | 28.0  | 28    | 1.0       |
| mt-bench-hard          | 32.0  | 37    | 0.86486486|
| mt-bench-med           | 40.0  | 40    | 1.0       |
| refusals-dangerous     | 73.5  | 100   | 0.735     |
| refusals-offensive     | 89.0  | 100   | 0.89      |
| xstest-should-refuse   | 140.5 | 154   | 0.91233766|
| xstest-should-respond  | 245.0 | 250   | 0.98      |
| Chat                   |       |       | 0.96648045|
| Chat Hard              |       |       | 0.74561404|
| Safety                 |       |       | 0.83986486|
| Reasoning              |       |       | 0.88103618|

</details>

Despite our main focus on nuanced and transparent judgement of candidate responses, 
we test the judge model checkpoints extensively on public and private benchmarks 
to avoid known issues such as catastrophic forgetting. We find that the model 
preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization, 
while slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

# 3. Getting Started

## 3.1 Via Root Signals Python SDK

The model is available on our [platform](https://app.rootsignals.ai/register?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals) as part of our evaluation suite, at no additional cost.

Install our [python library](https://github.com/root-signals/rs-python-sdk):
```bash
pip install root-signals
```

Import:
```python
from root import RootSignals
client = RootSignals()
```
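
When constructed without arguments, the client picks up your Root Signals API key from the environment. A minimal sketch of passing the key explicitly instead, assuming the constructor accepts an `api_key` argument (check the [SDK docs](https://sdk.rootsignals.ai/en/latest/quickstart.html) for the exact configuration):
```python
# Assumption: an explicit api_key argument is accepted; by default the
# key is read from the environment (see the SDK quickstart).
client = RootSignals(api_key="<your-api-key>")
```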

Create a custom evaluator powered by **Root Judge**:
```python
my_custom_judge = client.evaluators.create(
    name="Political Text Evaluator",
    intent="To measure the politics-relatedness of a given text",
    predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
    model="RootJudge",
)
```

Execute:
```python
result = my_custom_judge.run(
    response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score)  # normalized score in [0, 1]
print(result.justification)  # detailed reasoning for the score
```
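
The `{{response}}` placeholder in the `predicate` is filled with whatever you pass to `run`, so the same evaluator can be reused across a batch of candidates. A minimal sketch using only the calls shown above:
```python
# Score several candidate texts with the same evaluator.
candidates = [
    "The central bank held interest rates steady this quarter.",
    "Whisk two eggs with a cup of flour until smooth.",
]
for text in candidates:
    result = my_custom_judge.run(response=text)
    print(f"{result.score:.2f} | {text}")
```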

## 3.2 Locally

We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use cases, together with *XML tags* for important sections in your prompt. While the model can run on 80 GB of VRAM, we recommend at least 96 GB for evaluating long-context RAG inputs.

SGLang example for a single Nvidia H100 (80GB):
```bash
docker run \
   --gpus all \
   --ipc=host  \
   -p 8000:8000 \
   -v huggingface:/root/.cache/huggingface \
   --volume /etc/localtime:/etc/localtime:ro \
   -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
   python3 -m sglang.launch_server \
   --model-path root-signals/RootSignals-Judge-Llama-70B \
   --host 0.0.0.0 \
   --port 8000 \
   --mem-fraction-static 0.89 \
   --grammar-backend xgrammar \
   --enable-torch-compile \
   --disable-cuda-graph
```

We also validated the model with [vLLM](https://github.com/vllm-project/vllm) on arm64 on an Nvidia GH200, with outputs of up to 64k tokens: 
```bash
docker run \
   --gpus all \
   --ipc=host  \
   -p 8000:8000 \
   -v huggingface:/root/.cache/huggingface \
   --volume /etc/localtime:/etc/localtime:ro \
   -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
   --model root-signals/RootSignals-Judge-Llama-70B \
   --gpu-memory-utilization 0.95 \
   --max-model-len 65536 \
   --block_size 16 \
   --enable_prefix_caching
```
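
Both servers expose an OpenAI-compatible HTTP API, so a quick sanity check before running evaluations is to list the served models:
```bash
# Should return a JSON payload whose `data` array contains the model id
curl -s http://localhost:8000/v1/models
```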

Detect hallucinations from context; this example uses HaluBench:
```python
decompose_system_instruction = """
<TASK>
You are a fair judge that detects hallucinations and unjustified assumptions from question-document-answer triplets provided by the user. 
Always follow the instructions below and provide your reasoning and verdict in the format specified.
</TASK>

<INSTRUCTIONS>
#1. Identify key elements in the question.
#2. List all relevant facts provided in the document.
#3. Break down the answer into its component claims.
#4. For each claim in the answer:
#a. Is it explicitly supported by the document? If yes, quote the relevant part.
#b. Is it a reasonable inference from the document? If yes, explain the reasoning.
#c. Is it unsupported or contradicted by the document? If yes, explain why.
#5. Check for any information in the answer that's present in the question but not in the document.
#6. Verify that no additional information is introduced in the answer that isn't in the document or question.
#7. Assess if the answer makes any unjustified connections or assumptions.
</INSTRUCTIONS>

<OUTPUT_EXAMPLE>
{"REASONING": "Your reasoning here where you cite the instruction step by number and provide your reasoning", "VERDICT": "PASS" or "FAIL"}
</OUTPUT_EXAMPLE>
"""

decompose_prompt = """
<QUESTION>: {question} </QUESTION>
<DOCUMENT>: {document} </DOCUMENT>
<ANSWER>: {answer} </ANSWER>
""".strip()

import pandas as pd
from openai import OpenAI
from pprint import pprint
from pydantic import BaseModel

testset_df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")
testset_df = testset_df.sample(frac=1).reset_index(drop=True)
example_row = testset_df.iloc[0]

class DecomposeResponse(BaseModel):
    REASONING: str
    VERDICT: str

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local servers ignore the key; point base_url at e.g. sglang, openrouter, etc.

response = client.beta.chat.completions.parse(
    model="root-signals/RootSignals-Judge-Llama-70B",  # or `RootJudge` if you are using the RootSignals API
    messages=[
        {"role": "system", "content": decompose_system_instruction},
        {"role": "user", "content": decompose_prompt.format(
            question=example_row["question"], 
            document=example_row["passage"], 
            answer=example_row["answer"])},
    ],
    response_format=DecomposeResponse,
).choices[0].message.parsed

pprint(response.REASONING)
pprint(response.VERDICT)
```

```
> ('Following the instructions: #1, the key element in the question is the '
 "nationality of the magazines. #2, the document states that 'The Woman's "
 "Viewpoint was a woman's magazine founded in Texas in 1923' and 'Pick Me Up! "
 "is a British weekly women's magazine'. #3, the answer claims both magazines "
 'are British. #4, checking each claim in the answer: a) The document does not '
 "support the claim that The Woman's Viewpoint is British, instead, it says "
 "the magazine was founded in Texas. b) There's no reasonable inference from "
 "the document that would suggest The Woman's Viewpoint is British. c) The "
 "claim about The Woman's Viewpoint is contradicted by the document. #5, the "
 'answer introduces information (both being British) not supported by the '
 'document. #6, additional information about both magazines being British is '
 'introduced in the answer without being present in the document or question. '
 '#7, the answer makes an unjustified assumption by stating both magazines are '
 "British despite the document clearly stating The Woman's Viewpoint was "
 'founded in Texas, implying it is not British. Therefore, the answer fails to '
 'accurately reflect the information provided in the document and makes '
 'unjustified assumptions based on the information given in the question and '
 "document.', ")
'FAIL'
```
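
To estimate the pass@1 rate from Section 2.1 yourself, compare the judge's verdict against HaluBench's gold labels. A minimal sketch, not the official benchmark harness, assuming the gold column is `label` with values `"PASS"`/`"FAIL"`; it reuses `client`, `testset_df`, the prompts, and `DecomposeResponse` from the example above:
```python
def judge_row(row) -> str:
    """Return the model's PASS/FAIL verdict for one HaluBench row."""
    parsed = client.beta.chat.completions.parse(
        model="root-signals/RootSignals-Judge-Llama-70B",
        messages=[
            {"role": "system", "content": decompose_system_instruction},
            {"role": "user", "content": decompose_prompt.format(
                question=row["question"],
                document=row["passage"],
                answer=row["answer"])},
        ],
        response_format=DecomposeResponse,
    ).choices[0].message.parsed
    return parsed.VERDICT

sample = testset_df.head(100)  # small sample; the full test set has ~14.9k rows
hits = sum(judge_row(row) == row["label"] for _, row in sample.iterrows())
print(f"pass@1 on sample: {hits / len(sample):.3f}")
```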

# 4. Model Details

## 4.1 Overview

- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

## 4.2 Training Details

- **Training regime:** DPO with the IPO loss (see the objective below) for 3 epochs, bfloat16 mixed precision on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland
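
For reference, the IPO objective regresses the preference log-ratio toward a fixed margin rather than passing it through a sigmoid as in standard DPO. In the usual notation, with policy \\(\pi_\theta\\), reference model \\(\pi_{\mathrm{ref}}\\), preferred and rejected responses \\(y_w, y_l\\), and regularization strength \\(\tau\\):

$$
\mathcal{L}_{\mathrm{IPO}} = \mathbb{E}_{(x,\,y_w,\,y_l)}\left[\left(\log\frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)} - \frac{1}{2\tau}\right)^{2}\right]
$$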


# 5. Contact

**Links**
- [Root Signals Homepage](https://www.rootsignals.ai/)
- [Root Signals Platform](https://app.rootsignals.ai/?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals)
- [Python SDK Docs](https://sdk.rootsignals.ai/en/latest/quickstart.html)
- [Root Signals GitHub](https://github.com/root-signals/rs-python-sdk)
- [Discord](https://discord.gg/EhazTQsFnj)

**Email**
- [email protected]