Ouz-G committed · verified
Commit 7ea2817 · 1 Parent(s): 6347228

Update README.md

Files changed (1):
  1. README.md +52 -25

README.md CHANGED
@@ -18,7 +18,7 @@ The model weights are freely available in FP8 to facilitate cost effective resea
18
  **Root Judge**’s performance surpasses the Llama-3.3-Instruct model and similar-sized open models on instruction following and
19
  achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.
20
 
21
- ## 1. Intended Use Cases
22
  **Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts such as:
23
  - Detecting context-grounded hallucinations, e.g. for Retrieval Augmented Generation (RAG) settings in an explainable manner, providing a justification for the score
24
  - Pairwise preference judgments due to strong evaluation instruction-following capabilities
@@ -26,9 +26,12 @@ achieves SOTA on hallucination detection compared to leading closed models, at a
26
  - Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
27
  - Privacy-focused settings that require local deployments
28
 
29
- ## 2. Performance Summary
30
 
31
- ### 2.1 Hallucination Detection (in RAG setting)
32
 
33
  📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench):
34
 
@@ -44,9 +47,9 @@ Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($)
44
 
45
  [🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)
46
 
47
- ### 2.2 Instruction Following
48
 
49
- Instruction-following performance in various diverse benchmarks compared to other open-weights judge and reward models (higher is better):
50
 
51
  Rank | Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
52
  | ---|--------------|------------|--------|---------|--------------|--------------|------------|------------|--------------------|
@@ -60,7 +63,22 @@ Rank | Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Objec
60
 
61
  [🔎 Detailed Performance Breakdown | Instruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)
62
 
63
- [RewardBench Generative - Unverified](https://huggingface.co/spaces/allenai/reward-bench)
64
 
65
  | Test Name | Score | Total | Accuracy |
66
  |------------------------|-------|-------|-----------|
@@ -92,18 +110,6 @@ Rank | Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Objec
92
  | Safety | | | 0.83986486|
93
  | Reasoning | | | 0.88103618|
94
 
95
- Root Judge outperforms most leading closed models when detecting instruction following failures on evaluations
96
- while providing detailed, structured justifications on long inputs of up to 32k tokens on internal benchmarks and halubench public.
97
-
98
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/rHq5RakEPkOlnC69MOl1e.png)
99
- Image 1: Root Signals internal hallucination benchmark. Total pass@1 rates and consistency (delta) assessed via ensemble of leading 3rd party models.
100
-
101
-
102
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/zfsh6HTbYH1HpLItWgq8u.png)
103
- Image 2: Root Signals internal hallucination benchmark. Custom rubric instruction following by high level task.
104
-
105
- Root Judge was tested to support complex, user-defined rating rubrics over large context sizes,
106
- provide granular qualitative feedback, and support structured evaluation outputs and tool calling.
107
 
108
  Despite our main focus on nuanced and transparent judgement of candidate responses,
109
  we test the judge model checkpoints extensively on public and private benchmarks,
@@ -111,11 +117,34 @@ to avoid known issues with performance drops such as catastrophic forgetting and
111
  the model preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
112
  while also slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.
113
 
118
- ## 3. Getting Started
119
 
120
  We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use, together with *XML tags* for important sections in your prompt. At least 96GB of VRAM is recommended.
121
  While the model runs on 80GB VRAM, the effective context size (around 7k total tokens) will be too low for evaluating most RAG inputs.
@@ -154,25 +183,23 @@ docker run \
154
  --block_size 16 \
155
  ```
156
 
157
- The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, for no additional cost
158
-
159
- ## 4. Model Details
160
 
161
- ### 4.1 Overview
162
 
163
  - **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
164
  - **Model type:** Text-Only Decoder Transformer
165
  - **Language(s) (NLP):** Primarily English
166
  - **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct
167
 
168
- ### 4.2 Training Details
169
 
170
  - **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed-precision on 384 GPUs
171
  - **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
172
  - **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
173
  - **Compute Region:** Finland
174
 
175
- ## 5. Contact
176
 
177
178
 
18
  **Root Judge**’s performance surpasses the Llama-3.3-Instruct model and similar-sized open models on instruction following and
19
  achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.
20
 
21
+ # 1. Intended Use Cases
22
  **Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts such as:
23
  - Detecting context-grounded hallucinations, e.g. for Retrieval Augmented Generation (RAG) settings in an explainable manner, providing a justification for the score
24
  - Pairwise preference judgments due to strong evaluation instruction-following capabilities
 
26
  - Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
27
  - Privacy-focused settings that require local deployments
28
 
29
+ # 2. Performance Summary
30
 
31
+ **Root Judge** outperforms leading closed models at detecting instruction-following failures in evaluations,
32
+ while providing detailed, structured justifications on long inputs of up to 32k tokens, on internal benchmarks and the public HaluBench.
33
+
34
+ ## 2.1 Hallucination Detection (in RAG setting)
35
 
36
  📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench):
37
 
 
47
 
48
  [🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)
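
As a rough illustration of the metric reported above: pass@1 on a HaluBench-style test set is simply the fraction of samples the judge labels correctly in a single attempt. The sketch below uses hypothetical function names and toy data, not the actual evaluation harness:

```python
# Hypothetical sketch of a HaluBench-style pass@1 computation.
# `predictions` are the judge's verdicts ("is this response hallucinated?"),
# `labels` the gold annotations; names and data are illustrative only.

def pass_at_1(predictions: list[bool], labels: list[bool]) -> float:
    """Fraction of samples where the judge's verdict matches the gold label."""
    assert len(predictions) == len(labels)
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)

preds = [True, False, False, True, True]
gold = [True, False, True, True, True]
print(f"Pass@1: {pass_at_1(preds, gold):.1%}")  # 4 of 5 verdicts match -> "Pass@1: 80.0%"
```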
49
 
50
+ ## 2.2 Instruction Following
51
 
52
+ 📊 Instruction-following performance across a diverse set of benchmarks, compared to other open-weights judge and reward models (higher is better):
53
 
54
  Rank | Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
55
  | ---|--------------|------------|--------|---------|--------------|--------------|------------|------------|--------------------|
 
63
 
64
  [🔎 Detailed Performance Breakdown | Instruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)
65
 
66
+ ## 2.3 Root Signals Internal Benchmarks
67
+
68
+ 📊 Benchmark: Root Signals Internal Hallucination Detection Benchmark
69
+
70
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/rHq5RakEPkOlnC69MOl1e.png)
71
+ *Image 1: Total pass@1 rates and consistency (delta) assessed via ensemble of leading 3rd party models.*
72
+
73
+
74
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/zfsh6HTbYH1HpLItWgq8u.png)
75
+ *Image 2: Custom rubric instruction-following by high-level task.*
76
+
77
+ **Root Judge** was tested to support complex, user-defined scoring (rating) rubrics over large context sizes. It provides granular qualitative feedback and supports structured evaluation outputs as well as tool calling.
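
To illustrate what such rubric-driven, structured evaluation can look like in practice, here is a minimal sketch of one possible layout: a rubric prompt marked up with XML tags plus a parser for a JSON verdict. The tag names and the JSON schema are our own assumptions, not a format Root Judge requires:

```python
import json

# Illustrative sketch only: tag names and the verdict schema are assumptions.
RUBRIC_PROMPT = """\
<rubric>
Score 1-5 for faithfulness to the provided context.
5 = fully grounded; 1 = contradicts the context.
</rubric>
<context>
{context}
</context>
<response>
{response}
</response>
Return JSON: {{"score": <int>, "justification": "<string>"}}"""

def build_prompt(context: str, response: str) -> str:
    """Fill the rubric template with the sample to be judged."""
    return RUBRIC_PROMPT.format(context=context, response=response)

def parse_verdict(raw: str) -> dict:
    """Parse and sanity-check the judge's structured JSON verdict."""
    verdict = json.loads(raw)
    assert 1 <= verdict["score"] <= 5
    return verdict

prompt = build_prompt("The meeting is on Tuesday.", "The meeting is on Friday.")
verdict = parse_verdict('{"score": 1, "justification": "Contradicts the context date."}')
print(verdict["score"])  # prints 1
```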
78
+
79
+ ## 2.4 Other Benchmarks
80
+
81
+ 📊 Benchmark: [RewardBench](https://huggingface.co/spaces/allenai/reward-bench)
82
 
83
  | Test Name | Score | Total | Accuracy |
84
  |------------------------|-------|-------|-----------|
 
110
  | Safety | | | 0.83986486|
111
  | Reasoning | | | 0.88103618|
112
 
113
 
114
  Despite our main focus on nuanced and transparent judgement of candidate responses,
115
  we test the judge model checkpoints extensively on public and private benchmarks,
 
117
  to avoid known issues with performance drops such as catastrophic forgetting, and the model preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
118
  while also slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.
119
 
120
+ # 3. Getting Started
121
 
122
+ ## 3.1 Via Root Signals Python SDK
123
 
124
+ The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.
125
 
126
+ Install our Python library:
127
+ ```bash
128
+ pip install root-signals
129
+ ```
130
 
131
+ Create a custom judge with your own instructions and run an evaluation:
132
+ ```python
+ # `client` is an authenticated Root Signals SDK client (see the SDK docs)
+ my_custom_judge = client.evaluators.create(
+     name="Political Text Evaluator",
+     intent="To measure the politics-relatedness of a given text",
+     predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
+     model="RootJudge",
+ )
+
+ result = my_custom_judge.run(
+     response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
+ )
+ print(result.score)  # normalized score in [0, 1]
+ print(result.justification)  # detailed reasoning for the score
+ ```
146
+
147
+ ## 3.2 Locally
148
 
149
  We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use, together with *XML tags* for important sections in your prompt. At least 96GB of VRAM is recommended.
150
  While the model runs on 80GB VRAM, the effective context size (around 7k total tokens) will be too low for evaluating most RAG inputs.
 
183
  --block_size 16 \
184
  ```
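
Once the server is up, an SGLang deployment exposes an OpenAI-compatible chat endpoint, so the judge can be queried with plain HTTP. The sketch below is a minimal example under assumed defaults (port 30000, served model name `root-signals/RootJudge`, a PASS/FAIL instruction), not an official client:

```python
import json
import urllib.request

# Assumed default SGLang endpoint for a local deployment.
BASE_URL = "http://localhost:30000/v1/chat/completions"

def make_payload(context: str, response: str) -> dict:
    # XML tags mark the important prompt sections, as recommended above.
    user_msg = (
        f"<context>\n{context}\n</context>\n"
        f"<response>\n{response}\n</response>\n"
        "Is the response grounded in the context? Answer PASS or FAIL with a justification."
    )
    return {
        "model": "root-signals/RootJudge",  # assumed served model name
        "messages": [{"role": "user", "content": user_msg}],
        "temperature": 0,
    }

def judge(context: str, response: str) -> str:
    """Send one judging request to the local server and return the verdict text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(make_payload(context, response)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```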
185
 
186
+ # 4. Model Details
187
 
188
+ ## 4.1 Overview
189
 
190
  - **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
191
  - **Model type:** Text-Only Decoder Transformer
192
  - **Language(s) (NLP):** Primarily English
193
  - **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct
194
 
195
+ ## 4.2 Training Details
196
 
197
  - **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed-precision on 384 GPUs
198
  - **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
199
  - **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
200
  - **Compute Region:** Finland
201
 
202
+ # 5. Contact
203
 
204
205