Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

**Root Judge** surpasses Llama-3.3-70B-Instruct and similarly sized open models on instruction following, and
achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.
## 1. Intended Use Cases

**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in contexts such as:

- Detecting context-grounded hallucinations, e.g. in Retrieval-Augmented Generation (RAG) settings, in an explainable manner that provides a justification for the score
- Pairwise preference judgments, thanks to strong evaluation instruction-following capabilities
- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
- Privacy-focused settings that require local deployments
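The first use case above can be sketched as a small judge-prompt builder. The tag names, verdict labels, and wording below are illustrative assumptions, not Root Judge's official prompt format:

```python
def build_judge_prompt(context: str, answer: str) -> str:
    """Compose a context-grounded hallucination judge prompt.

    XML-style tags mark the important prompt sections; the exact
    tag names and the PASS/FAIL verdict labels are illustrative.
    """
    return (
        "You are an impartial judge. Decide whether the answer is fully "
        "supported by the context. Reply with PASS or FAIL, followed by "
        "a short justification.\n"
        "<context>\n" + context + "\n</context>\n"
        "<answer>\n" + answer + "\n</answer>"
    )


# Example: an answer that contradicts its context should be judged FAIL.
prompt = build_judge_prompt(
    context="The Eiffel Tower is about 330 metres tall.",
    answer="The Eiffel Tower is 500 metres tall.",
)
print(prompt)
```

The justification requested in the instructions is what makes the verdict explainable rather than a bare label.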

## 2. Performance Summary

### 2.1 Hallucination Detection (in RAG setting)

📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench)

| Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($) |
| --- | --- | --- | --- | --- |
| **1** | **Root Judge** (FP8) | 14900 | **86.3** | **34** |
| 2 | GPT-4o | 14900 | 86.1 | - |
| 3 | o1-preview | 14899 | 85.3 | 1062 |
| 4 | Claude Sonnet-3.5 | 14797 | 85.2 | - |
| 5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 34 |
| 6 | o1-mini | 14655 | 83.7 | 156 |
| 7 | Llama3.1-405b-Instruct | 14881 | 83.6 | - |

[🔎 Detailed Performance Breakdown](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)

### 2.2 Instruction Following

Instruction following compared to open-weights judge and reward models:

| Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Safety | | | 0.83986486 |
| Reasoning | | | 0.88103618 |

Root Judge outperforms most leading closed models when detecting instruction-following failures in evaluations,
while providing detailed, structured justifications on long inputs of up to 32k tokens, on both internal benchmarks and the public HaluBench test set.

Root Judge preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
while also slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

## 3. Getting Started

We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use, together with *XML tags* marking the important sections of your prompt. At least 96GB of VRAM is recommended:
while the model runs on 80GB of VRAM, the effective context size (around 7k total tokens) will be too low for evaluating most RAG inputs.
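To sketch what a request looks like: SGLang exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so a judge call can be assembled as a plain JSON body. The model name and the XML tags in the prompt below are illustrative assumptions; adapt them to your deployment:

```python
import json

# Chat request body for an OpenAI-compatible /v1/chat/completions endpoint,
# such as the one an SGLang server exposes. The model name ("root-judge")
# is an assumed alias set at server launch, not an official identifier.
payload = {
    "model": "root-judge",
    "temperature": 0.6,
    "messages": [
        {
            "role": "user",
            "content": (
                "<instructions>\n"
                "Judge whether the answer is grounded in the context.\n"
                "</instructions>\n"
                "<context>\nParis is the capital of France.\n</context>\n"
                "<answer>\nThe capital of France is Paris.\n</answer>"
            ),
        }
    ],
}

body = json.dumps(payload, indent=2)
print(body)
```

The body can then be POSTed with any HTTP client, e.g. to `http://localhost:30000/v1/chat/completions` (port 30000 is SGLang's usual default).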

The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

## 4. Model Details

### 4.1 Overview

- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

### 4.2 Training Details

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland

## 5. Contact