Ouz-G committed on
Commit a8441c4 · verified · 1 Parent(s): e3be3a2

readme update

Files changed (1): README.md (+37, -20)
README.md CHANGED

Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

**Root Judge**'s performance surpasses the Llama-3.3-Instruct model and similarly sized open models on instruction following, and achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

## 1. Intended Use Cases

**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts, such as:
- Detecting context-grounded hallucinations, e.g. in Retrieval-Augmented Generation (RAG) settings, in an explainable manner that provides a justification for the score (see the sketch below)
- Pairwise preference judgments, due to strong evaluation instruction-following capabilities
- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
- Privacy-focused settings that require local deployments
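
A minimal sketch of the hallucination-detection use case, assuming the model is served behind an OpenAI-compatible endpoint (e.g. a local SGLang server, see Getting Started below); the base URL, served model name, and JSON verdict format are illustrative assumptions, not a documented API:

```python
# Hypothetical sketch: judge whether a RAG answer is grounded in its context.
# Assumes a local OpenAI-compatible server; the base_url, model name, and
# verdict schema below are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def judge_groundedness(context: str, question: str, answer: str) -> str:
    # XML tags mark the important prompt sections, as recommended in
    # "Getting Started" below.
    prompt = (
        "You are an evaluator. Decide whether the answer is fully grounded "
        'in the context. Reply as JSON with keys "score" (0 or 1) and '
        '"justification".\n\n'
        f"<context>\n{context}\n</context>\n\n"
        f"<question>\n{question}\n</question>\n\n"
        f"<answer>\n{answer}\n</answer>"
    )
    response = client.chat.completions.create(
        model="root-judge",  # placeholder name for the locally served model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,  # sampling temperature; t=0.6 was used in the HaluBench runs
    )
    return response.choices[0].message.content

print(judge_groundedness(
    context="The Eiffel Tower is 330 metres tall.",
    question="How tall is the Eiffel Tower?",
    answer="It is about 500 metres tall.",
))
```

The same pattern extends to pairwise preference or Best-of-N judging: place the candidate answers in separate tagged sections and ask for a choice plus justification.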

## 2. Performance Summary

### 2.1 Hallucination Detection (in RAG setting)

📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench)

| Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($) |
| --- | --- | --- | --- | --- |
| **1** | **Root Judge** (FP8) | 14900 | **86.3** | **34** |
| 2 | GPT-4o | 14900 | 86.1 | - |
| 3 | o1-preview | 14899 | 85.3 | 1062 |
| 4 | Claude Sonnet-3.5 | 14797 | 85.2 | - |
| 5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 34 |
| 6 | o1-mini | 14655 | 83.7 | 156 |
| 7 | Llama3.1-405b-Instruct | 14881 | 83.6 | - |

[🔎 Detailed Performance Breakdown](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)

### 2.2 Instruction Following

Instruction following compared to open-weights judge and reward models:

| Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |

[...]

| Safety | | | 0.83986486 |
| Reasoning | | | 0.88103618 |

Root Judge outperforms most leading closed models at detecting instruction-following failures in evaluations, while providing detailed, structured justifications on long inputs of up to 32k tokens, on both internal benchmarks and the public HaluBench.

[...] to avoid known issues with performance drops such as catastrophic forgetting, and preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization, while also slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

## 3. Getting Started

We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use, together with *XML tags* for the important sections of your prompt. At least 96 GB of VRAM is recommended: while the model runs on 80 GB of VRAM, the effective context size (around 7k total tokens) will be too low for evaluating most RAG inputs.
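
A minimal launch sketch, assuming SGLang's standard `sglang.launch_server` entry point; the model path below is a placeholder for wherever the FP8 weights live locally, not an official identifier:

```python
# Hypothetical sketch: start a local SGLang server for the judge model.
# The launcher module and flags are SGLang's standard CLI; the model path
# is a placeholder assumption.
import subprocess

subprocess.run([
    "python3", "-m", "sglang.launch_server",
    "--model-path", "/models/root-judge-fp8",  # placeholder local path
    "--host", "0.0.0.0",
    "--port", "30000",  # matches the base_url in the judging sketch above
])
```

Once up, the server exposes an OpenAI-compatible API, which is what the judging sketch in section 1 targets.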
 
[...]

The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

## 4. Model Details

### 4.1 Overview

- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

### 4.2 Training Details

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision, on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland
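
For reference, the IPO objective (Azar et al., 2023) replaces DPO's logistic loss with a squared regression target on the preference log-likelihood ratio; with policy $\pi_\theta$, frozen reference $\pi_{\mathrm{ref}}$, preferred/rejected completions $y_w, y_l$, and regularization strength $\beta$:

$$
\mathcal{L}_{\mathrm{IPO}}(\theta) = \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \left( \log \frac{\pi_\theta(y_w \mid x)\, \pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\, \pi_{\mathrm{ref}}(y_w \mid x)} - \frac{1}{2\beta} \right)^2 \right]
$$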

## 5. Contact

[...]