Update readme
README.md
CHANGED
@@ -16,9 +16,17 @@ Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high quality, hum
 The model weights are freely available in FP8 to facilitate cost effective research as well as commercial use.

 Root Judge's performance surpasses the Llama-3.3-Instruct model and similar sized open models on instruction following and
-achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

-

 Instruction following compared to open-weights judge and reward models:
 | Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
@@ -88,25 +96,12 @@ Image 2: Root Signals internal hallucination benchmark. Custom rubric instructio
 Root Judge was tested to support complex, user-defined rating rubrics over large context sizes,
 provide granular qualitative feedback, and support structured evaluation outputs and tool calling.

-## Intended Use Cases
-The model's primary use is as LLM-as-Judge for:
-detecting context-grounded hallucinations, e.g. for Retrieval-Augmented-Generation (RAG) in explainable manner, providing a justification for the choice
-pairwise preference judgments, that leverage strong instruction following, with custom rubrics e.g. for assisting with inference time compute or synthetic data tasks that require Best-of-N decisions.
-privacy-focused deployments, that want to avoid sending data across the public internet
-
 Despite our main focus on nuanced and transparent judgement of candidate responses,
 we test the judge model checkpoints extensively on public and private benchmarks,
 to avoid known issues with performance drops such as catastrophic forgetting and find that the model
 preserves general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization,
 while also slightly outperforming it on public instruction following benchmarks such as IFEval and MuSR

-## Model Description
-
-- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
-- **Model type:** Text-Only Decoder Transformer
-- **Language(s) (NLP):** Primarily English
-- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct
-
 ## Getting Started

 We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use together with *xml tags* for important sections in your prompt. At least 96GB of VRAM is recommended.
@@ -148,9 +143,16 @@ docker run \

 The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, for no additional cost

-##

-### Training

 - **Training regime:** DPO with IPO loss for 3 Epochs, bfloat16 mixed-precision on 384 GPUs
 - **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
 The model weights are freely available in FP8 to facilitate cost effective research as well as commercial use.

 Root Judge's performance surpasses the Llama-3.3-Instruct model and similar sized open models on instruction following and
+achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

+## Intended Use Cases
+**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts such as:
+- Detecting context-grounded hallucinations, e.g. for Retrieval Augmented Generation (RAG) settings in an explainable manner, providing a justification for the score
+- Pairwise preference judgments due to strong evaluation instruction-following capabilities
+- Serving as a custom evaluation metric powered by use case specific evaluation rubrics
+- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
+- Privacy-focused settings that require local deployments
+
+## Performance Summary

 Instruction following compared to open-weights judge and reward models:
 | Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
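A judge call for the hallucination use case above can be sketched as an OpenAI-compatible chat request. This is a minimal illustration only: the tag names, rubric wording, and model id are assumptions, not the model's required prompt format.

```python
import json

def build_judge_prompt(context: str, question: str, answer: str) -> str:
    # Wrap each important section in xml tags, as the README recommends.
    # The tag names and instructions here are illustrative.
    return (
        "You are an evaluator. Judge whether the answer is fully grounded "
        "in the context. Reply with a JSON object containing a 'score' "
        "(0 or 1) and a 'justification'.\n"
        f"<context>\n{context}\n</context>\n"
        f"<question>\n{question}\n</question>\n"
        f"<answer>\n{answer}\n</answer>"
    )

# Request body for a POST to an OpenAI-compatible /v1/chat/completions
# endpoint; the model id is a placeholder.
payload = {
    "model": "root-signals/root-judge",  # placeholder id
    "messages": [{
        "role": "user",
        "content": build_judge_prompt(
            "The Eiffel Tower is 330 m tall.",
            "How tall is the Eiffel Tower?",
            "It is 330 metres tall.",
        ),
    }],
    "temperature": 0.0,  # deterministic judging
}
body = json.dumps(payload)
```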
 Root Judge was tested to support complex, user-defined rating rubrics over large context sizes,
 provide granular qualitative feedback, and support structured evaluation outputs and tool calling.

 Despite our main focus on nuanced and transparent judgement of candidate responses,
 we test the judge model checkpoints extensively on public and private benchmarks,
 to avoid known issues with performance drops such as catastrophic forgetting and find that the model
 preserves general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization,
 while also slightly outperforming it on public instruction following benchmarks such as IFEval and MuSR

 ## Getting Started

 We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use together with *xml tags* for important sections in your prompt. At least 96GB of VRAM is recommended.
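The SGLang recommendation above can be sketched as follows; the model path, tensor-parallel degree, and port are assumptions to adjust for your checkpoint and hardware, not a tested deployment recipe.

```shell
# Launch an SGLang server for the FP8 checkpoint (placeholder model path).
# --tp shards the 70B model across GPUs to fit the ~96GB VRAM requirement.
python -m sglang.launch_server \
  --model-path root-signals/root-judge \
  --tp 2 \
  --port 30000

# Query the OpenAI-compatible endpoint, tagging prompt sections with xml tags.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user",
        "content": "<context>The sky is blue.</context><question>What color is the sky?</question><answer>Blue.</answer> Is the answer grounded in the context?"}]}'
```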

 The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, for no additional cost

+## Model Details
+
+### Overview
+
+- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
+- **Model type:** Text-Only Decoder Transformer
+- **Language(s) (NLP):** Primarily English
+- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

+### Training Details

 - **Training regime:** DPO with IPO loss for 3 Epochs, bfloat16 mixed-precision on 384 GPUs
 - **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
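The training regime names DPO with IPO loss. As a reference point only, a minimal per-pair sketch of the IPO objective as commonly written (squared regression of the log-ratio gap toward 1/(2τ)); the variable names and τ value are illustrative, not the actual training configuration.

```python
def ipo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             tau: float = 0.5) -> float:
    # h: gap between the policy-vs-reference log-ratios of the
    # chosen (preferred) and rejected completions.
    h = ((policy_logp_chosen - ref_logp_chosen)
         - (policy_logp_rejected - ref_logp_rejected))
    # IPO regresses h toward 1/(2*tau) with a squared loss, rather than
    # pushing it through a log-sigmoid as plain DPO does.
    return (h - 1.0 / (2.0 * tau)) ** 2
```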