Ouz-G committed · Commit e3be3a2 · verified · 1 Parent(s): 956af36

Update readme

Files changed (1):
  1. README.md +19 -17
  1. README.md +19 -17
README.md CHANGED
@@ -16,9 +16,17 @@ Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high quality, hum
  The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

  Root Judge’s performance surpasses the Llama-3.3-Instruct model and similar sized open models on instruction following and
- achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

- ### Primary Metrics Summary

  Instruction following compared to open-weights judge and reward models:
  | Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
@@ -88,25 +96,12 @@ Image 2: Root Signals internal hallucination benchmark. Custom rubric instructio
  Root Judge was tested to support complex, user-defined rating rubrics over large context sizes,
  provide granular qualitative feedback, and support structured evaluation outputs and tool calling.

- ## Intended Use Cases
- The model's primary use is as LLM-as-Judge for:
- detecting context-grounded hallucinations, e.g. for Retrieval-Augmented-Generation (RAG) in explainable manner, providing a justification for the choice
- pairwise preference judgments, that leverage strong instruction following, with custom rubrics e.g. for assisting with inference time compute or synthetic data tasks that require Best-of-N decisions.
- privacy-focused deployments, that want to avoid sending data across the public internet

  Despite our main focus on nuanced and transparent judgement of candidate responses,
  we test the judge model checkpoints extensively on public and private benchmarks,
  to avoid known issues with performance drops such as catastrophic forgetting, and find that the model
  preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization,
  while also slightly outperforming it on public instruction following benchmarks such as IFEval and MuSR.

- ## Model Description
-
- - **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- - **Model type:** Text-Only Decoder Transformer
- - **Language(s) (NLP):** Primarily English
- - **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct
-
  ## Getting Started

  We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use together with *xml tags* for important sections in your prompt. At least 96GB of VRAM is recommended.
@@ -148,9 +143,16 @@ docker run \

  The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

- ## Training Details

- ### Training Procedure

  - **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed-precision on 384 GPUs
  - **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
 
  The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

  Root Judge’s performance surpasses the Llama-3.3-Instruct model and similar sized open models on instruction following and
+ achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

+ ## Intended Use Cases
+ **Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts, such as:
+ - Detecting context-grounded hallucinations, e.g. in Retrieval Augmented Generation (RAG) settings, in an explainable manner that provides a justification for the score
+ - Pairwise preference judgments, due to strong evaluation instruction-following capabilities
+ - Serving as a custom evaluation metric powered by use-case-specific evaluation rubrics
+ - Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
+ - Privacy-focused settings that require local deployments
+
+ ## Performance Summary

  Instruction following compared to open-weights judge and reward models:
  | Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
 
  Root Judge was tested to support complex, user-defined rating rubrics over large context sizes,
  provide granular qualitative feedback, and support structured evaluation outputs and tool calling.

  Despite our main focus on nuanced and transparent judgement of candidate responses,
  we test the judge model checkpoints extensively on public and private benchmarks,
  to avoid known issues with performance drops such as catastrophic forgetting, and find that the model
  preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization,
  while also slightly outperforming it on public instruction following benchmarks such as IFEval and MuSR.

  ## Getting Started

  We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use together with *xml tags* for important sections in your prompt. At least 96GB of VRAM is recommended.
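
A minimal sketch of what the recommendation above can look like in practice, assuming a local SGLang server exposing its OpenAI-compatible API; the tag names, rubric text, port, and served model name are illustrative assumptions, not part of the model card:

```python
def build_judge_prompt(rubric: str, context: str, response: str) -> str:
    """Wrap the important prompt sections in xml tags, as recommended above."""
    return (
        f"<rubric>\n{rubric}\n</rubric>\n"
        f"<context>\n{context}\n</context>\n"
        f"<response>\n{response}\n</response>"
    )


def query_judge(prompt: str) -> str:
    """Send the prompt to a local SGLang server via its OpenAI-compatible API."""
    from openai import OpenAI  # requires the `openai` package

    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
    reply = client.chat.completions.create(
        model="root-judge",  # hypothetical: use the name the server was launched with
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content
```

The tagged sections keep the rubric, retrieved context, and candidate response clearly separated for the judge, which is what the xml-tag recommendation is aiming at.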
 

  The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

+ ## Model Details
+
+ ### Overview
+
+ - **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
+ - **Model type:** Text-Only Decoder Transformer
+ - **Language(s) (NLP):** Primarily English
+ - **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

+ ### Training Details

  - **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed-precision on 384 GPUs
  - **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
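
The Best-of-N use case from the Intended Use Cases section can be sketched as a small selection loop; `judge` here is any callable returning a numeric score per candidate, and wiring it to an actual Root Judge endpoint (and the scoring scale) is an assumption left out of this sketch:

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], judge: Callable[[str], float]) -> str:
    """Return the candidate the judge scores highest (ties: first wins)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=judge)


# Example with a stand-in judge that simply prefers longer answers:
best = best_of_n(["short", "a longer answer", "mid one"], judge=len)
# -> "a longer answer"
```

In practice `judge` would build an evaluation prompt per candidate and parse the model's score from its structured output.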