Ouz-G committed · Commit e3be3a2 · verified · 1 Parent(s): 956af36

Update readme

Files changed (1):
  1. README.md +19 -17
  1. README.md +19 -17
README.md CHANGED
@@ -16,9 +16,17 @@ Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high quality, hum
  The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

  Root Judge’s performance surpasses the Llama-3.3-Instruct model and similar sized open models on instruction following and
- achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

- ### Primary Metrics Summary

  Instruction following compared to open-weights judge and reward models:
  | Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
@@ -88,25 +96,12 @@ Image 2: Root Signals internal hallucination benchmark. Custom rubric instructio
  Root Judge was tested to support complex, user-defined rating rubrics over large context sizes,
  provide granular qualitative feedback, and support structured evaluation outputs and tool calling.

- ## Intended Use Cases
- The model's primary use is as LLM-as-Judge for:
- detecting context-grounded hallucinations, e.g. for Retrieval-Augmented-Generation (RAG) in explainable manner, providing a justification for the choice
- pairwise preference judgments, that leverage strong instruction following, with custom rubrics e.g. for assisting with inference time compute or synthetic data tasks that require Best-of-N decisions.
- privacy-focused deployments, that want to avoid sending data across the public internet

  Despite our main focus on nuanced and transparent judgement of candidate responses,
  we test the judge model checkpoints extensively on public and private benchmarks,
  to avoid known issues with performance drops such as catastrophic forgetting, and find that the model
  preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization,
  while also slightly outperforming it on public instruction following benchmarks such as IFEval and MuSR.

- ## Model Description
-
- - **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- - **Model type:** Text-Only Decoder Transformer
- - **Language(s) (NLP):** Primarily English
- - **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct
-
  ## Getting Started

  We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use together with *xml tags* for important sections in your prompt. At least 96GB of VRAM is recommended.
@@ -148,9 +143,16 @@ docker run \

  The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

- ## Training Details

- ### Training Procedure

  - **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed-precision on 384 GPUs
  - **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
 
  The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

  Root Judge’s performance surpasses the Llama-3.3-Instruct model and similar sized open models on instruction following and
+ achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

+ ## Intended Use Cases
+ **Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts, such as:
+ - Detecting context-grounded hallucinations, e.g. in Retrieval Augmented Generation (RAG) settings, in an explainable manner that provides a justification for the score
+ - Pairwise preference judgments, due to strong evaluation instruction-following capabilities
+ - Serving as a custom evaluation metric powered by use-case-specific evaluation rubrics
+ - Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
+ - Privacy-focused settings that require local deployments
+
+ ## Performance Summary

  Instruction following compared to open-weights judge and reward models:
  | Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
 
  Root Judge was tested to support complex, user-defined rating rubrics over large context sizes,
  provide granular qualitative feedback, and support structured evaluation outputs and tool calling.

  Despite our main focus on nuanced and transparent judgement of candidate responses,
  we test the judge model checkpoints extensively on public and private benchmarks,
  to avoid known issues with performance drops such as catastrophic forgetting, and find that the model
  preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization,
  while also slightly outperforming it on public instruction following benchmarks such as IFEval and MuSR.

  ## Getting Started

  We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use together with *xml tags* for important sections in your prompt. At least 96GB of VRAM is recommended.
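
A minimal sketch of what the recommendation above can look like in practice, assuming a local SGLang server exposing its OpenAI-compatible API; the tag names, rubric text, port, and served model name are illustrative assumptions, not part of the model card:

```python
def build_judge_prompt(rubric: str, context: str, response: str) -> str:
    """Wrap the important prompt sections in xml tags, as recommended above."""
    return (
        f"<rubric>\n{rubric}\n</rubric>\n"
        f"<context>\n{context}\n</context>\n"
        f"<response>\n{response}\n</response>"
    )


def query_judge(prompt: str) -> str:
    """Send the prompt to a local SGLang server via its OpenAI-compatible API."""
    from openai import OpenAI  # requires the `openai` package

    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
    reply = client.chat.completions.create(
        model="root-judge",  # hypothetical: use the name the server was launched with
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content
```

The tagged sections keep the rubric, retrieved context, and candidate response clearly separated for the judge, which is what the xml-tag recommendation is aiming at.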
 

  The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

+ ## Model Details
+
+ ### Overview
+
+ - **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
+ - **Model type:** Text-Only Decoder Transformer
+ - **Language(s) (NLP):** Primarily English
+ - **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

+ ### Training Details

  - **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed-precision on 384 GPUs
  - **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
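
The Best-of-N use case from the Intended Use Cases section can be sketched as a small selection loop; `judge` here is any callable returning a numeric score per candidate, and wiring it to an actual Root Judge endpoint (and the scoring scale) is an assumption left out of this sketch:

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], judge: Callable[[str], float]) -> str:
    """Return the candidate the judge scores highest (ties: first wins)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=judge)


# Example with a stand-in judge that simply prefers longer answers:
best = best_of_n(["short", "a longer answer", "mid one"], judge=len)
# -> "a longer answer"
```

In practice `judge` would build an evaluation prompt per candidate and parse the model's score from its structured output.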