Ouz-G committed on
Commit a8441c4 · verified · 1 Parent(s): e3be3a2

readme update

Files changed (1): README.md (+37, -20)
README.md CHANGED

Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

**Root Judge**'s performance surpasses the Llama-3.3-Instruct model and similarly sized open models on instruction following, and achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

## 1. Intended Use Cases

**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts, such as:
- Detecting context-grounded hallucinations, e.g. in Retrieval-Augmented Generation (RAG) settings, in an explainable manner that provides a justification for the score (see the sketch below)
- Pairwise preference judgments, due to strong evaluation instruction-following capabilities
- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
- Privacy-focused settings that require local deployments
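
A minimal sketch of the hallucination-detection use case, assuming the model is served behind an OpenAI-compatible endpoint (e.g. a local SGLang server, see Getting Started below); the base URL, served model name, and JSON verdict format are illustrative assumptions, not a documented API:

```python
# Hypothetical sketch: judge whether a RAG answer is grounded in its context.
# Assumes a local OpenAI-compatible server; the base_url, model name, and
# verdict schema below are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def judge_groundedness(context: str, question: str, answer: str) -> str:
    # XML tags mark the important prompt sections, as recommended in
    # "Getting Started" below.
    prompt = (
        "You are an evaluator. Decide whether the answer is fully grounded "
        'in the context. Reply as JSON with keys "score" (0 or 1) and '
        '"justification".\n\n'
        f"<context>\n{context}\n</context>\n\n"
        f"<question>\n{question}\n</question>\n\n"
        f"<answer>\n{answer}\n</answer>"
    )
    response = client.chat.completions.create(
        model="root-judge",  # placeholder name for the locally served model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,  # sampling temperature; t=0.6 was used in the HaluBench runs
    )
    return response.choices[0].message.content

print(judge_groundedness(
    context="The Eiffel Tower is 330 metres tall.",
    question="How tall is the Eiffel Tower?",
    answer="It is about 500 metres tall.",
))
```

The same pattern extends to pairwise preference or Best-of-N judging: place the candidate answers in separate tagged sections and ask for a choice plus justification.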

## 2. Performance Summary

### 2.1 Hallucination Detection (in RAG setting)

📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench)

| Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($) |
| --- | --- | --- | --- | --- |
| **1** | **Root Judge** (FP8) | 14900 | **86.3** | **34** |
| 2 | GPT-4o | 14900 | 86.1 | - |
| 3 | o1-preview | 14899 | 85.3 | 1062 |
| 4 | Claude Sonnet-3.5 | 14797 | 85.2 | - |
| 5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 34 |
| 6 | o1-mini | 14655 | 83.7 | 156 |
| 7 | Llama3.1-405b-Instruct | 14881 | 83.6 | - |

[🔎 Detailed Performance Breakdown](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)

### 2.2 Instruction Following

Instruction following compared to open-weights judge and reward models:

| Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |

[...]

| Safety | | | 0.83986486 |
| Reasoning | | | 0.88103618 |

Root Judge outperforms most leading closed models at detecting instruction-following failures in evaluations, while providing detailed, structured justifications on long inputs of up to 32k tokens, on both internal benchmarks and the public HaluBench.

[...] to avoid known issues with performance drops such as catastrophic forgetting, and preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization, while also slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

## 3. Getting Started

We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use, together with *XML tags* for the important sections of your prompt. At least 96 GB of VRAM is recommended: while the model runs on 80 GB of VRAM, the effective context size (around 7k total tokens) will be too low for evaluating most RAG inputs.
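
A minimal launch sketch, assuming SGLang's standard `sglang.launch_server` entry point; the model path below is a placeholder for wherever the FP8 weights live locally, not an official identifier:

```python
# Hypothetical sketch: start a local SGLang server for the judge model.
# The launcher module and flags are SGLang's standard CLI; the model path
# is a placeholder assumption.
import subprocess

subprocess.run([
    "python3", "-m", "sglang.launch_server",
    "--model-path", "/models/root-judge-fp8",  # placeholder local path
    "--host", "0.0.0.0",
    "--port", "30000",  # matches the base_url in the judging sketch above
])
```

Once up, the server exposes an OpenAI-compatible API, which is what the judging sketch in section 1 targets.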
 
[...]

The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

## 4. Model Details

### 4.1 Overview

- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

### 4.2 Training Details

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision, on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland
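
For reference, the IPO objective (Azar et al., 2023) replaces DPO's logistic loss with a squared regression target on the preference log-likelihood ratio; with policy $\pi_\theta$, frozen reference $\pi_{\mathrm{ref}}$, preferred/rejected completions $y_w, y_l$, and regularization strength $\beta$:

$$
\mathcal{L}_{\mathrm{IPO}}(\theta) = \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \left( \log \frac{\pi_\theta(y_w \mid x)\, \pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\, \pi_{\mathrm{ref}}(y_w \mid x)} - \frac{1}{2\beta} \right)^2 \right]
$$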

## 5. Contact

[...]