Add model card draft
README.md ADDED
@@ -0,0 +1,126 @@
---
license: llama3.3
language:
- en
base_model:
- meta-llama/Llama-3.3-70B-Instruct
pipeline_tag: text-generation
tags:
- llm-as-judge
- evaluation
---

# Model Card for RootSignals-Judge-Llama-70B
Root Judge is a powerful mid-sized model that enables reliable and customizable LLM system evaluations.
It was post-trained from Llama-3.3-70B-Instruct on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
The model weights are freely available in FP8 to facilitate cost-effective research and application use.

Root Judge surpasses Llama-3.3-70B-Instruct and similarly sized open models on instruction following, and
achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

### Primary Metrics Summary

Instruction following compared to open-weights judge and reward models:
| Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
|-------|---------------------|--------|---------|--------------|--------------|------------|-----------|------------------|
| Root Judge | FP8 (~70) | **94.62% ± 0.62%** | **93.88% ± N/A** | 52.80% ± 3.16% | 24.61% ± 2.70% | **56.80% ± 3.14%** | **64.54%** | 100% |
| Llama-3.3-70B | bf16 (~140) | 94.39% ± 0.63% | 93.41% ± N/A | 54.00% ± 3.16% | 23.44% ± 2.65% | 56.00% ± 3.15% | 64.25% | 99.5% |
| Patronus-70B | bf16 (~140) | 91.66% ± 0.76% | 83.69% ± N/A | 54.40% ± 3.16% | 24.61% ± 2.70% | 48.80% ± 3.17% | 60.63% | 93.9% |
| Nemotron-70B | FP8 (~70) | 80.06% ± 1.10% | 85.01% ± N/A | 53.60% ± 3.16% | 23.83% ± 2.67% | 55.60% ± 3.15% | 59.62% | 92.4% |
| Qwen-2.5-32B | bf16 (~64) | 87.41% ± 0.91% | 87.53% ± N/A | 58.80% ± 3.12% | 23.05% ± 2.64% | 45.20% ± 3.15% | 60.40% | 93.6% |
| Flow-Judge | bf16 (~16)* | 78.70% ± 1.13% | 64.63% ± N/A | **60.80% ± 3.09%** | 23.44% ± 2.65% | 35.60% ± 3.03% | 52.63% | 81.5% |
| Glider | bf16 (~8) | 78.70% ± 1.13% | 56.47% ± N/A | 59.20% ± 3.11% | **35.94% ± 3.00%** | 43.20% ± 3.14% | 54.70% | 84.8% |

[HaluBench public test set](https://huggingface.co/datasets/PatronusAI/HaluBench):

| Rank | Model | Responses Tested | Pass@1 Rate | False Neg. | False Pos. | Worst Dataset | Cost estimate* |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Root Judge (FP8), decompose, t=0.6 | 14900 | 86.26% | 596 | 1340 | FinanceBench | ±$33.6 |
| 2 | gpt-4o-2024-05-13 | 14900 | 86.06% | 1052 | 1025 | DROP | - |
| 3 | o1-preview-2024-09-12, t=1 | 14899 | 85.25% | 1134 | 1063 | RAGTruth | $1062.08 |
| 4 | claude-3-5-sonnet-20240620**, t=0.6 | 14797 | 85.17% | 1391 | 809 | PubMedQA | - |
| 5 | llama3.1:70b-instruct-q8_0, t=0.6 | 13969 | 84.67% | 769 | 1373 | DROP | ±$33.6 |
| 6 | o1-mini-2024-09-12, t=1 | 14655 | 83.71% | 1169 | 1219 | DROP | $156.07 |
| 7 | llama3.1:405b-instruct-q8_0, t=0.2 | 14881 | 83.58% | 1331 | 1113 | DROP | - |

Root Judge outperforms most leading closed models at detecting instruction-following failures,
while providing detailed, structured justifications on long inputs of up to 32k tokens, both on our internal benchmarks and on the public HaluBench test set.

![benchmark_results](./Root_Judge_benchmark_hallucinations.png)
Image 1: Root Signals internal hallucination benchmark. Total pass@1 rates and consistency (delta) assessed via an ensemble of leading 3rd-party models.

![benchmark_results_by_task](./Root_Judge_benchmark_hallucinations_task.png)
Image 2: Root Signals internal hallucination benchmark. Custom-rubric instruction following by high-level task.

Root Judge was tested to support complex, user-defined rating rubrics over large context sizes,
provide granular qualitative feedback, and produce structured evaluation outputs and tool calls.
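
To make "user-defined rating rubric" concrete, here is a minimal sketch of how such a rubric might be phrased. The `<rubric>` tag, the 1-5 scale, and the JSON shape are our own illustrative assumptions, not a format the model requires:

```python
# Illustrative only: a custom faithfulness rubric to embed in a judge prompt.
# The <rubric> tag, 1-5 scale, and JSON shape are assumptions, not a schema
# mandated by Root Judge; pair this with the deployment examples below.
FAITHFULNESS_RUBRIC = """<rubric>
Rate the response for faithfulness to the provided context on a 1-5 scale:
5 - every claim is directly supported by the context
3 - minor unsupported details that do not change the answer
1 - central claims contradict or are missing from the context
Return JSON only: {"score": <1-5>, "justification": "<short paragraph>"}
</rubric>"""
```

Grammar-constrained decoding (e.g. the xgrammar backend enabled in the SGLang example below) can additionally be used to enforce the JSON shape of the output.
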
## Intended Use Cases

The model's primary use is as an LLM-as-Judge for:

- detecting context-grounded hallucinations in an explainable manner, e.g. for Retrieval-Augmented Generation (RAG), providing a justification for the choice
- pairwise preference judgments that leverage strong instruction following with custom rubrics, e.g. for assisting with inference-time compute or synthetic data tasks that require Best-of-N decisions (a sketch follows this list)
- privacy-focused deployments that need to avoid sending data across the public internet

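A minimal Best-of-N sketch built on pairwise judgments, assuming an OpenAI-compatible endpoint such as the SGLang/vLLM deployments shown under Getting Started below; the prompt wording and the A/B answer protocol are illustrative, not a fixed interface of the model:

```python
# Best-of-N via sequential pairwise knockout: keep the judged winner.
# Endpoint, prompt wording, and the A/B answer protocol are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def prefer(question: str, a: str, b: str) -> str:
    """Ask the judge which candidate response is better; returns 'A' or 'B'."""
    prompt = (
        "You are an impartial judge. Compare the two responses to the question "
        "and answer with exactly one letter: A or B.\n"
        f"<question>{question}</question>\n"
        f"<response_A>{a}</response_A>\n"
        f"<response_B>{b}</response_B>"
    )
    out = client.chat.completions.create(
        model="root-signals/RS1-llama-fast",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        max_tokens=4,
    )
    # Crude parse for a sketch: treat any output containing "A" as a vote for A.
    return "A" if "A" in (out.choices[0].message.content or "").upper() else "B"

def best_of_n(question: str, candidates: list[str]) -> str:
    """Return the candidate that survives all pairwise comparisons."""
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefer(question, best, challenger) == "B":
            best = challenger
    return best
```

In practice you would also want to swap the A/B positions between calls to control for position bias.
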
Despite our main focus on nuanced and transparent judgment of candidate responses,
we test the judge model checkpoints extensively on public and private benchmarks
to avoid known issues such as catastrophic forgetting, and find that the model
preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
while slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

## Model Description

- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

## How to Get Started with the Model

We recommend SGLang for production use, together with XML tags for important sections in your prompt. At least 96 GB of VRAM is recommended:
while the model runs on 80 GB of VRAM, the effective context size (around 7k total tokens) will be too low for evaluating most RAG inputs.

SGLang example for a single Nvidia H100 (80 GB):
```bash
docker run \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v huggingface:/root/.cache/huggingface \
  --volume /etc/localtime:/etc/localtime:ro \
  -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
  python3 -m sglang.launch_server \
  --model-path root-signals/RS1-llama-fast \
  --host 0.0.0.0 \
  --port 8000 \
  --mem-fraction-static 0.89 \
  --grammar-backend xgrammar \
  --enable-torch-compile \
  --disable-cuda-graph
```

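A quick smoke test against the server above, assuming the OpenAI-compatible endpoint it exposes on port 8000; the tag names and the PASS/FAIL convention are our own illustration of the XML-tagging recommendation, not a required format:

```python
# Minimal grounding check against the local SGLang server started above.
# Tag names and the PASS/FAIL convention are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="root-signals/RS1-llama-fast",
    messages=[{
        "role": "user",
        "content": (
            "Judge whether the response is fully grounded in the context.\n"
            "<context>The Eiffel Tower is 330 metres tall.</context>\n"
            "<response>The Eiffel Tower is about 330 m high.</response>\n"
            "Answer PASS or FAIL, then give a one-sentence justification."
        ),
    }],
    temperature=0.6,
)
print(resp.choices[0].message.content)
```
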
We also validated the model with vLLM on arm64 (Nvidia GH200), with context lengths of up to 72k tokens:
```bash
docker run \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v huggingface:/root/.cache/huggingface \
  --volume /etc/localtime:/etc/localtime:ro \
  -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
  --model root-signals/RS1-llama-fast \
  --gpu-memory-utilization 0.97 \
  --max-model-len 72000 \
  --block-size 16
```

The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

## Training Details

### Training Procedure

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision, on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland
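
For reference, the IPO objective (Azar et al., 2023) used above regresses the policy/reference log-ratio gap toward a fixed margin; τ is the regularization strength (the value used for this run is not stated here):

$$
\mathcal{L}_{\mathrm{IPO}}(\theta)=\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\left(\log\frac{\pi_\theta(y_w\mid x)\,\pi_{\mathrm{ref}}(y_l\mid x)}{\pi_\theta(y_l\mid x)\,\pi_{\mathrm{ref}}(y_w\mid x)}-\frac{1}{2\tau}\right)^{2}\right]
$$

where $y_w$ and $y_l$ are the preferred and rejected responses and $\pi_{\mathrm{ref}}$ is the reference policy (typically the initial checkpoint, here Llama-3.3-70B-Instruct).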