---
license: llama3.3
language:
- en
base_model:
- meta-llama/Llama-3.3-70B-Instruct
pipeline_tag: text-generation
tags:
- llm-as-judge
- evaluation
---
# Model Card for RootSignals-Judge-Llama-70B

Root Judge is a powerful mid-sized model that enables reliable and customizable LLM system evaluations.
Root Judge was post-trained from Llama-3.3-70B-Instruct on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
The model weights are made freely available in FP8 to facilitate cost-effective research and application use.

Root Judge surpasses Llama-3.3-70B-Instruct and similarly sized open models on instruction following, and
achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.
### Primary Metrics Summary

Instruction following compared to open-weights judge and reward models:
| Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
|-------|------------|--------|---------|--------------|--------------|------------|------------|-----------------|
| Root Judge | FP8 (~70) | **94.62% ± 0.62%** | **93.88% ± N/A** | 52.80% ± 3.16% | 24.61% ± 2.70% | **56.80% ± 3.14%** | **64.54%** | 100% |
| Llama-3.3-70B | bf16 (~140) | 94.39% ± 0.63% | 93.41% ± N/A | 54.00% ± 3.16% | 23.44% ± 2.65% | 56.00% ± 3.15% | 64.25% | 99.5% |
| Patronus-70B | bf16 (~140) | 91.66% ± 0.76% | 83.69% ± N/A | 54.40% ± 3.16% | 24.61% ± 2.70% | 48.80% ± 3.17% | 60.63% | 93.9% |
| Nemotron-70B | FP8 (~70) | 80.06% ± 1.10% | 85.01% ± N/A | 53.60% ± 3.16% | 23.83% ± 2.67% | 55.60% ± 3.15% | 59.62% | 92.4% |
| Qwen-2.5-32B | bf16 (~64) | 87.41% ± 0.91% | 87.53% ± N/A | 58.80% ± 3.12% | 23.05% ± 2.64% | 45.20% ± 3.15% | 60.40% | 93.6% |
| Flow-Judge | bf16 (~16)* | 78.70% ± 1.13% | 64.63% ± N/A | **60.80% ± 3.09%** | 23.44% ± 2.65% | 35.60% ± 3.03% | 52.63% | 81.5% |
| Glider | bf16 (~8) | 78.70% ± 1.13% | 56.47% ± N/A | 59.20% ± 3.11% | **35.94% ± 3.00%** | 43.20% ± 3.14% | 54.70% | 84.8% |

[HaluBench public test set](https://huggingface.co/datasets/PatronusAI/HaluBench):
| Rank | Model | Responses Tested | Pass@1 Rate | False Neg. | False Pos. | Worst Dataset | Cost Estimate* |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Root Judge (FP8), decompose, t=0.6 | 14900 | 86.26% | 596 | 1340 | FinanceBench | ±$33.6 |
| 2 | gpt-4o-2024-05-13 | 14900 | 86.06% | 1052 | 1025 | DROP | - |
| 3 | o1-preview-2024-09-12, t=1 | 14899 | 85.25% | 1134 | 1063 | RagTruth | $1062.08 |
| 4 | claude-3-5-sonnet-20240620**, t=0.6 | 14797 | 85.17% | 1391 | 809 | PubMedQA | - |
| 5 | llama3.1:70b-instruct-q8_0, t=0.6 | 13969 | 84.67% | 769 | 1373 | DROP | ±$33.6 |
| 6 | o1-mini-2024-09-12, t=1 | 14655 | 83.71% | 1169 | 1219 | DROP | $156.07 |
| 7 | llama3.1:405b-instruct-q8_0, t=0.2 | 14881 | 83.58% | 1331 | 1113 | DROP | - |

Root Judge outperforms most leading closed models at detecting instruction-following failures in evaluations,
while providing detailed, structured justifications on long inputs of up to 32k tokens, on both internal benchmarks and the public HaluBench.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/rHq5RakEPkOlnC69MOl1e.png)
Image 1: Root Signals internal hallucination benchmark. Total pass@1 rates and consistency (delta) assessed via an ensemble of leading third-party models.


![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/zfsh6HTbYH1HpLItWgq8u.png)
Image 2: Root Signals internal hallucination benchmark. Custom rubric instruction following by high-level task.

Root Judge was tested to support complex, user-defined rating rubrics over large context sizes,
provide granular qualitative feedback, and support structured evaluation outputs and tool calling.

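As an illustration of the rubric-driven, structured judgments described above, the sketch below assembles a judge prompt with XML-tagged sections (as this card recommends) and asks for a JSON verdict. The tag names, rubric text, and output fields are illustrative assumptions, not a format the model requires.

```python
def build_judge_prompt(context: str, response: str, rubric: str) -> str:
    """Assemble a judge prompt; XML tags mark the important sections.
    Tag names and the JSON output shape are illustrative, not required."""
    return (
        "<instructions>\n"
        "Evaluate the response against the context using the rubric.\n"
        'Return JSON: {"score": 1-5, "justification": "..."}\n'
        "</instructions>\n"
        f"<rubric>\n{rubric}\n</rubric>\n"
        f"<context>\n{context}\n</context>\n"
        f"<response>\n{response}\n</response>"
    )

prompt = build_judge_prompt(
    context="The invoice total was 1,200 EUR.",
    response="The invoice total was 1,500 EUR.",
    rubric="5 = fully grounded in the context; 1 = contradicts the context.",
)
print(prompt)
```

Keeping the rubric in its own tagged section makes it easy to swap in custom, user-defined grading criteria without touching the rest of the prompt.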
## Intended Use Cases
The model's primary use is as an LLM-as-Judge for:
- detecting context-grounded hallucinations, e.g. for Retrieval-Augmented Generation (RAG), in an explainable manner that provides a justification for the choice
- pairwise preference judgments that leverage strong instruction following with custom rubrics, e.g. for inference-time compute or synthetic data tasks that require Best-of-N decisions
- privacy-focused deployments that need to avoid sending data across the public internet

Despite our main focus on nuanced and transparent judgment of candidate responses,
we test the judge model checkpoints extensively on public and private benchmarks
to avoid known performance drops such as catastrophic forgetting, and find that the model
preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
while slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

## Model Description

- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

## How to Get Started with the Model

We recommend using SGLang for production use, together with XML tags for the important sections in your prompt. At least 96GB of VRAM is recommended.
While the model runs on 80GB of VRAM, the effective context size (around 7k total tokens) will be too low for evaluating most RAG inputs.

SGLang example for a single NVIDIA H100 (80GB):
```bash
docker run \
    --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v huggingface:/root/.cache/huggingface \
    --volume /etc/localtime:/etc/localtime:ro \
    -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
    python3 -m sglang.launch_server \
    --model-path root-signals/RS1-llama-fast \
    --host 0.0.0.0 \
    --port 8000 \
    --mem-fraction-static 0.89 \
    --grammar-backend xgrammar \
    --enable-torch-compile \
    --disable-cuda-graph
```
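Once the server is up, it exposes an OpenAI-compatible chat completions endpoint. The sketch below only assembles a hallucination-check request body; the endpoint path, temperature, and message contents are assumptions for illustration, not a prescribed format.

```python
import json

# Hypothetical judgment request for the local server's
# OpenAI-compatible chat completions API (port 8000 as launched above).
payload = {
    "model": "root-signals/RS1-llama-fast",
    "temperature": 0.6,  # matches the setting used in the HaluBench rows above
    "messages": [
        {
            "role": "user",
            "content": (
                "<context>\nThe warranty covers two years.\n</context>\n"
                "<response>\nThe warranty covers five years.\n</response>\n"
                "Is the response grounded in the context? "
                "Answer PASS or FAIL with a short justification."
            ),
        }
    ],
}

# POST this body to http://localhost:8000/v1/chat/completions
body = json.dumps(payload)
print(body[:80])
```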
We also validated the model with vLLM on arm64 (NVIDIA GH200), with outputs of up to 72k tokens:
```bash
docker run \
    --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v huggingface:/root/.cache/huggingface \
    --volume /etc/localtime:/etc/localtime:ro \
    -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
    --model root-signals/RS1-llama-fast \
    --gpu-memory-utilization 0.97 \
    --max-model-len 72000 \
    --block_size 16
```
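The usable context length at a given VRAM budget is roughly whatever the KV cache can fill after the FP8 weights (~70 GB) are loaded. A back-of-the-envelope sketch, assuming a Llama-70B-shaped model (80 layers, 8 KV heads, head dim 128), an fp16 KV cache, and a guessed runtime overhead; these figures are assumptions, not measurements:

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Per-token KV-cache footprint: keys + values across every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_context_tokens(vram_gb, weight_gb=70.0, overhead_gb=4.0):
    """Tokens of KV cache that fit after weights and a rough runtime overhead."""
    free_bytes = (vram_gb - weight_gb - overhead_gb) * 1024**3
    return int(free_bytes // kv_bytes_per_token())

print(kv_bytes_per_token())   # 327680 bytes, i.e. ~0.31 MB per token
print(max_context_tokens(96))
```

Under these assumptions a 96GB card leaves room for a context in the tens of thousands of tokens, while an 80GB card leaves far less, which is why 96GB of VRAM is recommended above.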

The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

## Training Details

### Training Procedure

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision, on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland
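
The IPO loss used above replaces DPO's sigmoid likelihood with a squared regression toward a fixed margin, which avoids over-optimizing on saturated preference pairs. A minimal sketch of the per-pair loss, following the IPO formulation of Azar et al.; the temperature `tau` and log-probabilities here are placeholder values:

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO per-pair loss: squared distance between the policy/reference
    log-ratio margin and the target margin 1/(2*tau)."""
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (h - 1.0 / (2.0 * tau)) ** 2

# Placeholder log-probabilities for one preference pair
# (w = chosen response, l = rejected response).
loss = ipo_loss(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.2, ref_logp_l=-1.8)
print(loss)
```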