Add model card draft
README.md ADDED
@@ -0,0 +1,126 @@
---
license: llama3.3
language:
- en
base_model:
- meta-llama/Llama-3.3-70B-Instruct
pipeline_tag: text-generation
tags:
- llm-as-judge
- evaluation
---

# Model Card for RootSignals-Judge-Llama-70B
Root Judge is a powerful mid-sized model that enables reliable and customizable LLM system evaluations.
It was post-trained from Llama-3.3-70B-Instruct on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
The model weights are freely available in FP8 to facilitate cost-effective research and application use.

Root Judge surpasses Llama-3.3-70B-Instruct and similarly sized open models on instruction following, and
achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

### Primary Metrics Summary

Instruction following compared to open-weights judge and reward models:
| Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
|-------|---------------------|--------|---------|--------------|--------------|------------|-----------|------------------|
| Root Judge | FP8 (~70) | **94.62% ± 0.62%** | **93.88% ± N/A** | 52.80% ± 3.16% | 24.61% ± 2.70% | **56.80% ± 3.14%** | **64.54%** | 100% |
| Llama-3.3-70B | bf16 (~140) | 94.39% ± 0.63% | 93.41% ± N/A | 54.00% ± 3.16% | 23.44% ± 2.65% | 56.00% ± 3.15% | 64.25% | 99.5% |
| Patronus-70B | bf16 (~140) | 91.66% ± 0.76% | 83.69% ± N/A | 54.40% ± 3.16% | 24.61% ± 2.70% | 48.80% ± 3.17% | 60.63% | 93.9% |
| Nemotron-70B | FP8 (~70) | 80.06% ± 1.10% | 85.01% ± N/A | 53.60% ± 3.16% | 23.83% ± 2.67% | 55.60% ± 3.15% | 59.62% | 92.4% |
| Qwen-2.5-32B | bf16 (~64) | 87.41% ± 0.91% | 87.53% ± N/A | 58.80% ± 3.12% | 23.05% ± 2.64% | 45.20% ± 3.15% | 60.40% | 93.6% |
| Flow-Judge | bf16 (~16)* | 78.70% ± 1.13% | 64.63% ± N/A | **60.80% ± 3.09%** | 23.44% ± 2.65% | 35.60% ± 3.03% | 52.63% | 81.5% |
| Glider | bf16 (~8) | 78.70% ± 1.13% | 56.47% ± N/A | 59.20% ± 3.11% | **35.94% ± 3.00%** | 43.20% ± 3.14% | 54.70% | 84.8% |

[HaluBench public test set](https://huggingface.co/datasets/PatronusAI/HaluBench):

| Rank | Model | Responses Tested | Pass@1 Rate | False Neg. | False Pos. | Worst Dataset | Cost estimate* |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Root Judge (FP8), decompose, t=0.6 | 14900 | 86.26% | 596 | 1340 | FinanceBench | ±$33.6 |
| 2 | gpt-4o-2024-05-13 | 14900 | 86.06% | 1052 | 1025 | DROP | - |
| 3 | o1-preview-2024-09-12, t=1 | 14899 | 85.25% | 1134 | 1063 | RAGTruth | $1062.08 |
| 4 | claude-3-5-sonnet-20240620**, t=0.6 | 14797 | 85.17% | 1391 | 809 | PubMedQA | - |
| 5 | llama3.1:70b-instruct-q8_0, t=0.6 | 13969 | 84.67% | 769 | 1373 | DROP | ±$33.6 |
| 6 | o1-mini-2024-09-12, t=1 | 14655 | 83.71% | 1169 | 1219 | DROP | $156.07 |
| 7 | llama3.1:405b-instruct-q8_0, t=0.2 | 14881 | 83.58% | 1331 | 1113 | DROP | - |

Root Judge outperforms most leading closed models at detecting instruction-following failures,
while providing detailed, structured justifications on long inputs of up to 32k tokens, both on our internal benchmarks and on the public HaluBench test set.

![benchmark_results](./Root_Judge_benchmark_hallucinations.png)
Image 1: Root Signals internal hallucination benchmark. Total pass@1 rates and consistency (delta) assessed via an ensemble of leading 3rd-party models.

![benchmark_results_by_task](./Root_Judge_benchmark_hallucinations_task.png)
Image 2: Root Signals internal hallucination benchmark. Custom-rubric instruction following by high-level task.

Root Judge was tested to support complex, user-defined rating rubrics over large context sizes,
provide granular qualitative feedback, and produce structured evaluation outputs and tool calls.
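
To make "user-defined rating rubric" concrete, here is a minimal sketch of how such a rubric might be phrased. The `<rubric>` tag, the 1-5 scale, and the JSON shape are our own illustrative assumptions, not a format the model requires:

```python
# Illustrative only: a custom faithfulness rubric to embed in a judge prompt.
# The <rubric> tag, 1-5 scale, and JSON shape are assumptions, not a schema
# mandated by Root Judge; pair this with the deployment examples below.
FAITHFULNESS_RUBRIC = """<rubric>
Rate the response for faithfulness to the provided context on a 1-5 scale:
5 - every claim is directly supported by the context
3 - minor unsupported details that do not change the answer
1 - central claims contradict or are missing from the context
Return JSON only: {"score": <1-5>, "justification": "<short paragraph>"}
</rubric>"""
```

Grammar-constrained decoding (e.g. the xgrammar backend enabled in the SGLang example below) can additionally be used to enforce the JSON shape of the output.
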
## Intended Use Cases

The model's primary use is as an LLM-as-Judge for:

- detecting context-grounded hallucinations in an explainable manner, e.g. for Retrieval-Augmented Generation (RAG), providing a justification for the choice
- pairwise preference judgments that leverage strong instruction following with custom rubrics, e.g. for assisting with inference-time compute or synthetic data tasks that require Best-of-N decisions (a sketch follows this list)
- privacy-focused deployments that need to avoid sending data across the public internet

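A minimal Best-of-N sketch built on pairwise judgments, assuming an OpenAI-compatible endpoint such as the SGLang/vLLM deployments shown under Getting Started below; the prompt wording and the A/B answer protocol are illustrative, not a fixed interface of the model:

```python
# Best-of-N via sequential pairwise knockout: keep the judged winner.
# Endpoint, prompt wording, and the A/B answer protocol are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def prefer(question: str, a: str, b: str) -> str:
    """Ask the judge which candidate response is better; returns 'A' or 'B'."""
    prompt = (
        "You are an impartial judge. Compare the two responses to the question "
        "and answer with exactly one letter: A or B.\n"
        f"<question>{question}</question>\n"
        f"<response_A>{a}</response_A>\n"
        f"<response_B>{b}</response_B>"
    )
    out = client.chat.completions.create(
        model="root-signals/RS1-llama-fast",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        max_tokens=4,
    )
    # Crude parse for a sketch: treat any output containing "A" as a vote for A.
    return "A" if "A" in (out.choices[0].message.content or "").upper() else "B"

def best_of_n(question: str, candidates: list[str]) -> str:
    """Return the candidate that survives all pairwise comparisons."""
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefer(question, best, challenger) == "B":
            best = challenger
    return best
```

In practice you would also want to swap the A/B positions between calls to control for position bias.
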
Despite our main focus on nuanced and transparent judgment of candidate responses,
we test the judge model checkpoints extensively on public and private benchmarks
to avoid known issues such as catastrophic forgetting, and find that the model
preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
while slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

## Model Description

- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

## How to Get Started with the Model

We recommend SGLang for production use, together with XML tags for important sections in your prompt. At least 96 GB of VRAM is recommended:
while the model runs on 80 GB of VRAM, the effective context size (around 7k total tokens) will be too low for evaluating most RAG inputs.

SGLang example for a single Nvidia H100 (80 GB):
```bash
docker run \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v huggingface:/root/.cache/huggingface \
  --volume /etc/localtime:/etc/localtime:ro \
  -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
  python3 -m sglang.launch_server \
  --model-path root-signals/RS1-llama-fast \
  --host 0.0.0.0 \
  --port 8000 \
  --mem-fraction-static 0.89 \
  --grammar-backend xgrammar \
  --enable-torch-compile \
  --disable-cuda-graph
```

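A quick smoke test against the server above, assuming the OpenAI-compatible endpoint it exposes on port 8000; the tag names and the PASS/FAIL convention are our own illustration of the XML-tagging recommendation, not a required format:

```python
# Minimal grounding check against the local SGLang server started above.
# Tag names and the PASS/FAIL convention are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="root-signals/RS1-llama-fast",
    messages=[{
        "role": "user",
        "content": (
            "Judge whether the response is fully grounded in the context.\n"
            "<context>The Eiffel Tower is 330 metres tall.</context>\n"
            "<response>The Eiffel Tower is about 330 m high.</response>\n"
            "Answer PASS or FAIL, then give a one-sentence justification."
        ),
    }],
    temperature=0.6,
)
print(resp.choices[0].message.content)
```
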
We also validated the model with vLLM on arm64 (Nvidia GH200), with context lengths of up to 72k tokens:
```bash
docker run \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v huggingface:/root/.cache/huggingface \
  --volume /etc/localtime:/etc/localtime:ro \
  -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
  --model root-signals/RS1-llama-fast \
  --gpu-memory-utilization 0.97 \
  --max-model-len 72000 \
  --block-size 16
```

The model is also available on our [platform](https://rootsignals.ai) as part of our evaluation suite, at no additional cost.

## Training Details

### Training Procedure

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision, on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland
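
For reference, the IPO objective (Azar et al., 2023) used above regresses the policy/reference log-ratio gap toward a fixed margin; τ is the regularization strength (the value used for this run is not stated here):

$$
\mathcal{L}_{\mathrm{IPO}}(\theta)=\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\left(\log\frac{\pi_\theta(y_w\mid x)\,\pi_{\mathrm{ref}}(y_l\mid x)}{\pi_\theta(y_l\mid x)\,\pi_{\mathrm{ref}}(y_w\mid x)}-\frac{1}{2\tau}\right)^{2}\right]
$$

where $y_w$ and $y_l$ are the preferred and rejected responses and $\pi_{\mathrm{ref}}$ is the reference policy (typically the initial checkpoint, here Llama-3.3-70B-Instruct).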