root-signals/RootSignals-Judge-Llama-70B


Ouz-G committed on Feb 14

Commit 6347228 · verified · 1 Parent(s): cc89ca3

update

Files changed (1)

README.md +9 -9


@@ -48,15 +48,15 @@ Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($)
 Instruction-following performance in various diverse benchmarks compared to other open-weights judge and reward models (higher is better):
-| Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
-|-------|------------|--------|---------|--------------|--------------|------------|------------|-----------------|
-| **Root Judge (FP8)** | 70  | **94.6 ± 0.6** | **93.88** | 52.8 ± 3.2     | 24.6 ± 2.7     | **56.8 ± 3.1** | **64.5** | 100 |
-| Llama-3.3-70B        | 140 | 94.4 ± 0.6     | 93.41     | 54.0 ± 3.2     | 23.4 ± 2.7     | 56.0 ± 3.2     | 64.3 | 99.5 |
-| Patronus-70B         | 140 | 91.7 ± 0.8     | 83.69     | 54.4 ± 3.2     | 24.6 ± 2.7     | 48.8 ± 3.2     | 60.6 | 93.9 |
-| Nemotron-70B         | 70  | 80.1 ± 1.1     | 85.01     | 53.6 ± 3.2     | 23.8 ± 2.7     | 55.6 ± 3.1     | 59.6 | 92.4 |
-| Qwen-2.5-32B         | 64  | 87.4 ± 0.9     | 87.53     | 58.8 ± 3.1     | 23.1 ± 2.6     | 45.2 ± 3.2     | 60.4 | 93.6 |
-| Flow Judge           | 16  | 78.7 ± 1.1     | 64.63     | **60.8 ± 3.1** | 23.4 ± 2.7     | 35.6 ± 3.0     | 52.6 | 81.5 |
-| Glider               | 8   | 78.7 ± 1.1     | 56.5      | 59.2 ± 3.1     | **35.9 ± 3.0** | 43.2 ± 3.1     | 54.7 | 84.8 |
 [🔎 Detailed Performance Breakdown | Intruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)

 Instruction-following performance in various diverse benchmarks compared to other open-weights judge and reward models (higher is better):
+Rank | Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
+| ---|--------------|------------|--------|---------|--------------|--------------|------------|------------|--------------------|
+**1** | **Root Judge (FP8)** | 70  | **94.6 ± 0.6** | **93.88** | 52.8 ± 3.2     | 24.6 ± 2.7     | **56.8 ± 3.1** | **64.5** | 100 |
+2 | Llama-3.3-70B        | 140 | 94.4 ± 0.6     | 93.41     | 54.0 ± 3.2     | 23.4 ± 2.7     | 56.0 ± 3.2     | 64.3 | 99.5 |
+3 | Patronus-70B         | 140 | 91.7 ± 0.8     | 83.69     | 54.4 ± 3.2     | 24.6 ± 2.7     | 48.8 ± 3.2     | 60.6 | 93.9 |
+4 | Nemotron-70B         | 70  | 80.1 ± 1.1     | 85.01     | 53.6 ± 3.2     | 23.8 ± 2.7     | 55.6 ± 3.1     | 59.6 | 92.4 |
+5 | Qwen-2.5-32B         | 64  | 87.4 ± 0.9     | 87.53     | 58.8 ± 3.1     | 23.1 ± 2.6     | 45.2 ± 3.2     | 60.4 | 93.6 |
+6 | Flow Judge           | 16  | 78.7 ± 1.1     | 64.63     | **60.8 ± 3.1** | 23.4 ± 2.7     | 35.6 ± 3.0     | 52.6 | 81.5 |
+7 | Glider               | 8   | 78.7 ± 1.1     | 56.5      | 59.2 ± 3.1     | **35.9 ± 3.0** | 43.2 ± 3.1     | 54.7 | 84.8 |
 [🔎 Detailed Performance Breakdown | Intruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)
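
The README does not say how the last two columns are derived. A minimal sketch, assuming `Avg Score` is the unweighted mean of the five benchmark columns and `Relative to Root Judge (%)` is each average divided by the Root Judge average; the `summarize` helper and the `SCORES` dictionary are hypothetical names introduced here, and the error margins are ignored. Under these assumptions the recomputed values land within about 0.1 of the reported ones, which is consistent with the table having been built from unrounded scores.

```python
from statistics import mean

# Scores (%) copied from the table above, in column order:
# GSM8K, IFEval, MUSR-Murder, MUSR-Object, MUSR-Team
SCORES = {
    "Root Judge (FP8)": [94.6, 93.88, 52.8, 24.6, 56.8],
    "Llama-3.3-70B": [94.4, 93.41, 54.0, 23.4, 56.0],
    "Patronus-70B": [91.7, 83.69, 54.4, 24.6, 48.8],
    "Nemotron-70B": [80.1, 85.01, 53.6, 23.8, 55.6],
    "Qwen-2.5-32B": [87.4, 87.53, 58.8, 23.1, 45.2],
    "Flow Judge": [78.7, 64.63, 60.8, 23.4, 35.6],
    "Glider": [78.7, 56.5, 59.2, 35.9, 43.2],
}


def summarize(scores, baseline="Root Judge (FP8)"):
    """Return {model: (average, percent of baseline average)}.

    Assumption: the average is an unweighted mean over the benchmarks.
    """
    averages = {name: mean(values) for name, values in scores.items()}
    base = averages[baseline]
    return {name: (avg, 100.0 * avg / base) for name, avg in averages.items()}


for name, (avg, rel) in summarize(SCORES).items():
    print(f"{name:<18} avg={avg:5.1f}  relative={rel:5.1f}")
```

Sorting the result by average is a quick way to compare the recomputed order against the `Rank` column this commit adds.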