update
README.md
CHANGED
@@ -42,20 +42,23 @@ Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($)
 6 | o1-mini | 14655 | 83.7 | 156
 7 | Llama3.1-405b-Instruct | 14881 | 83.6 | -

-[🔎 Detailed Performance Breakdown](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)
+[🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)

 ### 2.2 Instruction Following

-Instruction
-|
+Instruction-following performance across diverse benchmarks, compared to other open-weights judge and reward models (higher is better):
+
+| Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
 |-------|------------|--------|---------|--------------|--------------|------------|------------|-----------------|
-| Root Judge
-| Llama-3.3-70B
-| Patronus-70B
-| Nemotron-70B
-| Qwen-2.5-32B
-| Flow
-| Glider
+| **Root Judge (FP8)** | 70 | **94.6 ± 0.6** | **93.88** | 52.8 ± 3.2 | 24.6 ± 2.7 | **56.8 ± 3.1** | **64.5** | 100 |
+| Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.41 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
+| Patronus-70B | 140 | 91.7 ± 0.8 | 83.69 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
+| Nemotron-70B | 70 | 80.1 ± 1.1 | 85.01 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
+| Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.53 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
+| Flow Judge | 16 | 78.7 ± 1.1 | 64.63 | **60.8 ± 3.1** | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
+| Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | **35.9 ± 3.0** | 43.2 ± 3.1 | 54.7 | 84.8 |
+
+[🔎 Detailed Performance Breakdown - Instruction Following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)

 [RewardBench Generative - Unverified](https://huggingface.co/spaces/allenai/reward-bench)

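The README does not say how the `Avg Score` and `Relative to Root Judge (%)` columns in the instruction-following table are derived. A minimal sketch, assuming `Avg Score` is the unweighted mean of the five benchmark columns and the relative column is that mean divided by Root Judge's mean: under this assumption the results agree with the published values to within about 0.1, which is consistent with the table having been computed from unrounded scores. The names `SCORES`, `average_score`, and `relative_to_baseline` are illustrative and do not come from the repository.

```python
from statistics import mean

# Benchmark scores (%) copied from the instruction-following table, in column
# order: GSM8K, IFEval, MUSR-Murder, MUSR-Object, MUSR-Team.
# The "± x" uncertainty values are omitted here.
SCORES = {
    "Root Judge (FP8)": [94.6, 93.88, 52.8, 24.6, 56.8],
    "Llama-3.3-70B": [94.4, 93.41, 54.0, 23.4, 56.0],
    "Patronus-70B": [91.7, 83.69, 54.4, 24.6, 48.8],
    "Nemotron-70B": [80.1, 85.01, 53.6, 23.8, 55.6],
    "Qwen-2.5-32B": [87.4, 87.53, 58.8, 23.1, 45.2],
    "Flow Judge": [78.7, 64.63, 60.8, 23.4, 35.6],
    "Glider": [78.7, 56.5, 59.2, 35.9, 43.2],
}


def average_score(model: str) -> float:
    """Unweighted mean over the five benchmarks (assumed definition)."""
    return mean(SCORES[model])


def relative_to_baseline(model: str, baseline: str = "Root Judge (FP8)") -> float:
    """Average score expressed as a percentage of the baseline's average."""
    return 100.0 * average_score(model) / average_score(baseline)


if __name__ == "__main__":
    # Print one line per model for comparison against the table.
    for name in SCORES:
        print(f"{name:18s} avg={average_score(name):5.1f} "
              f"relative={relative_to_baseline(name):5.1f}")
```

Because the inputs here are the rounded table entries, small last-digit differences from the published columns are expected.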