Ouz-G committed on
Commit cc89ca3 · verified · 1 Parent(s): a8441c4
Files changed (1)
  1. README.md +13 -10
README.md CHANGED
@@ -42,20 +42,23 @@ Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($)
42
  6 | o1-mini | 14655 | 83.7 | 156
43
  7 | Llama3.1-405b-Instruct | 14881 | 83.6 | -
44
 
45
- [🔎 Detailed Performance Breakdown](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)
46
 
47
  ### 2.2 Instruction Following
48
 
49
- Instruction following comared to open-weights judge and reward models:
50
- | Model | Precision (Size GB) | GSM8K↑ | IFEval↑ | MUSR-Murder↑ | MUSR-Object↑ | MUSR-Team↑ | Avg Score | Relative to RS-1 |
 
51
  |-------|------------|--------|---------|--------------|--------------|------------|------------|-----------------|
52
- | Root Judge | FP8 (~70) | **94.62% ± 0.62%** | **93.88% ± N/A** | 52.80% ± 3.16% | 24.61% ± 2.70%| **56.80% ± 3.14%** | **64.54%** | 100% |
53
- | Llama-3.3-70B | bf16 (~140) | 94.39% ± 0.63% | 93.41% ± N/A | 54.00% ± 3.16% | 23.44% ± 2.65% | 56.00% ± 3.15% | 64.25% | 99.5% |
54
- | Patronus-70B | bf16 (~140) | 91.66% ± 0.76% | 83.69% ± N/A | 54.40% ± 3.16% | 24.61% ± 2.70% | 48.80% ± 3.17% | 60.63% | 93.9% |
55
- | Nemotron-70B | FP8 (~70) | 80.06% ± 1.10% | 85.01% ± N/A | 53.60% ± 3.16% | 23.83% ± 2.67% | 55.60% ± 3.15% | 59.62% | 92.4% |
56
- | Qwen-2.5-32B | bf16 (~64) | 87.41% ± 0.91% | 87.53% ± N/A | 58.80% ± 3.12% | 23.05% ± 2.64% | 45.20% ± 3.15% | 60.40% | 93.6% |
57
- | Flow-Judge | bf16 (~16)* | 78.70% ± 1.13% | 64.63% ± N/A | **60.80% ± 3.09%** | 23.44% ± 2.65% | 35.60% ± 3.03% | 52.63% | 81.5% |
58
- | Glider | bf16 (~8) | 78.70% ± 1.13% | 56.47% ± N/A | 59.20% ± 3.11% | **35.94% ± 3.00%** | 43.20% ± 3.14% | 54.70% | 84.8% |
 
 
59
 
60
  [RewardBench Generative - Unverified](https://huggingface.co/spaces/allenai/reward-bench)
61
 
 
42
  6 | o1-mini | 14655 | 83.7 | 156
43
  7 | Llama3.1-405b-Instruct | 14881 | 83.6 | -
44
 
45
+ [🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)
46
 
47
  ### 2.2 Instruction Following
48
 
49
+ Instruction-following performance in various diverse benchmarks compared to other open-weights judge and reward models (higher is better):
50
+
51
+ | Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
52
  |-------|------------|--------|---------|--------------|--------------|------------|------------|-----------------|
53
+ | **Root Judge (FP8)** | 70 | **94.6 ± 0.6** | **93.88** | 52.8 ± 3.2 | 24.6 ± 2.7 | **56.8 ± 3.1** | **64.5** | 100 |
54
+ | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.41 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
55
+ | Patronus-70B | 140 | 91.7 ± 0.8 | 83.69 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
56
+ | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.01 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
57
+ | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.53 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
58
+ | Flow Judge | 16 | 78.7 ± 1.1 | 64.63 | **60.8 ± 3.1** | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
59
+ | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | **35.9 ± 3.0** | 43.2 ± 3.1 | 54.7 | 84.8 |
60
+
61
+ [🔎 Detailed Performance Breakdown | Intruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)
62
 
63
  [RewardBench Generative - Unverified](https://huggingface.co/spaces/allenai/reward-bench)
64
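Reviewer note: the diff does not say how the `Avg Score` and `Relative to Root Judge (%)` columns are derived. The sketch below assumes `Avg Score` is the unweighted mean of the five benchmark columns and the relative column is each model's average divided by Root Judge's average. That assumption matches the published figures for the rows checked here, but it is inferred from the numbers, not confirmed by the README. The `SCORES` dictionary and the `summarize` helper are hypothetical names used only for illustration; the values are copied from the pre-change table rows, which keep two decimals.

```python
from statistics import mean

# Per-benchmark scores (%) copied from the pre-change table rows, in the order
# GSM8K, IFEval, MUSR-Murder, MUSR-Object, MUSR-Team.
SCORES = {
    "Root Judge": [94.62, 93.88, 52.80, 24.61, 56.80],
    "Llama-3.3-70B": [94.39, 93.41, 54.00, 23.44, 56.00],
    "Glider": [78.70, 56.47, 59.20, 35.94, 43.20],
}


def summarize(scores, baseline="Root Judge"):
    """Return {model: (average, relative_to_baseline_percent)}.

    Assumption: the average is an unweighted mean over the five benchmarks,
    and the relative score is that average as a percentage of the baseline's.
    """
    averages = {model: mean(values) for model, values in scores.items()}
    base = averages[baseline]
    return {
        model: (round(avg, 2), round(100 * avg / base, 1))
        for model, avg in averages.items()
    }


SUMMARY = summarize(SCORES)
for model, (avg, rel) in SUMMARY.items():
    print(f"{model}: avg={avg}, relative={rel}")
```

If the computed values ever diverge from the table after a future update, that would suggest the columns use a different weighting or baseline than assumed here.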