Ouz-G committed on
Commit 4b82e4e · verified · 1 Parent(s): 5b4e0de

Update README.md

Files changed (1)
  1. README.md +9 -12
README.md CHANGED
@@ -51,14 +51,14 @@ Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($)

  📊 Instruction-following performance in various diverse benchmarks compared to other open-weights judge and reward models (higher is better):

- Rank | Model | Size (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
  | ---|--------------|------------|--------|---------|--------------|--------------|------------|------------|--------------------|
- **1** | **Root Judge (FP8)** | 70 | **94.6 ± 0.6** | **93.88** | 52.8 ± 3.2 | 24.6 ± 2.7 | **56.8 ± 3.1** | **64.5** | 100 |
- 2 | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.41 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
- 3 | Patronus-70B | 140 | 91.7 ± 0.8 | 83.69 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
- 4 | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.01 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
- 5 | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.53 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
- 6 | Flow Judge | 16 | 78.7 ± 1.1 | 64.63 | **60.8 ± 3.1** | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
  7 | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | **35.9 ± 3.0** | 43.2 ± 3.1 | 54.7 | 84.8 |

  [🔎 Detailed Performance Breakdown | Intruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)
@@ -209,8 +209,5 @@ docker run \
  - [Root Signals GitHub](https://github.com/root-signals/rs-python-sdk)
  - [Discord](https://discord.gg/EhazTQsFnj)

- **Emails**
213
214
215
216
 

  📊 Instruction-following performance in various diverse benchmarks compared to other open-weights judge and reward models (higher is better):

+ Rank | Model | VRAM (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
  | ---|--------------|------------|--------|---------|--------------|--------------|------------|------------|--------------------|
+ **1** | **Root Judge (FP8)** | 70 | **94.6 ± 0.6** | **93.9** | 52.8 ± 3.2 | 24.6 ± 2.7 | **56.8 ± 3.1** | **64.5** | 100 |
+ 2 | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.4 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
+ 3 | Patronus-70B | 140 | 91.7 ± 0.8 | 83.7 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
+ 4 | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.0 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
+ 5 | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.5 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
+ 6 | Flow Judge | 16 | 78.7 ± 1.1 | 64.6 | **60.8 ± 3.1** | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
  7 | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | **35.9 ± 3.0** | 43.2 ± 3.1 | 54.7 | 84.8 |

  [🔎 Detailed Performance Breakdown | Intruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)
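
Reviewer note: the README does not say how the `Avg Score` and `Relative to Root Judge (%)` columns are derived. The sketch below assumes `Avg Score` is the unweighted mean of the five benchmark columns and that the relative column is that mean divided by Root Judge's mean. This is an inference from the numbers, not something the commit states; the function names `avg_score` and `relative_to` are hypothetical.

```python
from statistics import mean

# Per-benchmark scores (%) copied from three rows of the updated table, in the
# order GSM8K, IFEval, MUSR-Murder, MUSR-Object, MUSR-Team.
SCORES = {
    "Root Judge (FP8)": [94.6, 93.9, 52.8, 24.6, 56.8],
    "Nemotron-70B": [80.1, 85.0, 53.6, 23.8, 55.6],
    "Glider": [78.7, 56.5, 59.2, 35.9, 43.2],
}


def avg_score(scores):
    """Assumed definition of 'Avg Score': unweighted mean of the benchmarks."""
    return mean(scores)


def relative_to(model, baseline="Root Judge (FP8)"):
    """Assumed definition of the relative column: model mean over baseline mean."""
    return 100 * avg_score(SCORES[model]) / avg_score(SCORES[baseline])


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(f"{name}: avg={avg_score(scores):.1f}, relative={relative_to(name):.1f}")
```

Under these assumptions the three rows above reproduce the table's values (64.5 / 100, 59.6 / 92.4, 54.7 / 84.8). Other rows can differ in the last digit, presumably because the published figures were computed from unrounded scores.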
 
  - [Root Signals GitHub](https://github.com/root-signals/rs-python-sdk)
  - [Discord](https://discord.gg/EhazTQsFnj)

+ **Email**
213