Add rewardbench scores
README.md
CHANGED
@@ -31,6 +31,38 @@ Instruction following compared to open-weights judge and reward models:
| Flow-Judge | bf16 (~16)* | 78.70% ± 1.13% | 64.63% ± N/A | **60.80% ± 3.09%** | 23.44% ± 2.65% | 35.60% ± 3.03% | 52.63% | 81.5% |
| Glider | bf16 (~8) | 78.70% ± 1.13% | 56.47% ± N/A | 59.20% ± 3.11% | **35.94% ± 3.00%** | 43.20% ± 3.14% | 54.70% | 84.8% |
+[RewardBench Generative - Unverified](https://huggingface.co/spaces/allenai/reward-bench)
+
+| Test Name | Score | Total | Accuracy |
+|------------------------|-------|-------|------------|
+| alpacaeval-easy | 99.0 | 100 | 0.99 |
+| alpacaeval-hard | 93.0 | 95 | 0.97894737 |
+| alpacaeval-length | 86.0 | 95 | 0.90526316 |
+| donotanswer | 73.5 | 136 | 0.54044118 |
+| hep-cpp | 159.0 | 164 | 0.96951220 |
+| hep-go | 159.0 | 164 | 0.96951220 |
+| hep-java | 161.0 | 164 | 0.98170732 |
+| hep-js | 159.0 | 164 | 0.96951220 |
+| hep-python | 158.0 | 164 | 0.96341463 |
+| hep-rust | 152.0 | 164 | 0.92682927 |
+| llmbar-adver-GPTInst | 69.0 | 92 | 0.75 |
+| llmbar-adver-GPTOut | 39.0 | 47 | 0.82978723 |
+| llmbar-adver-manual | 32.0 | 46 | 0.69565217 |
+| llmbar-adver-neighbor | 74.0 | 134 | 0.55223881 |
+| llmbar-natural | 94.0 | 100 | 0.94 |
+| math-prm | 357.0 | 447 | 0.79865772 |
+| mt-bench-easy | 28.0 | 28 | 1.0 |
+| mt-bench-hard | 32.0 | 37 | 0.86486486 |
+| mt-bench-med | 40.0 | 40 | 1.0 |
+| refusals-dangerous | 73.5 | 100 | 0.735 |
+| refusals-offensive | 89.0 | 100 | 0.89 |
+| xstest-should-refuse | 140.5 | 154 | 0.91233766 |
+| xstest-should-respond | 245.0 | 250 | 0.98 |
+| Chat | | | 0.96648045 |
+| Chat Hard | | | 0.74561404 |
+| Safety | | | 0.83986486 |
+| Reasoning | | | 0.88103618 |
+

[Halubench Public test-set](https://huggingface.co/datasets/PatronusAI/HaluBench):
Rank | Model | Responses Tested | Pass@1 Rate | False - | False + | Worst Dataset | Cost estimate*
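The four category rows in the added table follow from the per-test rows above. A minimal sketch, assuming RewardBench's published category groupings: Chat, Chat Hard, and Safety pool raw scores over totals within the category, while Reasoning gives the math subtest and the pooled code (HumanEvalPack, `hep-*`) subtests equal weight:

```python
# Subtest results copied from the table above: name -> (score, total).
results = {
    "alpacaeval-easy": (99.0, 100), "alpacaeval-hard": (93.0, 95),
    "alpacaeval-length": (86.0, 95), "mt-bench-easy": (28.0, 28),
    "mt-bench-med": (40.0, 40),
    "llmbar-adver-GPTInst": (69.0, 92), "llmbar-adver-GPTOut": (39.0, 47),
    "llmbar-adver-manual": (32.0, 46), "llmbar-adver-neighbor": (74.0, 134),
    "llmbar-natural": (94.0, 100), "mt-bench-hard": (32.0, 37),
    "donotanswer": (73.5, 136), "refusals-dangerous": (73.5, 100),
    "refusals-offensive": (89.0, 100), "xstest-should-refuse": (140.5, 154),
    "xstest-should-respond": (245.0, 250),
    "math-prm": (357.0, 447),
    "hep-cpp": (159.0, 164), "hep-go": (159.0, 164), "hep-java": (161.0, 164),
    "hep-js": (159.0, 164), "hep-python": (158.0, 164), "hep-rust": (152.0, 164),
}

def pooled(names):
    """Pooled accuracy: summed scores over summed totals for the given subtests."""
    score = sum(results[n][0] for n in names)
    total = sum(results[n][1] for n in names)
    return score / total

chat = pooled(["alpacaeval-easy", "alpacaeval-hard", "alpacaeval-length",
               "mt-bench-easy", "mt-bench-med"])
chat_hard = pooled(["llmbar-adver-GPTInst", "llmbar-adver-GPTOut",
                    "llmbar-adver-manual", "llmbar-adver-neighbor",
                    "llmbar-natural", "mt-bench-hard"])
safety = pooled(["donotanswer", "refusals-dangerous", "refusals-offensive",
                 "xstest-should-refuse", "xstest-should-respond"])
# Reasoning averages the math accuracy with the pooled accuracy of the
# six HumanEvalPack (hep-*) splits, rather than pooling all seven totals.
reasoning = (pooled(["math-prm"]) +
             pooled(["hep-cpp", "hep-go", "hep-java",
                     "hep-js", "hep-python", "hep-rust"])) / 2

print(f"{chat:.8f} {chat_hard:.8f} {safety:.8f} {reasoning:.8f}")
```

Printing these reproduces the Chat, Chat Hard, Safety, and Reasoning rows to eight decimals, which is how the grouping was sanity-checked.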