TensorTemplar committed on
Commit 784b80d · verified · 1 Parent(s): f571a1d

Add rewardbench scores

Files changed (1)
  1. README.md +32 -0
README.md CHANGED
@@ -31,6 +31,38 @@ Instruction following compared to open-weights judge and reward models:
  | Flow-Judge | bf16 (~16)* | 78.70% ± 1.13% | 64.63% ± N/A | **60.80% ± 3.09%** | 23.44% ± 2.65% | 35.60% ± 3.03% | 52.63% | 81.5% |
  | Glider | bf16 (~8) | 78.70% ± 1.13% | 56.47% ± N/A | 59.20% ± 3.11% | **35.94% ± 3.00%** | 43.20% ± 3.14% | 54.70% | 84.8% |
 
+ [RewardBench Generative - Unverified](https://huggingface.co/spaces/allenai/reward-bench)
+
+ | Test Name             | Score | Total | Accuracy   |
+ |-----------------------|-------|-------|------------|
+ | alpacaeval-easy       | 99.0  | 100   | 0.99       |
+ | alpacaeval-hard       | 93.0  | 95    | 0.97894737 |
+ | alpacaeval-length     | 86.0  | 95    | 0.90526316 |
+ | donotanswer           | 73.5  | 136   | 0.54044118 |
+ | hep-cpp               | 159.0 | 164   | 0.96951220 |
+ | hep-go                | 159.0 | 164   | 0.96951220 |
+ | hep-java              | 161.0 | 164   | 0.98170732 |
+ | hep-js                | 159.0 | 164   | 0.96951220 |
+ | hep-python            | 158.0 | 164   | 0.96341463 |
+ | hep-rust              | 152.0 | 164   | 0.92682927 |
+ | llmbar-adver-GPTInst  | 69.0  | 92    | 0.75       |
+ | llmbar-adver-GPTOut   | 39.0  | 47    | 0.82978723 |
+ | llmbar-adver-manual   | 32.0  | 46    | 0.69565217 |
+ | llmbar-adver-neighbor | 74.0  | 134   | 0.55223881 |
+ | llmbar-natural        | 94.0  | 100   | 0.94       |
+ | math-prm              | 357.0 | 447   | 0.79865772 |
+ | mt-bench-easy         | 28.0  | 28    | 1.0        |
+ | mt-bench-hard         | 32.0  | 37    | 0.86486486 |
+ | mt-bench-med          | 40.0  | 40    | 1.0        |
+ | refusals-dangerous    | 73.5  | 100   | 0.735      |
+ | refusals-offensive    | 89.0  | 100   | 0.89       |
+ | xstest-should-refuse  | 140.5 | 154   | 0.91233766 |
+ | xstest-should-respond | 245.0 | 250   | 0.98       |
+ | Chat                  |       |       | 0.96648045 |
+ | Chat Hard             |       |       | 0.74561404 |
+ | Safety                |       |       | 0.83986486 |
+ | Reasoning             |       |       | 0.88103618 |
+
 
  [Halubench Public test-set](https://huggingface.co/datasets/PatronusAI/HaluBench):
  Rank | Model | Responses Tested | Pass@1 Rate | False - | False + | Worst Dataset | Cost estimate*
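
The four summary rows in the added RewardBench table (Chat, Chat Hard, Safety, Reasoning) carry no Score or Total, so it is not obvious how they relate to the per-subset rows. The sketch below is one way to recompute them from the table's own numbers. The subset-to-section grouping and the weighting scheme are assumptions based on the usual RewardBench leaderboard convention, not something this commit states: Chat, Chat Hard and Safety are pooled (sum of scores over sum of totals), and Reasoning is the plain mean of `math-prm` and the pooled `hep-*` subsets. The names `RESULTS`, `SECTIONS`, `HEP`, `pooled_accuracy` and `section_scores` are hypothetical.

```python
# (score, total) per subset, copied from the RewardBench table in this diff.
RESULTS = {
    "alpacaeval-easy": (99.0, 100),
    "alpacaeval-hard": (93.0, 95),
    "alpacaeval-length": (86.0, 95),
    "donotanswer": (73.5, 136),
    "hep-cpp": (159.0, 164),
    "hep-go": (159.0, 164),
    "hep-java": (161.0, 164),
    "hep-js": (159.0, 164),
    "hep-python": (158.0, 164),
    "hep-rust": (152.0, 164),
    "llmbar-adver-GPTInst": (69.0, 92),
    "llmbar-adver-GPTOut": (39.0, 47),
    "llmbar-adver-manual": (32.0, 46),
    "llmbar-adver-neighbor": (74.0, 134),
    "llmbar-natural": (94.0, 100),
    "math-prm": (357.0, 447),
    "mt-bench-easy": (28.0, 28),
    "mt-bench-hard": (32.0, 37),
    "mt-bench-med": (40.0, 40),
    "refusals-dangerous": (73.5, 100),
    "refusals-offensive": (89.0, 100),
    "xstest-should-refuse": (140.5, 154),
    "xstest-should-respond": (245.0, 250),
}

# ASSUMPTION: grouping follows the RewardBench leaderboard convention;
# the README in this commit does not spell it out.
SECTIONS = {
    "Chat": ["alpacaeval-easy", "alpacaeval-hard", "alpacaeval-length",
             "mt-bench-easy", "mt-bench-med"],
    "Chat Hard": ["mt-bench-hard", "llmbar-natural", "llmbar-adver-neighbor",
                  "llmbar-adver-GPTInst", "llmbar-adver-GPTOut",
                  "llmbar-adver-manual"],
    "Safety": ["refusals-dangerous", "refusals-offensive",
               "xstest-should-refuse", "xstest-should-respond", "donotanswer"],
}
HEP = ["hep-cpp", "hep-go", "hep-java", "hep-js", "hep-python", "hep-rust"]


def pooled_accuracy(subsets):
    """Sum of scores divided by sum of totals over the given subsets."""
    score = sum(RESULTS[name][0] for name in subsets)
    total = sum(RESULTS[name][1] for name in subsets)
    return score / total


def section_scores():
    """Recompute the four summary rows under the assumptions stated above."""
    scores = {name: pooled_accuracy(subs) for name, subs in SECTIONS.items()}
    # ASSUMPTION: Reasoning weights math and code equally.
    scores["Reasoning"] = (pooled_accuracy(["math-prm"]) + pooled_accuracy(HEP)) / 2
    return scores


if __name__ == "__main__":
    for name, value in section_scores().items():
        print(f"{name}: {value:.8f}")
```

Under these assumptions the recomputed values agree with the four summary rows of the table to eight decimal places, which suggests the summaries are size-weighted within each section rather than simple averages of the per-subset accuracies.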