Add rewardbench scores
README.md
CHANGED
@@ -31,6 +31,38 @@ Instruction following compared to open-weights judge and reward models:
| Flow-Judge | bf16 (~16)* | 78.70% ± 1.13% | 64.63% ± N/A | **60.80% ± 3.09%** | 23.44% ± 2.65% | 35.60% ± 3.03% | 52.63% | 81.5% |
| Glider | bf16 (~8) | 78.70% ± 1.13% | 56.47% ± N/A | 59.20% ± 3.11% | **35.94% ± 3.00%** | 43.20% ± 3.14% | 54.70% | 84.8% |
+[RewardBench Generative - Unverified](https://huggingface.co/spaces/allenai/reward-bench)
+
+| Test Name | Score | Total | Accuracy |
+|------------------------|-------|-------|------------|
+| alpacaeval-easy | 99.0 | 100 | 0.99 |
+| alpacaeval-hard | 93.0 | 95 | 0.97894737 |
+| alpacaeval-length | 86.0 | 95 | 0.90526316 |
+| donotanswer | 73.5 | 136 | 0.54044118 |
+| hep-cpp | 159.0 | 164 | 0.96951220 |
+| hep-go | 159.0 | 164 | 0.96951220 |
+| hep-java | 161.0 | 164 | 0.98170732 |
+| hep-js | 159.0 | 164 | 0.96951220 |
+| hep-python | 158.0 | 164 | 0.96341463 |
+| hep-rust | 152.0 | 164 | 0.92682927 |
+| llmbar-adver-GPTInst | 69.0 | 92 | 0.75 |
+| llmbar-adver-GPTOut | 39.0 | 47 | 0.82978723 |
+| llmbar-adver-manual | 32.0 | 46 | 0.69565217 |
+| llmbar-adver-neighbor | 74.0 | 134 | 0.55223881 |
+| llmbar-natural | 94.0 | 100 | 0.94 |
+| math-prm | 357.0 | 447 | 0.79865772 |
+| mt-bench-easy | 28.0 | 28 | 1.0 |
+| mt-bench-hard | 32.0 | 37 | 0.86486486 |
+| mt-bench-med | 40.0 | 40 | 1.0 |
+| refusals-dangerous | 73.5 | 100 | 0.735 |
+| refusals-offensive | 89.0 | 100 | 0.89 |
+| xstest-should-refuse | 140.5 | 154 | 0.91233766 |
+| xstest-should-respond | 245.0 | 250 | 0.98 |
+| Chat | | | 0.96648045 |
+| Chat Hard | | | 0.74561404 |
+| Safety | | | 0.83986486 |
+| Reasoning | | | 0.88103618 |
+

[Halubench Public test-set](https://huggingface.co/datasets/PatronusAI/HaluBench):
Rank | Model | Responses Tested | Pass@1 Rate | False - | False + | Worst Dataset | Cost estimate*
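The four category rows in the added table follow from the per-test rows above. A minimal sketch, assuming RewardBench's published category groupings: Chat, Chat Hard, and Safety pool raw scores over totals within the category, while Reasoning gives the math subtest and the pooled code (HumanEvalPack, `hep-*`) subtests equal weight:

```python
# Subtest results copied from the table above: name -> (score, total).
results = {
    "alpacaeval-easy": (99.0, 100), "alpacaeval-hard": (93.0, 95),
    "alpacaeval-length": (86.0, 95), "mt-bench-easy": (28.0, 28),
    "mt-bench-med": (40.0, 40),
    "llmbar-adver-GPTInst": (69.0, 92), "llmbar-adver-GPTOut": (39.0, 47),
    "llmbar-adver-manual": (32.0, 46), "llmbar-adver-neighbor": (74.0, 134),
    "llmbar-natural": (94.0, 100), "mt-bench-hard": (32.0, 37),
    "donotanswer": (73.5, 136), "refusals-dangerous": (73.5, 100),
    "refusals-offensive": (89.0, 100), "xstest-should-refuse": (140.5, 154),
    "xstest-should-respond": (245.0, 250),
    "math-prm": (357.0, 447),
    "hep-cpp": (159.0, 164), "hep-go": (159.0, 164), "hep-java": (161.0, 164),
    "hep-js": (159.0, 164), "hep-python": (158.0, 164), "hep-rust": (152.0, 164),
}

def pooled(names):
    """Pooled accuracy: summed scores over summed totals for the given subtests."""
    score = sum(results[n][0] for n in names)
    total = sum(results[n][1] for n in names)
    return score / total

chat = pooled(["alpacaeval-easy", "alpacaeval-hard", "alpacaeval-length",
               "mt-bench-easy", "mt-bench-med"])
chat_hard = pooled(["llmbar-adver-GPTInst", "llmbar-adver-GPTOut",
                    "llmbar-adver-manual", "llmbar-adver-neighbor",
                    "llmbar-natural", "mt-bench-hard"])
safety = pooled(["donotanswer", "refusals-dangerous", "refusals-offensive",
                 "xstest-should-refuse", "xstest-should-respond"])
# Reasoning averages the math accuracy with the pooled accuracy of the
# six HumanEvalPack (hep-*) splits, rather than pooling all seven totals.
reasoning = (pooled(["math-prm"]) +
             pooled(["hep-cpp", "hep-go", "hep-java",
                     "hep-js", "hep-python", "hep-rust"])) / 2

print(f"{chat:.8f} {chat_hard:.8f} {safety:.8f} {reasoning:.8f}")
```

Printing these reproduces the Chat, Chat Hard, Safety, and Reasoning rows to eight decimals, which is how the grouping was sanity-checked.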