root-signals/RootSignals-Judge-Llama-70B

TensorTemplar committed on Feb 7

Commit f571a1d · verified · 1 Parent(s): 6b6b61a

Update README.md

Add highlights for halubench

Files changed (1)

README.md +3 -3

@@ -35,11 +35,11 @@ Instruction following comared to open-weights judge and reward models:
 [Halubench Public test-set](https://huggingface.co/datasets/PatronusAI/HaluBench):
 Rank | Model | Responses Tested | Pass@1 Rate | False - | False + | Worst Dataset | Cost estimate*
 | --- | --- | --- | --- | --- | --- | --- | --- |
-1 | Root Judge (FP8), decompose, t=0.6 | 14900 | 86.26% | 596 | 1340  | Financebench | ±$33.6
+1 | Root Judge (FP8), decompose, t=0.6 | 14900 | **86.26%** | **596** | 1340  | Financebench | **$33.6**
 2 | gpt-4o-2024-05-13 | 14900 | 86.06% | 1052 | 1025 | DROP | -
 3 | o1-preview-2024-09-12, t=1 | 14899 | 85.25% | 1134 | 1063 | RagTruth | $1062.08
-4 | claude-3-5-sonnet-20240620** t=0.6 | 14797 |  85.17% | 1391 | 809 | PubMedQA | -
-5 | llama3.1:70b-instruct-q8_0 t=0.6| 13969 | 84.67%  | 769 | 1373 | DROP | ±$33.6
+4 | claude-3-5-sonnet-20240620** t=0.6 | 14797 |  85.17% | 1391 | **809** | PubMedQA | -
+5 | llama3.1:70b-instruct-q8_0 t=0.6| 13969 | 84.67%  | 769 | 1373 | DROP | **$33.6**
 6 | o1-mini-2024-09-12, t=1 | 14655 | 83.71% | 1169 | 1219 | DROP | $156.07
 7 | llama3.1:405b-instruct-q8_0 t=0.2 | 14881 | 83.58% | 1331 | 1113 | DROP | -