TensorTemplar committed on
Commit f571a1d · verified
1 Parent(s): 6b6b61a

Update README.md

Add highlights for halubench

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -35,11 +35,11 @@ Instruction following compared to open-weights judge and reward models:
  [Halubench Public test-set](https://huggingface.co/datasets/PatronusAI/HaluBench):
  Rank | Model | Responses Tested | Pass@1 Rate | False - | False + | Worst Dataset | Cost estimate*
  | --- | --- | --- | --- | --- | --- | --- | --- |
- 1 | Root Judge (FP8), decompose, t=0.6 | 14900 | 86.26% | 596 | 1340 | Financebench | ±$33.6
  2 | gpt-4o-2024-05-13 | 14900 | 86.06% | 1052 | 1025 | DROP | -
  3 | o1-preview-2024-09-12, t=1 | 14899 | 85.25% | 1134 | 1063 | RagTruth | $1062.08
- 4 | claude-3-5-sonnet-20240620** t=0.6 | 14797 | 85.17% | 1391 | 809 | PubMedQA | -
- 5 | llama3.1:70b-instruct-q8_0 t=0.6| 13969 | 84.67% | 769 | 1373 | DROP | ±$33.6
  6 | o1-mini-2024-09-12, t=1 | 14655 | 83.71% | 1169 | 1219 | DROP | $156.07
  7 | llama3.1:405b-instruct-q8_0 t=0.2 | 14881 | 83.58% | 1331 | 1113 | DROP | -
 
 
  [Halubench Public test-set](https://huggingface.co/datasets/PatronusAI/HaluBench):
  Rank | Model | Responses Tested | Pass@1 Rate | False - | False + | Worst Dataset | Cost estimate*
  | --- | --- | --- | --- | --- | --- | --- | --- |
+ 1 | Root Judge (FP8), decompose, t=0.6 | 14900 | **86.26%** | **596** | 1340 | Financebench | **$33.6**
  2 | gpt-4o-2024-05-13 | 14900 | 86.06% | 1052 | 1025 | DROP | -
  3 | o1-preview-2024-09-12, t=1 | 14899 | 85.25% | 1134 | 1063 | RagTruth | $1062.08
+ 4 | claude-3-5-sonnet-20240620** t=0.6 | 14797 | 85.17% | 1391 | **809** | PubMedQA | -
+ 5 | llama3.1:70b-instruct-q8_0 t=0.6| 13969 | 84.67% | 769 | 1373 | DROP | **$33.6**
  6 | o1-mini-2024-09-12, t=1 | 14655 | 83.71% | 1169 | 1219 | DROP | $156.07
  7 | llama3.1:405b-instruct-q8_0 t=0.2 | 14881 | 83.58% | 1331 | 1113 | DROP | -