Update README.md
Add highlights for halubench
README.md
CHANGED
@@ -35,11 +35,11 @@ Instruction following comared to open-weights judge and reward models:
 [Halubench Public test-set](https://huggingface.co/datasets/PatronusAI/HaluBench):
 Rank | Model | Responses Tested | Pass@1 Rate | False - | False + | Worst Dataset | Cost estimate*
 | --- | --- | --- | --- | --- | --- | --- | --- |
-1 | Root Judge (FP8), decompose, t=0.6 | 14900 | 86.26
+1 | Root Judge (FP8), decompose, t=0.6 | 14900 | **86.26%** | **596** | 1340 | Financebench | **$33.6**
 2 | gpt-4o-2024-05-13 | 14900 | 86.06% | 1052 | 1025 | DROP | -
 3 | o1-preview-2024-09-12, t=1 | 14899 | 85.25% | 1134 | 1063 | RagTruth | $1062.08
-4 | claude-3-5-sonnet-20240620** t=0.6 | 14797 | 85.17% | 1391 | 809 | PubMedQA | -
+4 | claude-3-5-sonnet-20240620** t=0.6 | 14797 | 85.17% | 1391 | **809** | PubMedQA | -
-5 | llama3.1:70b-instruct-q8_0 t=0.6| 13969 | 84.67% | 769 | 1373 | DROP |
+5 | llama3.1:70b-instruct-q8_0 t=0.6| 13969 | 84.67% | 769 | 1373 | DROP | **$33.6**
 6 | o1-mini-2024-09-12, t=1 | 14655 | 83.71% | 1169 | 1219 | DROP | $156.07
 7 | llama3.1:405b-instruct-q8_0 t=0.2 | 14881 | 83.58% | 1331 | 1113 | DROP | -
 
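The Pass@1 Rate column can be related to the two error-count columns. The sketch below assumes that Pass@1 is the share of tested responses that are neither false negatives nor false positives. That definition is an assumption, since the table does not state how the metric is computed, and the helper name `pass_rate` is hypothetical. Under this assumption the gpt-4o-2024-05-13 and o1-preview-2024-09-12 rows are reproduced from their counts; other rows may differ if further failure types are counted.

```python
def pass_rate(tested: int, false_neg: int, false_pos: int) -> float:
    """Percentage of tested responses judged correctly.

    Assumes every incorrect judgement is either a false negative or a
    false positive (an assumption, not a definition given in the README).
    """
    correct = tested - false_neg - false_pos
    return round(100 * correct / tested, 2)


# Counts taken from the table rows above.
print(pass_rate(14900, 1052, 1025))  # gpt-4o-2024-05-13 row -> 86.06
print(pass_rate(14899, 1134, 1063))  # o1-preview-2024-09-12 row -> 85.25
```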