---
license: mit
library_name: peft
tags:
- trl
- sft
- generated_from_trainer
base_model: microsoft/Phi-3-mini-4k-instruct
model-index:
- name: outputs
  results: []
---

## Merged Model Performance

This repository contains our PEFT adapter for hallucination evaluation, fine-tuned from microsoft/Phi-3-mini-4k-instruct.

### Hallucination Detection Metrics

Our merged model achieves the following performance on a binary classification task for detecting hallucinations in language model outputs:

```
              precision    recall  f1-score   support

           0       0.85      0.71      0.77       100
           1       0.75      0.87      0.81       100

    accuracy                           0.79       200
   macro avg       0.80      0.79      0.79       200
weighted avg       0.80      0.79      0.79       200
```
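
The report above is in the standard scikit-learn `classification_report` format. As a minimal sketch of how such a report can be reproduced from the model's raw answers (the toy data and variable names below are illustrative, not our released evaluation code), assuming label 1 marks a hallucination:

```python
from sklearn.metrics import classification_report

# Illustrative data: gold 0/1 labels (1 = hallucination) and the raw
# "yes"/"no" strings returned by the model for each evaluation sample.
gold_labels = [1, 0, 1, 0, 1]
model_answers = ["yes", "no", "no", "no", "yes"]

# Map the textual answers onto the same 0/1 labels used in the report above.
predictions = [1 if a.strip().lower().startswith("yes") else 0 for a in model_answers]
print(classification_report(gold_labels, predictions, digits=2))
```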

### Model Usage
For best results, we recommend starting with the following prompting strategy (and encourage tweaks as you see fit):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def format_input(reference, query, response):
    prompt = f"""Your job is to evaluate whether a machine learning model has hallucinated or not.
    A hallucination occurs when the response is coherent but factually incorrect, or contains
    nonsensical output that is not grounded in the provided context.
    You are given the following information:
    ####INFO####
    [Knowledge]: {reference}
    [User Input]: {query}
    [Model Response]: {response}
    ####END INFO####
    Based on the information provided, is the model output a hallucination? Respond with only "yes" or "no"
    """
    return prompt


# Load the base model and tokenizer; attach or merge this PEFT adapter on top
# (e.g. with peft.PeftModel.from_pretrained) before running the pipeline.
attn_implementation = "eager"
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,
    attn_implementation=attn_implementation,
)

text = format_input(
    reference='Walrus are the largest mammal',
    query='What is the best PC?',
    response='The best PC is the mac',
)

messages = [
    {"role": "user", "content": text}
]

pipe = pipeline(
    "text-generation",
    model=base_model,
    model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
    tokenizer=tokenizer,
)
generation_args = {
    "max_new_tokens": 2,
    "return_full_text": False,
    "temperature": 0.01,
    "do_sample": True,
}

output = pipe(messages, **generation_args)
print(f"Hallucination: {output[0]['generated_text'].strip().lower()}")
# Hallucination: yes
```
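
The generation settings are intentionally constrained: `max_new_tokens` is capped at 2 and the temperature is near zero, so the model reliably emits a terse "yes" or "no" that can be mapped directly onto the 0/1 labels used in the metrics above.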

### Comparison with Other Models

We compared our merged model's performance on the hallucination detection benchmark against several other state-of-the-art language models:

| Model                 | Precision | Recall | F1     |
|---------------------- |----------:|-------:|-------:|
| Our Merged Model      | 0.75      | 0.87   | 0.81   |
| GPT-4                 | 0.93      | 0.72   | 0.82   |
| GPT-4 Turbo           | 0.97      | 0.70   | 0.81   |
| Gemini Pro            | 0.89      | 0.53   | 0.67   |
| GPT-3.5               | 0.89      | 0.65   | 0.75   |
| GPT-3.5-turbo-instruct| 0.89      | 0.80   | 0.84   |
| Palm 2 (Text Bison)   | 1.00      | 0.44   | 0.61   |
| Claude V2             | 0.80      | 0.95   | 0.87   |

As shown in the table, our merged model achieves an F1 score of 0.81, on par with GPT-4 (0.82) and GPT-4 Turbo (0.81) and ahead of Gemini Pro, GPT-3.5, and Palm 2 (Text Bison) on this hallucination detection task.

We will continue to improve and fine-tune our merged model to achieve even better performance across various benchmarks and tasks.

Comparison scores for the other models are taken from arize/phoenix.

### Training Data

The model was fine-tuned on data from the HaluEval benchmark:

```bibtex
@misc{HaluEval,
  author  = {Junyi Li and Xiaoxue Cheng and Wayne Xin Zhao and Jian-Yun Nie and Ji-Rong Wen},
  title   = {HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models},
  year    = {2023},
  journal = {arXiv preprint arXiv:2305.11747},
  url     = {https://arxiv.org/abs/2305.11747}
}
```
 
### Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list for how they map onto a TRL run):
- learning_rate: 0.0001
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 10
- training_steps: 150
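
As a rough sketch of how these settings map onto a TRL `SFTTrainer` run: the dataset preparation, LoRA values, sequence length, and `output_dir` below are assumptions on our part; only the hyperparameters listed above come from the actual run.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder dataset: each row is a prompt rendered with the format_input
# template shown above, followed by the gold "yes"/"no" answer.
train_dataset = Dataset.from_dict({"text": ["<formatted example 1>", "<formatted example 2>"]})

args = TrainingArguments(
    output_dir="outputs",
    learning_rate=1e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,  # effective train batch size of 8
    lr_scheduler_type="linear",
    warmup_steps=10,
    max_steps=150,
    seed=42,
    # The default AdamW optimizer uses betas=(0.9, 0.999) and epsilon=1e-08.
)

# Illustrative LoRA settings; the exact adapter configuration is stored in
# this repository's adapter_config.json.
peft_config = LoraConfig(
    task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05, target_modules="all-linear"
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=1024,  # assumed; not stated in this card
)
trainer.train()
```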

### Framework versions

- PEFT 0.11.1
- Transformers 4.41.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.2
- Tokenizers 0.19.1