zli12321 committed (verified) · Commit 1487152 · Parent(s): 547babd

Upload README.md

Files changed (1): README.md (+300, -3 — replaces the previous three-line stub whose front matter contained only `license: apache-2.0`)

---
inference: false
license: mit
language:
- en
metrics:
- exact_match
- f1
- bertscore
pipeline_tag: text-classification
---
# QA-Evaluation-Metrics 📊

[![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ke23KIeHFdPWad0BModmcWKZ6jSbF5nI?usp=sharing)

> A fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models.

> `pip install qa-metrics` is all you need!
## 🎉 Latest Updates

- **Version 0.2.19 Released!**
- Paper accepted to EMNLP 2024 Findings! 🎓
- Enhanced PEDANTS with multi-pipeline support and improved edge-case handling
- Added support for OpenAI GPT-series and Claude-series models (OpenAI version > 1.0)
- Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via [deepinfra](https://deepinfra.com/models)
- Introduced a trained tiny-bert model for QA evaluation (18 MB model size)
- Added direct Huggingface model download support for TransformerMatcher
## 🚀 Quick Start

### Table of Contents
* 1. [Normalized Exact Match](#em)
* 2. [Token F1 Score](#f1)
* 3. [PEDANTS](#pedants)
* 4. [Finetuned Neural Matching](#neural)
* 5. [Prompting LLM](#llm)

### Prerequisites
- Python >= 3.6
- openai >= 1.0

### Installation
```bash
pip install qa-metrics
```
## 💡 Features

Our package offers six QA evaluation methods with varying strengths:

| Method | Best For | Cost | Correlation with Human Judgment |
|--------|----------|------|---------------------------------|
| Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
| PEDANTS | Both short- & medium-form QA | Free | Very High |
| [Neural Evaluation](https://huggingface.co/zli12321/answer_equivalence_tiny_bert) | Both short- & long-form QA | Free | High |
| [Open Source LLM Evaluation](https://huggingface.co/zli12321/prometheus2-2B) | All QA types | Free | High |
| Black-box LLM Evaluation | All QA types | Paid | Highest |
## 📖 Documentation

### 1. <a name='em'></a>Normalized Exact Match

#### Method: `em_match`
**Parameters**
- `reference_answer` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

**Returns**
- `boolean`: True if there are any exact normalized matches between gold and candidate answers

```python
from qa_metrics.em import em_match

reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
match_result = em_match(reference_answer, candidate_answer)
```
### 2. <a name='f1'></a>Token F1 Score

#### Method: `f1_score_with_precision_recall`
**Parameters**
- `reference_answer` (str): A gold (correct) answer to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

**Returns**
- `dictionary`: Contains the F1 score, precision, and recall between a gold and candidate answer

#### Method: `f1_match`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `threshold` (float): F1 score threshold for considering a match (default: 0.5)

**Returns**
- `boolean`: True if the F1 score exceeds the threshold for any gold answer

```python
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

# reference_answer and candidate_answer are reused from the example above
f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
```
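`f1_score_with_precision_recall` returns a dictionary holding the F1 score, precision, and recall. Since the key names are not spelled out above, here is a minimal sketch that simply iterates over whatever the dictionary contains:

```python
# Print every token-level statistic returned for this (gold, candidate) pair
for metric_name, value in f1_stats.items():
    print(metric_name, value)

# f1_match collapses the same comparison into a single boolean per threshold
print("F1 match above 0.5 threshold:", match_result)
```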
### 3. <a name='pedants'></a>PEDANTS

#### Method: `get_score`
**Parameters**
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `float`: The similarity score between the two strings (0 to 1)

#### Method: `get_highest_score`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score

#### Method: `get_scores`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs

#### Method: `evaluate`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `boolean`: True if the candidate answer matches any gold answer

#### Method: `get_question_type`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `question` (str): The question being evaluated

**Returns**
- `list`: The type of the question (what, who, when, how, why, which, where)

#### Method: `get_judgement_type`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `list`: A list of revised rules applicable for judging answer correctness

```python
from qa_metrics.pedant import PEDANT

# Illustrative question for the reference/candidate answers defined above
question = "Which movie is loosely based on the Brothers Grimm's \"Iron Henry\"?"

pedant = PEDANT()
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)
```
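The remaining PEDANTS methods documented above follow the same calling convention. A brief sketch reusing the variables from the example (return shapes follow the descriptions above):

```python
# Gold/candidate pair with the highest matching score
best_pair = pedant.get_highest_score(reference_answer, candidate_answer, question)

# Question type (what, who, when, how, why, which, where) and the rules
# PEDANTS considers applicable when judging this answer
question_type = pedant.get_question_type(reference_answer, question)
judgement_rules = pedant.get_judgement_type(reference_answer, candidate_answer, question)

print(best_pair, question_type, judgement_rules)
```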
### 4. <a name='neural'></a>Transformer Neural Evaluation

#### Method: `get_score`
**Parameters**
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `float`: The similarity score between the two strings (0 to 1)

#### Method: `get_highest_score`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score

#### Method: `get_scores`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs

#### Method: `transformer_match`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `boolean`: True if the transformer model considers the candidate answer equivalent to any gold answer

```python
from qa_metrics.transformerMatcher import TransformerMatcher

# Also supports `zli12321/answer_equivalence_bert`, `zli12321/answer_equivalence_distilbert`,
# `zli12321/answer_equivalence_roberta`, and `zli12321/answer_equivalence_distilroberta`
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
```
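Beyond the boolean `transformer_match`, the scoring methods documented above expose the model's confidence directly; a short sketch with the same variables:

```python
# Similarity score for a single (gold, candidate) pair
single_score = tm.get_score(reference_answer[0], candidate_answer, question)

# Scores for every gold answer, plus the best-matching pair
all_scores = tm.get_scores(reference_answer, candidate_answer, question)
best_pair = tm.get_highest_score(reference_answer, candidate_answer, question)

print(single_score, all_scores, best_pair)
```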
### 5. <a name='llm'></a>LLM Integration

#### Method: `prompt_gpt`
**Parameters**
- `prompt` (str): The input prompt text
- `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response

```python
from qa_metrics.prompt_llm import CloseLLM

model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)  # your OpenAI API key string
prompt = "..."  # the evaluation prompt you want to send
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')
```
#### Method: `prompt_claude`
**Parameters**
- `prompt` (str): The input prompt text
- `model_engine` (str): Claude model to use
- `anthropic_version` (str): API version
- `max_tokens_to_sample` (int): Maximum tokens in response
- `temperature` (float): Controls randomness (0-1)

```python
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')
```
#### Method: `prompt`
**Parameters**
- `message` (str): The input message text
- `model_engine` (str): Model to use
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response

```python
from qa_metrics.prompt_open_llm import OpenLLM

model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
```
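The prompt text is a plain string you build yourself. As one illustration only (the wording below is an assumption, not an official template shipped with this package), a correctness-judging prompt can be assembled from the question, gold answers, and candidate answer from the earlier examples and sent through any of the three methods above:

```python
# Illustrative LLM-as-judge prompt; adapt the wording to your own evaluation protocol
judge_prompt = (
    f"Question: {question}\n"
    f"Gold answers: {', '.join(reference_answer)}\n"
    f"Candidate answer: {candidate_answer}\n"
    "Is the candidate answer correct? Reply with a single word: correct or incorrect."
)

# model is the OpenLLM instance from the block above; verdict is assumed to be the raw response string
verdict = model.prompt(message=judge_prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
is_correct = verdict.strip().lower().startswith("correct")
```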
## 🤗 Model Hub

Our fine-tuned models are available on Huggingface:
- [BERT](https://huggingface.co/Zongxia/answer_equivalence_bert)
- [DistilRoBERTa](https://huggingface.co/Zongxia/answer_equivalence_distilroberta)
- [DistilBERT](https://huggingface.co/Zongxia/answer_equivalence_distilbert)
- [RoBERTa](https://huggingface.co/Zongxia/answer_equivalence_roberta)
- [Tiny-BERT](https://huggingface.co/Zongxia/answer_equivalence_tiny_bert)
- [RoBERTa-Large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large)
## 📚 Resources

- [Full Paper](https://arxiv.org/abs/2402.11161)
- [Dataset Repository](https://github.com/zli12321/Answer_Equivalence_Dataset.git)
- [Supported Models on Deepinfra](https://deepinfra.com/models)
## 📄 Citation

```bibtex
@misc{li2024pedantspreciseevaluationsdiverse,
      title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence},
      author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
      year={2024},
      eprint={2402.11161},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.11161},
}
```
## 📝 License

This project is licensed under the [MIT License](LICENSE.md).

## 📬 Contact

For questions or comments, please contact: [email protected]