semscore implementation and readme
Browse files
- README.md +43 -19
- requirements.txt +4 -1
- semscore.py +79 -32
README.md
CHANGED
@@ -3,7 +3,7 @@ title: SemScore
tags:
- evaluate
- metric
description: 'SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to strongly correlate with human judgment at the system level when evaluating the instruction-following capabilities of language models. Given a set of model-generated outputs and target completions, a pre-trained sentence transformer is used to calculate cosine similarities between them.'
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
@@ -12,37 +12,61 @@ pinned: false

# Metric Card for SemScore

## Metric Description

SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to strongly correlate with human judgment at the system level when evaluating the instruction-following capabilities of language models. Given a set of model-generated outputs and target completions, a pre-trained [sentence transformer](https://www.sbert.net) is used to calculate cosine similarities between them.
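
For intuition, here is a minimal sketch of the computation using the `sentence-transformers` library directly; the module itself implements the same idea with `transformers` and mean pooling, as shown in `semscore.py` further below. The checkpoint is the default one used by this implementation.

```python
import torch
from sentence_transformers import SentenceTransformer

# Default checkpoint used by this implementation
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

predictions = ["This is an example sentence", "Each sentence is considered"]
references = ["This is an example sentence", "Each sentence is considered"]

pred_emb = model.encode(predictions, convert_to_tensor=True)
ref_emb = model.encode(references, convert_to_tensor=True)

# SemScore: mean pairwise cosine similarity, scaled by 100
similarities = torch.nn.functional.cosine_similarity(pred_emb, ref_emb) * 100
print(round(similarities.mean().item(), 2))  # ~100.0 for identical pairs
```
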
## How to Use

When loading SemScore, you can choose any pre-trained encoder-only model uploaded to the HF Hub to compute the score. The default model (used when no `model_name` is specified) is `sentence-transformers/all-mpnet-base-v2`.

```python
import evaluate

semscore = evaluate.load("semscore", "model_name")
```

SemScore takes 2 mandatory arguments to calculate the final score:

- `predictions`: a list of strings with model predictions (e.g. instruction completions) to score.
- `references`: a list of strings with "gold" references (e.g. target completions).

It also accepts optional arguments:

- `batch_size`: the batch size for calculating the score (default value is `32`).
- `device`: the CPU/GPU device on which the score will be calculated (default value is `None`, i.e. `cpu`).

```python
predictions = ['This is an example sentence', 'Each sentence is considered']
references = ['This is an example sentence', 'Each sentence is considered']

results = semscore.compute(predictions=predictions, references=references, batch_size=2, device="cuda:0")
```

### Output Values

The output of SemScore is a dictionary with the following values:

- `semscore`: the aggregated system-level SemScore.
- `similarities`: cosine similarities between individual prediction-reference pairs.
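
For the identical prediction/reference pairs used in the example above, the returned dictionary looks roughly as follows (exact similarity values depend on the embedding model and floating-point precision):

```python
print(results)
# {'semscore': 100.0, 'similarities': [100.0, 100.0]}
```
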
#### Values from Popular Papers

The [SemScore paper](https://arxiv.org/abs/2401.17072) reports the correlation of SemScore with human ratings, in comparison to other popular metrics that rely on "gold" references for predictions as well as to reference-free LLM-based evaluation methods. The comparison is based on an evaluation of instruction-tuned LLMs.

## Limitations and Bias

One limitation of SemScore is its dependence on an underlying transformer model to compute semantic textual similarity between model and target outputs. By default, this implementation relies on the strongest sentence-transformer model as reported by the authors of the `sentence-transformers` library. However, better embedding models have become available since the publication of the SemScore paper (e.g. those listed on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard)); such a model can be swapped in as shown below.
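
Since any encoder-only checkpoint on the HF Hub can be passed as the configuration name, a different embedding model can be substituted without changing the rest of the pipeline; the checkpoint below is only an illustration, not a recommendation:

```python
import evaluate

# Any encoder-only checkpoint on the HF Hub can be passed as the config name
semscore = evaluate.load("semscore", "sentence-transformers/all-MiniLM-L6-v2")
```
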
In addition, a more general limitation is that SemScore requires at least one gold-standard target output against which to compare a generated response. This target output should be human-created or at least human-vetted.

## Citation

```bibtex
@misc{semscore,
      title={SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity},
      author={Ansar Aynetdinov and Alan Akbik},
      year={2024},
      eprint={2401.17072},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2401.17072},
}
```

## Further References

- [SemScore paper](https://arxiv.org/abs/2401.17072)

requirements.txt
CHANGED
@@ -1 +1,4 @@
git+https://github.com/huggingface/evaluate@main
torch
transformers
tqdm
semscore.py
CHANGED
@@ -11,51 +11,52 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""SemScore metric"""

import evaluate
import datasets
import torch
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm

_CITATION = """\
@misc{semscore,
      title={SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity},
      author={Ansar Aynetdinov and Alan Akbik},
      year={2024},
      eprint={2401.17072},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2401.17072},
}
"""

_DESCRIPTION = """\
SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to
strongly correlate with human judgment at the system level when evaluating instruction-tuned models.
"""


_KWARGS_DESCRIPTION = """
Computes SemScore (semantic textual similarity) between predictions and references.
Args:
    predictions (list of str): list of predictions (instruction completions) to score. Each prediction
        should be a string.
    references (list of str): list of references (target completions). Each reference should be a string.
    batch_size (int): the batch size for predictions.
    device (str): CPU/GPU device.
Returns:
    semscore: aggregated system-level SemScore,
    similarities: cosine similarities between individual prediction-reference pairs,
Examples:
    >>> predictions = ['This is an example sentence', 'Each sentence is considered']
    >>> references = ['This is an example sentence', 'Each sentence is considered']
    >>> semscore = evaluate.load("semscore")
    >>> results = semscore.compute(predictions=predictions, references=references)
    >>> print(results['semscore'])
    100.0
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class SemScore(evaluate.Metric):
@@ -83,13 +84,59 @@ class SemScore(evaluate.Metric):
    def _download_and_prepare(self, dl_manager):
        """Optional: download external resources useful to compute the scores"""
        if self.config_name is None:
            checkpoint = "sentence-transformers/all-mpnet-base-v2"
        else:
            checkpoint = self.config_name
        # Load model and tokenizer from the HuggingFace Hub
        self.model = AutoModel.from_pretrained(checkpoint)
        self.model.eval()
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def mean_pooling(self, model_output, attention_mask):
        """Mean pooling over all tokens - take the attention mask into account for correct averaging"""
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    def _compute(
        self,
        predictions,
        references,
        batch_size=32,
        device=None,
    ):
        """Returns the scores"""
        assert len(predictions) == len(references), "predictions and references should have the same length."
        if device is not None:
            if "cuda" in device:
                assert torch.cuda.is_available()
            self.model.to(device)
        else:
            device = "cpu"

        pooled_refs, pooled_preds = [], []

        with torch.inference_mode():
            for i in tqdm(range(0, len(references), batch_size), desc="Processing batches"):
                batch_refs = references[i : i + batch_size]
                batch_preds = predictions[i : i + batch_size]
                # Tokenize the current batch of references and predictions
                encoded_refs = self.tokenizer(batch_refs, padding=True, truncation=True, return_tensors='pt')
                encoded_preds = self.tokenizer(batch_preds, padding=True, truncation=True, return_tensors='pt')
                model_output_refs = self.model(**encoded_refs.to(device))
                model_output_preds = self.model(**encoded_preds.to(device))
                # Mean-pool token embeddings into sentence embeddings
                batch_pooled_refs = self.mean_pooling(model_output_refs, encoded_refs['attention_mask'])
                batch_pooled_preds = self.mean_pooling(model_output_preds, encoded_preds['attention_mask'])
                pooled_refs.append(batch_pooled_refs)
                pooled_preds.append(batch_pooled_preds)
        pooled_refs, pooled_preds = torch.cat(pooled_refs), torch.cat(pooled_preds)

        # Pairwise cosine similarities, scaled to [-100, 100]
        similarities = torch.nn.functional.cosine_similarity(pooled_refs, pooled_preds)
        similarities = similarities * 100
        semscore = torch.mean(similarities)

        return {
            "semscore": round(semscore.item(), 2),
            "similarities": similarities.tolist(),
        }
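
A quick smoke test of the module might look like the sketch below, assuming the script is saved as `semscore.py` in the working directory and that `evaluate.load` is given the path to the local metric script rather than the Hub Space:

```python
import evaluate

# Load the metric from the local script instead of the Hub Space
semscore = evaluate.load("./semscore.py")

results = semscore.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."],
)
print(results["semscore"])      # aggregated score, scaled to [-100, 100]
print(results["similarities"])  # one cosine similarity per prediction-reference pair
```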