aynetdia committed
Commit a8d5201 · 1 Parent(s): 3e3849c

semscore implementation and readme

Files changed (3):
1. README.md +43 -19
2. requirements.txt +4 -1
3. semscore.py +79 -32
README.md CHANGED
@@ -3,7 +3,7 @@ title: SemScore
 tags:
 - evaluate
 - metric
-description: 'TODO: add a description here'
+description: 'SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to strongly correlate with human judgment on a system level when evaluating the instruction-following capabilities of language models. Given a set of model-generated outputs and target completions, a pre-trained sentence transformer is used to calculate cosine similarities between them.'
 sdk: gradio
 sdk_version: 3.19.1
 app_file: app.py
@@ -12,37 +12,61 @@ pinned: false

 # Metric Card for SemScore

-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
 ## Metric Description
-*Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
+SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to strongly correlate with human judgment on a system level when evaluating the instruction-following capabilities of language models. Given a set of model-generated outputs and target completions, a pre-trained [sentence transformer](https://www.sbert.net) is used to calculate cosine similarities between them.

 ## How to Use
-*Give general statement of how to use the metric*
+When loading SemScore, you can choose any pre-trained encoder-only model uploaded to the HF Hub to compute the score. The default model (if no `model_name` is specified) is `sentence-transformers/all-mpnet-base-v2`.

-*Provide simplest possible example for using the metric*
+```python
+import evaluate

-### Inputs
-*List all input arguments in the format below*
-- **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
+semscore = evaluate.load("semscore", "model_name")
+```

-### Output Values
+SemScore takes two mandatory arguments to calculate the final score:

-*Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
+- `predictions`: a list of strings with model predictions (e.g. instruction completions) to score.
+- `references`: a list of strings with "gold" references (e.g. target completions).

-*State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
+It also accepts optional arguments:

-#### Values from Popular Papers
-*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
+- `batch_size`: the batch size for calculating the score (default value is `32`).
+- `device`: CPU/GPU device on which the score will be calculated (default value is `None`, i.e. `cpu`).
+
+```python
+predictions = ['This is an example sentence', 'Each sentence is considered']
+references = ['This is an example sentence', 'Each sentence is considered']

-### Examples
-*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
+results = semscore.compute(predictions=predictions, references=references, batch_size=2, device="cuda:0")
+```
+
+### Output Values
+The output of SemScore is a dictionary with the following values:
+
+- `semscore`: aggregated system-level SemScore.
+- `similarities`: cosine similarities between individual prediction-reference pairs.
+
+#### Values from Popular Papers
+The [SemScore paper](https://arxiv.org/abs/2401.17072) reports the correlation of SemScore with human ratings, in comparison to other popular metrics that rely on "gold" references as well as reference-free LLM-based evaluation methods. The comparison is based on the evaluation of instruction-tuned LLMs.

 ## Limitations and Bias
-*Note any known limitations or biases that the metric has, with links and references if possible.*
+One limitation of SemScore is its dependence on an underlying transformer model to compute semantic textual similarity between model and target outputs. By default, this implementation relies on the strongest sentence transformer model reported by the authors of the `sentence-transformers` library. However, better embedding models have become available since the publication of the SemScore paper (e.g. those listed on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard)).

+In addition, a more general limitation is that SemScore requires at least one gold-standard target output against which to compare a generated response. This target output should be human-created or at least human-vetted.
 ## Citation
-*Cite the source where this metric was introduced.*
+@misc{semscore,
+    title={SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity},
+    author={Ansar Aynetdinov and Alan Akbik},
+    year={2024},
+    eprint={2401.17072},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL},
+    url={https://arxiv.org/abs/2401.17072},
+}

 ## Further References
-*Add any useful further references.*
+- [SemScore paper](https://arxiv.org/abs/2401.17072)
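
Putting the usage documented in the README above together, a minimal end-to-end sketch looks as follows. It only uses the API described in the metric card (`evaluate.load`, `compute`, and the returned `semscore`/`similarities` keys); the example sentences and printed numbers are illustrative, not actual outputs:

```python
import evaluate

# Load SemScore with the default checkpoint (sentence-transformers/all-mpnet-base-v2).
semscore = evaluate.load("semscore")

predictions = ["The cat sat on the mat.", "Paris is the capital of France."]
references = ["A cat is sitting on a mat.", "The capital of France is Paris."]

# batch_size and device are optional; device=None falls back to CPU.
results = semscore.compute(predictions=predictions, references=references, batch_size=2)

print(results["semscore"])      # aggregated system-level score, e.g. 87.45 (illustrative)
print(results["similarities"])  # one cosine similarity (scaled by 100) per prediction-reference pair
```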
requirements.txt CHANGED
@@ -1 +1,4 @@
-git+https://github.com/huggingface/evaluate@main
+git+https://github.com/huggingface/evaluate@main
+torch
+transformers
+tqdm
semscore.py CHANGED
@@ -11,51 +11,52 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""TODO: Add a description here."""
+"""SemScore metric"""

 import evaluate
 import datasets
+import torch
+from transformers import AutoTokenizer, AutoModel
+from tqdm import tqdm

-
-# TODO: Add BibTeX citation
 _CITATION = """\
-@InProceedings{huggingface:module,
-title = {A great new module},
-authors={huggingface, Inc.},
-year={2020}
+@misc{semscore,
+    title={SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity},
+    author={Ansar Aynetdinov and Alan Akbik},
+    year={2024},
+    eprint={2401.17072},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL},
+    url={https://arxiv.org/abs/2401.17072},
 }
 """

-# TODO: Add description of the module here
 _DESCRIPTION = """\
-This new module is designed to solve this great ML task and is crafted with a lot of care.
+SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to
+strongly correlate with human judgment on a system level when evaluating instruction-tuned models.
 """


-# TODO: Add description of the arguments of the module here
 _KWARGS_DESCRIPTION = """
 Calculates how good are predictions given some references, using certain scores
 Args:
-    predictions: list of predictions to score. Each predictions
-        should be a string with tokens separated by spaces.
-    references: list of reference for each prediction. Each
-        reference should be a string with tokens separated by spaces.
+    predictions (list of str): list of predictions (instruction completions) to score. Each prediction
+        should be a string.
+    references (list of str): list of references (target completions). Each reference should be a string.
+    batch_size (int): the batch size for predictions.
+    device (str): CPU/GPU device.
 Returns:
-    accuracy: description of the first score,
-    another_score: description of the second score,
+    semscore: aggregated system-level SemScore,
+    similarities: cosine similarities between individual prediction-reference pairs,
 Examples:
-    Examples should be written in doctest format, and should illustrate how
-    to use the function.
-
-    >>> my_new_module = evaluate.load("my_new_module")
-    >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
-    >>> print(results)
-    {'accuracy': 1.0}
+    >>> predictions = ['This is an example sentence', 'Each sentence is considered']
+    >>> references = ['This is an example sentence', 'Each sentence is considered']
+    >>> semscore = evaluate.load("semscore")
+    >>> results = semscore.compute(predictions=predictions, references=references)
+    >>> print(results['semscore'])
+    100.0
 """

-# TODO: Define external resources urls if needed
-BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
-

 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class SemScore(evaluate.Metric):
@@ -83,13 +84,59 @@ class SemScore(evaluate.Metric):

     def _download_and_prepare(self, dl_manager):
         """Optional: download external resources useful to compute the scores"""
-        # TODO: Download external resources if needed
-        pass
+        if self.config_name is None:
+            checkpoint = "sentence-transformers/all-mpnet-base-v2"
+        else:
+            checkpoint = self.config_name
+        # Load model and tokenizer from HuggingFace Hub
+        self.model = AutoModel.from_pretrained(checkpoint)
+        self.model.eval()
+        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)

-    def _compute(self, predictions, references):
+    def mean_pooling(self, model_output, attention_mask):
+        """Mean pooling over all tokens - take attention mask into account for correct averaging"""
+        token_embeddings = model_output[0]
+        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+    def _compute(
+        self,
+        predictions,
+        references,
+        batch_size=32,
+        device=None,
+    ):
         """Returns the scores"""
-        # TODO: Compute the different scores of the module
-        accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
+
+        assert len(predictions) == len(references), "predictions and references should have the same length."
+        if device is not None:
+            if "cuda" in device:
+                assert torch.cuda.is_available()
+            self.model.to(device)
+        else:
+            device = "cpu"
+
+        pooled_refs, pooled_preds = [], []
+
+        with torch.inference_mode():
+            for i in tqdm(range(0, len(references), batch_size), desc="Processing batches"):
+                batch_refs = references[i : i + batch_size]
+                batch_preds = predictions[i : i + batch_size]
+                encoded_refs = self.tokenizer(batch_refs, padding=True, truncation=True, return_tensors='pt')
+                encoded_preds = self.tokenizer(batch_preds, padding=True, truncation=True, return_tensors='pt')
+                model_output_refs = self.model(**encoded_refs.to(device))
+                model_output_preds = self.model(**encoded_preds.to(device))
+                batch_pooled_refs = self.mean_pooling(model_output_refs, encoded_refs['attention_mask'])
+                batch_pooled_preds = self.mean_pooling(model_output_preds, encoded_preds['attention_mask'])
+                pooled_refs.append(batch_pooled_refs)
+                pooled_preds.append(batch_pooled_preds)
+            pooled_refs, pooled_preds = torch.cat(pooled_refs), torch.cat(pooled_preds)
+
+        similarities = torch.nn.functional.cosine_similarity(pooled_refs, pooled_preds)
+        similarities = similarities * 100
+        semscore = torch.mean(similarities)
+
         return {
-            "accuracy": accuracy,
+            "semscore": round(semscore.item(), 2),
+            "similarities": similarities.tolist()
         }
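
To make the `_compute` logic above easier to follow outside the `evaluate` wrapper, here is a minimal standalone sketch of the same steps (attention-masked mean pooling over token embeddings, then cosine similarity scaled by 100), assuming the default checkpoint. It is an illustration of the committed code, not part of the commit:

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "sentence-transformers/all-mpnet-base-v2"  # default checkpoint used above
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()


def embed(texts):
    """Encode a list of strings into mean-pooled sentence embeddings."""
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.inference_mode():
        output = model(**encoded)
    token_embeddings = output[0]  # last hidden state: (batch, seq_len, dim)
    mask = encoded["attention_mask"].unsqueeze(-1).expand(token_embeddings.size()).float()
    # Mean pooling that ignores padding tokens, as in mean_pooling above.
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)


preds = embed(["The cat sat on the mat."])
refs = embed(["A cat is sitting on a mat."])

similarities = torch.nn.functional.cosine_similarity(preds, refs) * 100
print(round(similarities.mean().item(), 2))  # system-level SemScore (here, over a single pair)
```

The committed `_compute` additionally batches the inputs, shows progress with `tqdm`, and can move the model to a CUDA device before encoding.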