aynetdia committed
Commit a8d5201 · 1 Parent(s): 3e3849c

semscore implementation and readme

Files changed (3):
1. README.md +43 -19
2. requirements.txt +4 -1
3. semscore.py +79 -32
README.md CHANGED
@@ -3,7 +3,7 @@ title: SemScore
 tags:
 - evaluate
 - metric
-description: 'TODO: add a description here'
+description: 'SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to strongly correlate with human judgment on a system level when evaluating the instruction-following capabilities of language models. Given a set of model-generated outputs and target completions, a pre-trained sentence transformer is used to calculate cosine similarities between them.'
 sdk: gradio
 sdk_version: 3.19.1
 app_file: app.py
@@ -12,37 +12,61 @@ pinned: false

 # Metric Card for SemScore

-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
 ## Metric Description
-*Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
+SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to strongly correlate with human judgment on a system level when evaluating the instruction-following capabilities of language models. Given a set of model-generated outputs and target completions, a pre-trained [sentence transformer](https://www.sbert.net) is used to calculate cosine similarities between them.

 ## How to Use
-*Give general statement of how to use the metric*
+When loading SemScore, you can choose any pre-trained encoder-only model uploaded to the HF Hub to compute the score. The default model (if no `model_name` is specified) is `sentence-transformers/all-mpnet-base-v2`.

-*Provide simplest possible example for using the metric*
+```python
+import evaluate

-### Inputs
-*List all input arguments in the format below*
-- **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
+semscore = evaluate.load("semscore", "model_name")
+```

-### Output Values
+SemScore takes two mandatory arguments to calculate the final score:

-*Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
+- `predictions`: a list of strings with model predictions (e.g. instruction completions) to score.
+- `references`: a list of strings with "gold" references (e.g. target completions).

-*State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
+It also accepts optional arguments:

-#### Values from Popular Papers
-*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
+- `batch_size`: the batch size for calculating the score (default value is `32`).
+- `device`: CPU/GPU device on which the score will be calculated (default value is `None`, i.e. `cpu`).
+
+```python
+predictions = ['This is an example sentence', 'Each sentence is considered']
+references = ['This is an example sentence', 'Each sentence is considered']

-### Examples
-*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
+results = semscore.compute(predictions=predictions, references=references, batch_size=2, device="cuda:0")
+```
+
+### Output Values
+The output of SemScore is a dictionary with the following values:
+
+- `semscore`: aggregated system-level SemScore.
+- `similarities`: cosine similarities between individual prediction-reference pairs.
+
+#### Values from Popular Papers
+The [SemScore paper](https://arxiv.org/abs/2401.17072) reports the correlation of SemScore with human ratings, in comparison to other popular metrics that rely on "gold" references as well as reference-free LLM-based evaluation methods. The comparison is based on the evaluation of instruction-tuned LLMs.

 ## Limitations and Bias
-*Note any known limitations or biases that the metric has, with links and references if possible.*
+One limitation of SemScore is its dependence on an underlying transformer model to compute semantic textual similarity between model and target outputs. By default, this implementation relies on the strongest sentence transformer model reported by the authors of the `sentence-transformers` library. However, better embedding models have become available since the publication of the SemScore paper (e.g. those listed on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard)).

+In addition, a more general limitation is that SemScore requires at least one gold-standard target output against which to compare a generated response. This target output should be human-created or at least human-vetted.
 ## Citation
-*Cite the source where this metric was introduced.*
+@misc{semscore,
+    title={SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity},
+    author={Ansar Aynetdinov and Alan Akbik},
+    year={2024},
+    eprint={2401.17072},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL},
+    url={https://arxiv.org/abs/2401.17072},
+}

 ## Further References
-*Add any useful further references.*
+- [SemScore paper](https://arxiv.org/abs/2401.17072)
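
Putting the usage documented in the README above together, a minimal end-to-end sketch looks as follows. It only uses the API described in the metric card (`evaluate.load`, `compute`, and the returned `semscore`/`similarities` keys); the example sentences and printed numbers are illustrative, not actual outputs:

```python
import evaluate

# Load SemScore with the default checkpoint (sentence-transformers/all-mpnet-base-v2).
semscore = evaluate.load("semscore")

predictions = ["The cat sat on the mat.", "Paris is the capital of France."]
references = ["A cat is sitting on a mat.", "The capital of France is Paris."]

# batch_size and device are optional; device=None falls back to CPU.
results = semscore.compute(predictions=predictions, references=references, batch_size=2)

print(results["semscore"])      # aggregated system-level score, e.g. 87.45 (illustrative)
print(results["similarities"])  # one cosine similarity (scaled by 100) per prediction-reference pair
```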
requirements.txt CHANGED
@@ -1 +1,4 @@
-git+https://github.com/huggingface/evaluate@main
+git+https://github.com/huggingface/evaluate@main
+torch
+transformers
+tqdm
semscore.py CHANGED
@@ -11,51 +11,52 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""TODO: Add a description here."""
+"""SemScore metric"""

 import evaluate
 import datasets
+import torch
+from transformers import AutoTokenizer, AutoModel
+from tqdm import tqdm

-
-# TODO: Add BibTeX citation
 _CITATION = """\
-@InProceedings{huggingface:module,
-title = {A great new module},
-authors={huggingface, Inc.},
-year={2020}
+@misc{semscore,
+    title={SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity},
+    author={Ansar Aynetdinov and Alan Akbik},
+    year={2024},
+    eprint={2401.17072},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL},
+    url={https://arxiv.org/abs/2401.17072},
 }
 """

-# TODO: Add description of the module here
 _DESCRIPTION = """\
-This new module is designed to solve this great ML task and is crafted with a lot of care.
+SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to
+strongly correlate with human judgment on a system level when evaluating instruction-tuned models.
 """


-# TODO: Add description of the arguments of the module here
 _KWARGS_DESCRIPTION = """
 Calculates how good are predictions given some references, using certain scores
 Args:
-    predictions: list of predictions to score. Each predictions
-        should be a string with tokens separated by spaces.
-    references: list of reference for each prediction. Each
-        reference should be a string with tokens separated by spaces.
+    predictions (list of str): list of predictions (instruction completions) to score. Each prediction
+        should be a string.
+    references (list of str): list of references (target completions). Each reference should be a string.
+    batch_size (int): the batch size for predictions.
+    device (str): CPU/GPU device.
 Returns:
-    accuracy: description of the first score,
-    another_score: description of the second score,
+    semscore: aggregated system-level SemScore,
+    similarities: cosine similarities between individual prediction-reference pairs,
 Examples:
-    Examples should be written in doctest format, and should illustrate how
-    to use the function.
-
-    >>> my_new_module = evaluate.load("my_new_module")
-    >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
-    >>> print(results)
-    {'accuracy': 1.0}
+    >>> predictions = ['This is an example sentence', 'Each sentence is considered']
+    >>> references = ['This is an example sentence', 'Each sentence is considered']
+    >>> semscore = evaluate.load("semscore")
+    >>> results = semscore.compute(predictions=predictions, references=references)
+    >>> print(results['semscore'])
+    100.0
 """

-# TODO: Define external resources urls if needed
-BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
-

 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class SemScore(evaluate.Metric):
@@ -83,13 +84,59 @@ class SemScore(evaluate.Metric):

     def _download_and_prepare(self, dl_manager):
         """Optional: download external resources useful to compute the scores"""
-        # TODO: Download external resources if needed
-        pass
+        if self.config_name is None:
+            checkpoint = "sentence-transformers/all-mpnet-base-v2"
+        else:
+            checkpoint = self.config_name
+        # Load model and tokenizer from HuggingFace Hub
+        self.model = AutoModel.from_pretrained(checkpoint)
+        self.model.eval()
+        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)

-    def _compute(self, predictions, references):
+    def mean_pooling(self, model_output, attention_mask):
+        """Mean pooling over all tokens - take attention mask into account for correct averaging"""
+        token_embeddings = model_output[0]
+        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+    def _compute(
+        self,
+        predictions,
+        references,
+        batch_size=32,
+        device=None,
+    ):
         """Returns the scores"""
-        # TODO: Compute the different scores of the module
-        accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
+
+        assert len(predictions) == len(references), "predictions and references should have the same length."
+        if device is not None:
+            if "cuda" in device:
+                assert torch.cuda.is_available()
+            self.model.to(device)
+        else:
+            device = "cpu"
+
+        pooled_refs, pooled_preds = [], []
+
+        with torch.inference_mode():
+            for i in tqdm(range(0, len(references), batch_size), desc="Processing batches"):
+                batch_refs = references[i : i + batch_size]
+                batch_preds = predictions[i : i + batch_size]
+                encoded_refs = self.tokenizer(batch_refs, padding=True, truncation=True, return_tensors='pt')
+                encoded_preds = self.tokenizer(batch_preds, padding=True, truncation=True, return_tensors='pt')
+                model_output_refs = self.model(**encoded_refs.to(device))
+                model_output_preds = self.model(**encoded_preds.to(device))
+                batch_pooled_refs = self.mean_pooling(model_output_refs, encoded_refs['attention_mask'])
+                batch_pooled_preds = self.mean_pooling(model_output_preds, encoded_preds['attention_mask'])
+                pooled_refs.append(batch_pooled_refs)
+                pooled_preds.append(batch_pooled_preds)
+            pooled_refs, pooled_preds = torch.cat(pooled_refs), torch.cat(pooled_preds)
+
+        similarities = torch.nn.functional.cosine_similarity(pooled_refs, pooled_preds)
+        similarities = similarities * 100
+        semscore = torch.mean(similarities)
+
         return {
-            "accuracy": accuracy,
+            "semscore": round(semscore.item(), 2),
+            "similarities": similarities.tolist()
         }
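
To make the `_compute` logic above easier to follow outside the `evaluate` wrapper, here is a minimal standalone sketch of the same steps (attention-masked mean pooling over token embeddings, then cosine similarity scaled by 100), assuming the default checkpoint. It is an illustration of the committed code, not part of the commit:

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "sentence-transformers/all-mpnet-base-v2"  # default checkpoint used above
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()


def embed(texts):
    """Encode a list of strings into mean-pooled sentence embeddings."""
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.inference_mode():
        output = model(**encoded)
    token_embeddings = output[0]  # last hidden state: (batch, seq_len, dim)
    mask = encoded["attention_mask"].unsqueeze(-1).expand(token_embeddings.size()).float()
    # Mean pooling that ignores padding tokens, as in mean_pooling above.
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)


preds = embed(["The cat sat on the mat."])
refs = embed(["A cat is sitting on a mat."])

similarities = torch.nn.functional.cosine_similarity(preds, refs) * 100
print(round(similarities.mean().item(), 2))  # system-level SemScore (here, over a single pair)
```

The committed `_compute` additionally batches the inputs, shows progress with `tqdm`, and can move the model to a CUDA device before encoding.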