Update Space (evaluate main: 828c6327)
- README.md +106 -4
- app.py +6 -0
- frugalscore.py +117 -0
- requirements.txt +5 -0
README.md
CHANGED

---
title:
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
---

## Metric Description

FrugalScore is a reference-based metric for Natural Language Generation (NLG) model evaluation. It is based on a distillation approach that makes it possible to learn a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance.

The FrugalScore models are obtained by continuing the pretraining of small models on a synthetic dataset constructed using summarization, backtranslation and denoising models. During training, the small models learn the internal mapping of the expensive metric, including any similarity function.
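
Each distilled checkpoint is an ordinary sequence-pair regression model, so a single pair can also be scored directly with `transformers`; this is essentially what the metric script below does in batches (a minimal sketch, assuming the default `moussaKam/frugalscore_tiny_bert-base_bert-score` checkpoint):

```python
>>> import torch
>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer
>>> ckpt = "moussaKam/frugalscore_tiny_bert-base_bert-score"
>>> tokenizer = AutoTokenizer.from_pretrained(ckpt)
>>> model = AutoModelForSequenceClassification.from_pretrained(ckpt)
>>> # The regression head outputs one logit per pair, which is the score.
>>> inputs = tokenizer("hello there", "hello world", return_tensors="pt")
>>> with torch.no_grad():
...     score = model(**inputs).logits.squeeze(-1).item()
```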

## How to use

When loading FrugalScore, you can indicate the model you wish to use to compute the score. The default model is `moussaKam/frugalscore_tiny_bert-base_bert-score`, and a full list of models can be found in the [Limitations and bias](#limitations-and-bias) section.

```python
>>> frugalscore = evaluate.load("frugalscore", "moussaKam/frugalscore_medium_bert-base_mover-score")
```

FrugalScore calculates how good the predictions are given some references, returning a score for each prediction-reference pair.

The inputs it takes are:

`predictions`: a list of strings representing the predictions to score.

`references`: a list of strings representing the references for each prediction.

Its optional arguments are:

`batch_size`: the batch size for predictions (default value is `32`).

`max_length`: the maximum sequence length (default value is `128`).

`device`: either `"gpu"` or `"cpu"` (default value is `None`, in which case a GPU is used if one is available).

```python
>>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'], batch_size=16, max_length=64, device="gpu")
```

## Output values

The output of FrugalScore is a dictionary with the list of scores for each prediction-reference pair:

```python
{'scores': [0.6307541, 0.6449357]}
```
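
The metric returns per-pair scores only; if you need a single corpus-level number, you can aggregate them yourself, e.g. with a plain mean (illustrative, not part of the metric's output):

```python
>>> # Average the per-pair scores from the example above.
>>> sum(results["scores"]) / len(results["scores"])
0.6378449
```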

### Values from popular papers

The [original FrugalScore paper](https://arxiv.org/abs/2110.08559) reported that FrugalScore-Tiny retains 97.7/94.7% of the original performance compared to [BertScore](https://huggingface.co/metrics/bertscore) while running 54 times faster and having 84 times fewer parameters.

## Examples

Maximal values (exact match between `references` and `predictions`):

```python
>>> frugalscore = evaluate.load("frugalscore")
>>> results = frugalscore.compute(predictions=['hello world'], references=['hello world'])
>>> print(results)
{'scores': [0.9891098]}
```

Partial values:

```python
>>> frugalscore = evaluate.load("frugalscore")
>>> results = frugalscore.compute(predictions=['hello world'], references=['hugging face'])
>>> print(results)
{'scores': [0.42482382]}
```

## Limitations and bias

FrugalScore is based on [BertScore](https://huggingface.co/metrics/bertscore) and [MoverScore](https://arxiv.org/abs/1909.02622), and the models used are based on the original models used for these scores.

The full list of available models for FrugalScore is:

| FrugalScore | Student | Teacher | Method |
|-------------|---------|---------|--------|
| [moussaKam/frugalscore_tiny_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_bert-base_bert-score) | BERT-tiny | BERT-Base | BERTScore |
| [moussaKam/frugalscore_small_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_small_bert-base_bert-score) | BERT-small | BERT-Base | BERTScore |
| [moussaKam/frugalscore_medium_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_bert-base_bert-score) | BERT-medium | BERT-Base | BERTScore |
| [moussaKam/frugalscore_tiny_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_roberta_bert-score) | BERT-tiny | RoBERTa-Large | BERTScore |
| [moussaKam/frugalscore_small_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_small_roberta_bert-score) | BERT-small | RoBERTa-Large | BERTScore |
| [moussaKam/frugalscore_medium_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_roberta_bert-score) | BERT-medium | RoBERTa-Large | BERTScore |
| [moussaKam/frugalscore_tiny_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_deberta_bert-score) | BERT-tiny | DeBERTa-XLarge | BERTScore |
| [moussaKam/frugalscore_small_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_small_deberta_bert-score) | BERT-small | DeBERTa-XLarge | BERTScore |
| [moussaKam/frugalscore_medium_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_deberta_bert-score) | BERT-medium | DeBERTa-XLarge | BERTScore |
| [moussaKam/frugalscore_tiny_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_tiny_bert-base_mover-score) | BERT-tiny | BERT-Base | MoverScore |
| [moussaKam/frugalscore_small_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_small_bert-base_mover-score) | BERT-small | BERT-Base | MoverScore |
| [moussaKam/frugalscore_medium_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_medium_bert-base_mover-score) | BERT-medium | BERT-Base | MoverScore |

Depending on the size of the model picked, the loading time will vary: the `tiny` models will load very quickly, whereas the `medium` ones can take several minutes, depending on your Internet connection.
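
For example, to load one of the RoBERTa-Large-based students from the table above (any checkpoint name in the first column works as the second argument):

```python
>>> frugalscore = evaluate.load("frugalscore", "moussaKam/frugalscore_small_roberta_bert-score")
```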

## Citation

```bibtex
@article{eddine2021frugalscore,
  title={FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation},
  author={Eddine, Moussa Kamal and Shang, Guokan and Tixier, Antoine J-P and Vazirgiannis, Michalis},
  journal={arXiv preprint arXiv:2110.08559},
  year={2021}
}
```

## Further References

- [Original FrugalScore code](https://github.com/moussaKam/FrugalScore)
- [FrugalScore paper](https://arxiv.org/abs/2110.08559)
app.py
ADDED

```python
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("frugalscore")
launch_gradio_widget(module)
```
frugalscore.py
ADDED

```python
# Copyright 2022 The HuggingFace Datasets Authors and the current metric script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""FrugalScore metric."""

import datasets
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

import evaluate


_CITATION = """\
@article{eddine2021frugalscore,
  title={FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation},
  author={Eddine, Moussa Kamal and Shang, Guokan and Tixier, Antoine J-P and Vazirgiannis, Michalis},
  journal={arXiv preprint arXiv:2110.08559},
  year={2021}
}
"""

_DESCRIPTION = """\
FrugalScore is a reference-based metric for NLG model evaluation. It is based on a distillation approach that makes it possible to learn a fixed, low-cost version of any expensive NLG metric, while retaining most of its original performance.
"""


_KWARGS_DESCRIPTION = """
Calculates how good the predictions are given some references, using certain scores.
Args:
    predictions (list of str): list of predictions to score. Each prediction
        should be a string.
    references (list of str): list of references, one for each prediction. Each
        reference should be a string.
    batch_size (int): the batch size for predictions.
    max_length (int): maximum sequence length.
    device (str): either "gpu" or "cpu"
Returns:
    scores (list of float): list of scores.
Examples:
    >>> frugalscore = evaluate.load("frugalscore")
    >>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'])
    >>> print([round(s, 3) for s in results["scores"]])
    [0.631, 0.645]
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class FRUGALSCORE(evaluate.EvaluationModule):
    def _info(self):
        return evaluate.EvaluationModuleInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                    "references": datasets.Value("string"),
                }
            ),
            homepage="https://github.com/moussaKam/FrugalScore",
        )

    def _download_and_prepare(self, dl_manager):
        # The config name selects which distilled student checkpoint to load.
        if self.config_name == "default":
            checkpoint = "moussaKam/frugalscore_tiny_bert-base_bert-score"
        else:
            checkpoint = self.config_name
        self.model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def _compute(
        self,
        predictions,
        references,
        batch_size=32,
        max_length=128,
        device=None,
    ):
        """Returns the scores"""
        assert len(predictions) == len(
            references
        ), "predictions and references should have the same number of sentences."
        if device is not None:
            assert device in ["gpu", "cpu"], "device should be either gpu or cpu."
        else:
            device = "gpu" if torch.cuda.is_available() else "cpu"
        training_args = TrainingArguments(
            "trainer",
            fp16=(device == "gpu"),
            per_device_eval_batch_size=batch_size,
            report_to="all",
            no_cuda=(device == "cpu"),
            log_level="warning",
        )
        dataset = {"sentence1": predictions, "sentence2": references}
        raw_datasets = datasets.Dataset.from_dict(dataset)

        def tokenize_function(data):
            return self.tokenizer(
                data["sentence1"], data["sentence2"], max_length=max_length, truncation=True, padding=True
            )

        tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
        # remove_columns returns a new dataset rather than modifying in place,
        # so the result must be reassigned for the columns to actually be dropped.
        tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2"])
        # A bare Trainer is used purely for batched inference; the regression
        # head of the student model outputs one score per sentence pair.
        trainer = Trainer(self.model, training_args, tokenizer=self.tokenizer)
        predictions = trainer.predict(tokenized_datasets)
        return {"scores": list(predictions.predictions.squeeze(-1))}
```
requirements.txt
ADDED

```
# TODO: fix github to release
git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
datasets~=2.0
torch
transformers
```