AmitMY committed on
Commit
361b7e8
·
1 Parent(s): 4b64f2c
Files changed (4)
  1. README.md +22 -30
  2. requirements.txt +2 -1
  3. signwriting_similarity.py +95 -51
  4. tests.py +43 -11
README.md CHANGED
@@ -1,48 +1,40 @@
- ---
- title: SignWriting Similarity
- tags:
- - evaluate
- - metric
- description: "TODO: add a description here"
- sdk: gradio
- sdk_version: 3.19.1
- app_file: app.py
- pinned: false
- ---
-
  # Metric Card for SignWriting Similarity

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

  ## How to Use
- *Give general statement of how to use the metric*
-
- *Provide simplest possible example for using the metric*

  ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*

  ### Output Values

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*

- #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

- ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

  ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*

  ## Citation
- *Cite the source where this metric was introduced.*

- ## Further References
- *Add any useful further references.*

  # Metric Card for SignWriting Similarity

  ## Metric Description
+ The Symbol Distance Metric is a novel evaluation metric specifically designed for SignWriting, a visual writing system for signed languages. Unlike traditional string-based metrics (e.g., BLEU, chrF), this metric directly considers the visual and spatial properties of individual symbols used in SignWriting, such as base shape, orientation, rotation, and position. It is primarily used to evaluate model outputs in SignWriting transcription and translation tasks, offering a similarity score between a predicted and a reference sign.

  ## How to Use
+ The metric is used by passing two SignWriting signs (as sets of symbols) and computing a similarity score that reflects how closely they match in terms of symbol content and layout.
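When the metric is used through `evaluate`, as in the doctests in `signwriting_similarity.py`, each sign is passed as an FSW string. A minimal sketch, assuming the module is loadable under the identifier `signwriting_similarity` (the exact identifier may differ depending on where the module is hosted):

```python
import evaluate

# Load the metric module (identifier may differ depending on hosting).
metric = evaluate.load("signwriting_similarity")

# One predicted sign and a list of reference signs per prediction, as FSW strings.
predictions = ["M530x538S37602508x462S15a11493x494S20e00488x510S22f03469x517"]
references = [["M519x534S37900497x466S3770b497x485S15a51491x501S22f03481x513"]]

results = metric.compute(predictions=predictions, references=references)
print(results)  # {'score': 0.5509574768254414}
```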
 
 
  ### Inputs
+
+ * **hypothesis** *(List\[Symbol]):* The output sign, represented as a list of symbols with visual and spatial properties.
+ * **reference** *(List\[Symbol]):* The gold/reference sign, in the same format.
+ * **alpha** *(float, default=2.0):* Controls exponential scaling of symbol distance normalization.
+ * **beta** *(float, default=2.0):* Controls the penalty for sign length mismatches.
+ * **gamma** *(float, default=1.0):* Controls final exponential scaling of the overall score.

  ### Output Values

+ Returns a dictionary like:
+
+ ```python
+ {"score": 0.83}
+ ```
+
+ This metric outputs a score between 0 and 1:
+
+ * **1.0**: Perfect similarity (identical signs)
+ * **0.0**: Complete dissimilarity
+
+ Higher scores are better. A score above 0.8 is typically considered very good for single sign comparisons.

  ## Limitations and Bias
+
+ * The metric relies on a manually defined distance function for symbol attributes, which may not fully capture perceptual similarity.
+ * Performance has primarily been validated qualitatively; quantitative alignment with human judgment is ongoing.
+ * It assumes symbol independence and uses a Hungarian matching algorithm (sketched below), which may miss some higher-order structural patterns in complex signs.
+ * Currently more suitable for evaluating single signs than continuous signing sequences.
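The Hungarian matching step mentioned above can be illustrated with a toy sketch: symbols from the hypothesis and reference are paired by minimizing a symbol-to-symbol cost matrix. The distance function below is hypothetical and far simpler than the one in signwriting-evaluation; it only demonstrates the assignment step.

```python
# Illustrative only: a toy version of the Hungarian-matching step.
# The real symbol distance in signwriting-evaluation also accounts for
# orientation, rotation, and other symbol attributes.
import numpy as np
from scipy.optimize import linear_sum_assignment

def toy_symbol_distance(a, b):
    shape_cost = 0.0 if a["shape"] == b["shape"] else 1.0                  # base shape mismatch
    position_cost = (abs(a["x"] - b["x"]) + abs(a["y"] - b["y"])) / 100.0  # spatial offset
    return shape_cost + position_cost

hypothesis = [{"shape": "S376", "x": 508, "y": 462}, {"shape": "S15a", "x": 493, "y": 494}]
reference = [{"shape": "S379", "x": 497, "y": 466}, {"shape": "S15a", "x": 491, "y": 501}]

cost = np.array([[toy_symbol_distance(h, r) for r in reference] for h in hypothesis])
rows, cols = linear_sum_assignment(cost)  # optimal one-to-one symbol assignment
print(list(zip(rows.tolist(), cols.tolist())), cost[rows, cols].mean())
```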

  ## Citation

+ Amit Moryossef, Rotem Zilberman, Ohad Langer (2024). *Effective Sign Language Evaluation via SignWriting*. [arXiv:2410.13668](https://arxiv.org/abs/2410.13668)
requirements.txt CHANGED
@@ -1 +1,2 @@
- git+https://github.com/huggingface/evaluate@main
+ git+https://github.com/huggingface/evaluate@main
+ git+https://github.com/sign-language-processing/signwriting-evaluation
signwriting_similarity.py CHANGED
@@ -11,85 +11,129 @@
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
- """TODO: Add a description here."""

  import evaluate
  import datasets

-
- # TODO: Add BibTeX citation
  _CITATION = """\
- @InProceedings{huggingface:module,
-     title = {A great new module},
-     authors={huggingface, Inc.},
-     year={2020}
  }
  """

- # TODO: Add description of the module here
  _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
  """

-
- # TODO: Add description of the arguments of the module here
  _KWARGS_DESCRIPTION = """
- Calculates how good are predictions given some references, using certain scores
  Args:
-     predictions: list of predictions to score. Each predictions
-         should be a string with tokens separated by spaces.
-     references: list of reference for each prediction. Each
-         reference should be a string with tokens separated by spaces.
  Returns:
-     accuracy: description of the first score,
-     another_score: description of the second score,
  Examples:
-     Examples should be written in doctest format, and should illustrate how
-     to use the function.

-     >>> my_new_module = evaluate.load("my_new_module")
-     >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
-     >>> print(results)
-     {'accuracy': 1.0}
- """

- # TODO: Define external resources urls if needed
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class SignWritingSimilarity(evaluate.Metric):
-     """TODO: Short description of my evaluation module."""

      def _info(self):
-         # TODO: Specifies the evaluate.EvaluationModuleInfo object
          return evaluate.MetricInfo(
-             # This is the description that will appear on the modules page.
              module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
-             # This defines the format of each prediction and reference
-             features=datasets.Features({
-                 'predictions': datasets.Value('int64'),
-                 'references': datasets.Value('int64'),
-             }),
-             # Homepage of the module for documentation
-             homepage="http://module.homepage",
-             # Additional links to the codebase or references
-             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-             reference_urls=["http://path.to.reference.url/new_module"]
          )

-     def _download_and_prepare(self, dl_manager):
-         """Optional: download external resources useful to compute the scores"""
-         # TODO: Download external resources if needed
-         pass
-
      def _compute(self, predictions, references):
-         """Returns the scores"""
-         # TODO: Compute the different scores of the module
-         accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
-         return {
-             "accuracy": accuracy,
-         }
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
+ """SignWriting Similarity metric from the signwriting-evaluation package"""

  import evaluate
  import datasets
+ from signwriting_evaluation.metrics.similarity import SignWritingSimilarityMetric

  _CITATION = """\
+ @misc{moryossef2024signwritingevaluationeffectivesignlanguage,
+     title={signwriting-evaluation: Effective Sign Language Evaluation via SignWriting},
+     author={Amit Moryossef and Rotem Zilberman and Ohad Langer},
+     year={2024},
+     eprint={2410.13668},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2410.13668},
  }
  """

  _DESCRIPTION = """\
+ SignWriting Similarity metric from the signwriting-evaluation package
  """

  _KWARGS_DESCRIPTION = """
+ Produces similarity scores for hypotheses given reference translations.
+
  Args:
+     predictions (list of str):
+         The predicted sentences.
+     references (list of list of str):
+         The references. There should be one reference sub-list for each prediction sentence.
  Returns:
+     score (float): The similarity score between 0 and 1

  Examples:
+     Example 1 -- basic similarity score:
+         >>> predictions = ["M530x538S37602508x462S15a11493x494S20e00488x510S22f03469x517"]
+         >>> references = [["M519x534S37900497x466S3770b497x485S15a51491x501S22f03481x513"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 0.5509574768254414}

+     Example 2 -- identical signs in different order:
+         >>> predictions = ["M530x538S37602508x462S15a11493x494S20e00488x510S22f03469x517"]
+         >>> references = [["M530x538S22f03469x517S37602508x462S20e00488x510S15a11493x494"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 1.0}
+
+     Example 3 -- slightly different symbols:
+         >>> predictions = ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"]
+         >>> references = [["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 0.8326259781509948}
+
+     Example 4 -- multiple references, one good and one bad:
+         >>> predictions = ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"]
+         >>> references = [["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"], ["M530x538S17600508x462"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 0.8326259781509948}

+     Example 5 -- multiple signs in hypothesis:
+         >>> predictions = ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517 M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"]
+         >>> references = [["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 0.4163129890754974}
+
+     Example 6 -- sign order does not affect similarity:
+         >>> predictions = ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517 M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"]
+         >>> references = [["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517 M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 1.0}
+
+     Example 7 -- invalid FSW input should result in 0 score:
+         >>> predictions = ["M<s><s>M<s>p483"]
+         >>> references = [["M<s><s>M<s>p483"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 0.0}
+ """


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class SignWritingSimilarity(evaluate.Metric):
+     metric = SignWritingSimilarityMetric()

      def _info(self):
          return evaluate.MetricInfo(
              module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
+             homepage="https://github.com/sign-language-processing/signwriting-evaluation",
+             features=[
+                 datasets.Features(
+                     {
+                         "predictions": datasets.Value("string", id="sequence"),
+                         "references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
+                     }
+                 ),
+                 datasets.Features(
+                     {
+                         "predictions": datasets.Value("string", id="sequence"),
+                         "references": datasets.Value("string", id="sequence"),
+                     }
+                 ),
+             ],
+             codebase_urls=["https://github.com/sign-language-processing/signwriting-evaluation"],
+             reference_urls=[
+                 "https://github.com/sign-language-processing/signwriting-evaluation",
+             ],
          )

      def _compute(self, predictions, references):
+         score = self.metric.corpus_score(predictions, references)
+
+         return {"score": score}
tests.py CHANGED
@@ -1,17 +1,49 @@
  test_cases = [
      {
-         "predictions": [0, 0],
-         "references": [1, 1],
-         "result": {"metric_score": 0}
      },
      {
-         "predictions": [1, 1],
-         "references": [1, 1],
-         "result": {"metric_score": 1}
      },
      {
-         "predictions": [1, 0],
-         "references": [1, 1],
-         "result": {"metric_score": 0.5}
-     }
- ]

  test_cases = [
      {
+         "predictions": ["M530x538S37602508x462S15a11493x494S20e00488x510S22f03469x517"],
+         "references": ["M519x534S37900497x466S3770b497x485S15a51491x501S22f03481x513"],
+         "result": {"score": 0.5509574768254414},
      },
      {
+         "predictions": ["M530x538S37602508x462S15a11493x494S20e00488x510S22f03469x517"],
+         "references": ["M530x538S22f03469x517S37602508x462S20e00488x510S15a11493x494"],
+         "result": {"score": 1.0},
      },
      {
+         "predictions": ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"],
+         "references": ["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"],
+         "result": {"score": 0.8326259781509948},
+     },
+     {
+         "predictions": ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"],
+         "references": [
+             "M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517",
+             "M530x538S17600508x462"
+         ],
+         "result": {"score": 0.8326259781509948},
+     },
+     {
+         "predictions": [
+             "M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517 "
+             "M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"
+         ],
+         "references": ["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"],
+         "result": {"score": 0.4163129890754974},
+     },
+     {
+         "predictions": [
+             "M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517 "
+             "M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"
+         ],
+         "references": [
+             "M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517 "
+             "M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"
+         ],
+         "result": {"score": 1.0},
+     },
+     {
+         "predictions": ["M<s><s>M<s>p483"],
+         "references": ["M<s><s>M<s>p483"],
+         "result": {"score": 0.0},
+     },
+ ]