AmitMY committed on
Commit
361b7e8
·
1 Parent(s): 4b64f2c
Files changed (4)
  1. README.md +22 -30
  2. requirements.txt +2 -1
  3. signwriting_similarity.py +95 -51
  4. tests.py +43 -11
README.md CHANGED
@@ -1,48 +1,40 @@
- ---
- title: SignWriting Similarity
- tags:
- - evaluate
- - metric
- description: "TODO: add a description here"
- sdk: gradio
- sdk_version: 3.19.1
- app_file: app.py
- pinned: false
- ---
-
  # Metric Card for SignWriting Similarity

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

  ## How to Use
- *Give general statement of how to use the metric*
-
- *Provide simplest possible example for using the metric*

  ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*

  ### Output Values

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*

- #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

- ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

  ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*

  ## Citation
- *Cite the source where this metric was introduced.*

- ## Further References
- *Add any useful further references.*

  # Metric Card for SignWriting Similarity

  ## Metric Description
+ The Symbol Distance Metric is a novel evaluation metric specifically designed for SignWriting, a visual writing system for signed languages. Unlike traditional string-based metrics (e.g., BLEU, chrF), this metric directly considers the visual and spatial properties of individual symbols used in SignWriting, such as base shape, orientation, rotation, and position. It is primarily used to evaluate model outputs in SignWriting transcription and translation tasks, offering a similarity score between a predicted and a reference sign.

  ## How to Use
+ The metric is used by passing two SignWriting signs (as sets of symbols) and computing a similarity score that reflects how closely they match in terms of symbol content and layout.
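When the metric is used through `evaluate`, as in the doctests in `signwriting_similarity.py`, each sign is passed as an FSW string. A minimal sketch, assuming the module is loadable under the identifier `signwriting_similarity` (the exact identifier may differ depending on where the module is hosted):

```python
import evaluate

# Load the metric module (identifier may differ depending on hosting).
metric = evaluate.load("signwriting_similarity")

# One predicted sign and a list of reference signs per prediction, as FSW strings.
predictions = ["M530x538S37602508x462S15a11493x494S20e00488x510S22f03469x517"]
references = [["M519x534S37900497x466S3770b497x485S15a51491x501S22f03481x513"]]

results = metric.compute(predictions=predictions, references=references)
print(results)  # {'score': 0.5509574768254414}
```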
 
 
  ### Inputs
+
+ * **hypothesis** *(List\[Symbol]):* The output sign, represented as a list of symbols with visual and spatial properties.
+ * **reference** *(List\[Symbol]):* The gold/reference sign, in the same format.
+ * **alpha** *(float, default=2.0):* Controls exponential scaling of symbol distance normalization.
+ * **beta** *(float, default=2.0):* Controls the penalty for sign length mismatches.
+ * **gamma** *(float, default=1.0):* Controls final exponential scaling of the overall score.

  ### Output Values

+ Returns a dictionary like:
+
+ ```python
+ {"score": 0.83}
+ ```
+
+ This metric outputs a score between 0 and 1:
+
+ * **1.0**: Perfect similarity (identical signs)
+ * **0.0**: Complete dissimilarity
+
+ Higher scores are better. A score above 0.8 is typically considered very good for single sign comparisons.

  ## Limitations and Bias
+
+ * The metric relies on a manually defined distance function for symbol attributes, which may not fully capture perceptual similarity.
+ * Performance has primarily been validated qualitatively; quantitative alignment with human judgment is ongoing.
+ * It assumes symbol independence and uses a Hungarian matching algorithm (sketched below), which may miss some higher-order structural patterns in complex signs.
+ * Currently more suitable for evaluating single signs than continuous signing sequences.
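The Hungarian matching step mentioned above can be illustrated with a toy sketch: symbols from the hypothesis and reference are paired by minimizing a symbol-to-symbol cost matrix. The distance function below is hypothetical and far simpler than the one in signwriting-evaluation; it only demonstrates the assignment step.

```python
# Illustrative only: a toy version of the Hungarian-matching step.
# The real symbol distance in signwriting-evaluation also accounts for
# orientation, rotation, and other symbol attributes.
import numpy as np
from scipy.optimize import linear_sum_assignment

def toy_symbol_distance(a, b):
    shape_cost = 0.0 if a["shape"] == b["shape"] else 1.0                  # base shape mismatch
    position_cost = (abs(a["x"] - b["x"]) + abs(a["y"] - b["y"])) / 100.0  # spatial offset
    return shape_cost + position_cost

hypothesis = [{"shape": "S376", "x": 508, "y": 462}, {"shape": "S15a", "x": 493, "y": 494}]
reference = [{"shape": "S379", "x": 497, "y": 466}, {"shape": "S15a", "x": 491, "y": 501}]

cost = np.array([[toy_symbol_distance(h, r) for r in reference] for h in hypothesis])
rows, cols = linear_sum_assignment(cost)  # optimal one-to-one symbol assignment
print(list(zip(rows.tolist(), cols.tolist())), cost[rows, cols].mean())
```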

  ## Citation

+ Amit Moryossef, Rotem Zilberman, Ohad Langer (2024). *Effective Sign Language Evaluation via SignWriting*. [arXiv:2410.13668](https://arxiv.org/abs/2410.13668)
requirements.txt CHANGED
@@ -1 +1,2 @@
- git+https://github.com/huggingface/evaluate@main
+ git+https://github.com/huggingface/evaluate@main
+ git+https://github.com/sign-language-processing/signwriting-evaluation
signwriting_similarity.py CHANGED
@@ -11,85 +11,129 @@
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
- """TODO: Add a description here."""

  import evaluate
  import datasets

-
- # TODO: Add BibTeX citation
  _CITATION = """\
- @InProceedings{huggingface:module,
-     title = {A great new module},
-     authors={huggingface, Inc.},
-     year={2020}
  }
  """

- # TODO: Add description of the module here
  _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
  """

-
- # TODO: Add description of the arguments of the module here
  _KWARGS_DESCRIPTION = """
- Calculates how good are predictions given some references, using certain scores
  Args:
-     predictions: list of predictions to score. Each predictions
-         should be a string with tokens separated by spaces.
-     references: list of reference for each prediction. Each
-         reference should be a string with tokens separated by spaces.
  Returns:
-     accuracy: description of the first score,
-     another_score: description of the second score,
  Examples:
-     Examples should be written in doctest format, and should illustrate how
-     to use the function.

-     >>> my_new_module = evaluate.load("my_new_module")
-     >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
-     >>> print(results)
-     {'accuracy': 1.0}
- """

- # TODO: Define external resources urls if needed
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class SignWritingSimilarity(evaluate.Metric):
-     """TODO: Short description of my evaluation module."""

      def _info(self):
-         # TODO: Specifies the evaluate.EvaluationModuleInfo object
          return evaluate.MetricInfo(
-             # This is the description that will appear on the modules page.
              module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
-             # This defines the format of each prediction and reference
-             features=datasets.Features({
-                 'predictions': datasets.Value('int64'),
-                 'references': datasets.Value('int64'),
-             }),
-             # Homepage of the module for documentation
-             homepage="http://module.homepage",
-             # Additional links to the codebase or references
-             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-             reference_urls=["http://path.to.reference.url/new_module"]
          )

-     def _download_and_prepare(self, dl_manager):
-         """Optional: download external resources useful to compute the scores"""
-         # TODO: Download external resources if needed
-         pass
-
      def _compute(self, predictions, references):
-         """Returns the scores"""
-         # TODO: Compute the different scores of the module
-         accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
-         return {
-             "accuracy": accuracy,
-         }
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
+ """SignWriting Similarity metric from the signwriting-evaluation package"""

  import evaluate
  import datasets
+ from signwriting_evaluation.metrics.similarity import SignWritingSimilarityMetric

  _CITATION = """\
+ @misc{moryossef2024signwritingevaluationeffectivesignlanguage,
+     title={signwriting-evaluation: Effective Sign Language Evaluation via SignWriting},
+     author={Amit Moryossef and Rotem Zilberman and Ohad Langer},
+     year={2024},
+     eprint={2410.13668},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2410.13668},
  }
  """

  _DESCRIPTION = """\
+ SignWriting Similarity metric from the signwriting-evaluation package
  """

  _KWARGS_DESCRIPTION = """
+ Produces similarity scores for hypotheses given reference translations.
+
  Args:
+     predictions (list of str):
+         The predicted sentences.
+     references (list of list of str):
+         The references. There should be one reference sub-list for each prediction sentence.
  Returns:
+     score (float): The similarity score between 0 and 1

  Examples:
+     Example 1 -- basic similarity score:
+         >>> predictions = ["M530x538S37602508x462S15a11493x494S20e00488x510S22f03469x517"]
+         >>> references = [["M519x534S37900497x466S3770b497x485S15a51491x501S22f03481x513"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 0.5509574768254414}

+     Example 2 -- identical signs in different order:
+         >>> predictions = ["M530x538S37602508x462S15a11493x494S20e00488x510S22f03469x517"]
+         >>> references = [["M530x538S22f03469x517S37602508x462S20e00488x510S15a11493x494"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 1.0}
+
+     Example 3 -- slightly different symbols:
+         >>> predictions = ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"]
+         >>> references = [["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 0.8326259781509948}
+
+     Example 4 -- multiple references, one good and one bad:
+         >>> predictions = ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"]
+         >>> references = [["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"], ["M530x538S17600508x462"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 0.8326259781509948}

+     Example 5 -- multiple signs in hypothesis:
+         >>> predictions = ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517 M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"]
+         >>> references = [["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 0.4163129890754974}
+
+     Example 6 -- sign order does not affect similarity:
+         >>> predictions = ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517 M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"]
+         >>> references = [["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517 M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 1.0}
+
+     Example 7 -- invalid FSW input should result in 0 score:
+         >>> predictions = ["M<s><s>M<s>p483"]
+         >>> references = [["M<s><s>M<s>p483"]]
+         >>> metric = evaluate.load("signwriting_similarity")
+         >>> results = metric.compute(predictions=predictions, references=references)
+         >>> print(results)
+         {'score': 0.0}
+ """


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class SignWritingSimilarity(evaluate.Metric):
+     metric = SignWritingSimilarityMetric()

      def _info(self):
          return evaluate.MetricInfo(
              module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
+             homepage="https://github.com/sign-language-processing/signwriting-evaluation",
+             features=[
+                 datasets.Features(
+                     {
+                         "predictions": datasets.Value("string", id="sequence"),
+                         "references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
+                     }
+                 ),
+                 datasets.Features(
+                     {
+                         "predictions": datasets.Value("string", id="sequence"),
+                         "references": datasets.Value("string", id="sequence"),
+                     }
+                 ),
+             ],
+             codebase_urls=["https://github.com/sign-language-processing/signwriting-evaluation"],
+             reference_urls=[
+                 "https://github.com/sign-language-processing/signwriting-evaluation",
+             ],
          )

      def _compute(self, predictions, references):
+         score = self.metric.corpus_score(predictions, references)
+
+         return {"score": score}
tests.py CHANGED
@@ -1,17 +1,49 @@
  test_cases = [
      {
-         "predictions": [0, 0],
-         "references": [1, 1],
-         "result": {"metric_score": 0}
      },
      {
-         "predictions": [1, 1],
-         "references": [1, 1],
-         "result": {"metric_score": 1}
      },
      {
-         "predictions": [1, 0],
-         "references": [1, 1],
-         "result": {"metric_score": 0.5}
-     }
- ]

  test_cases = [
      {
+         "predictions": ["M530x538S37602508x462S15a11493x494S20e00488x510S22f03469x517"],
+         "references": ["M519x534S37900497x466S3770b497x485S15a51491x501S22f03481x513"],
+         "result": {"score": 0.5509574768254414},
      },
      {
+         "predictions": ["M530x538S37602508x462S15a11493x494S20e00488x510S22f03469x517"],
+         "references": ["M530x538S22f03469x517S37602508x462S20e00488x510S15a11493x494"],
+         "result": {"score": 1.0},
      },
      {
+         "predictions": ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"],
+         "references": ["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"],
+         "result": {"score": 0.8326259781509948},
+     },
+     {
+         "predictions": ["M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"],
+         "references": [
+             "M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517",
+             "M530x538S17600508x462"
+         ],
+         "result": {"score": 0.8326259781509948},
+     },
+     {
+         "predictions": [
+             "M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517 "
+             "M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"
+         ],
+         "references": ["M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"],
+         "result": {"score": 0.4163129890754974},
+     },
+     {
+         "predictions": [
+             "M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517 "
+             "M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517"
+         ],
+         "references": [
+             "M530x538S17600508x462S12a11493x494S20e00488x510S22f13469x517 "
+             "M530x538S17600508x462S15a11493x494S20e00488x510S22f03469x517"
+         ],
+         "result": {"score": 1.0},
+     },
+     {
+         "predictions": ["M<s><s>M<s>p483"],
+         "references": ["M<s><s>M<s>p483"],
+         "result": {"score": 0.0},
+     },
+ ]