ginic committed
Commit 50602cd · Parent: 0a94375

Added examples and input/outputs in README

Files changed (2):
  1. README.md +67 -15
  2. phone_distance.py +6 -7
README.md CHANGED
@@ -15,37 +15,89 @@ pinned: false
 
 # Metric Card for Phone Distance
 
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
 ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
 
 ## How to Use
- *Give general statement of how to use the metric*
- *Provide simplest possible example for using the metric*
 
 ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
 
- ### Output Values
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
 
 #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
 
 ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
 
 ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*
 
 ## Citation
- *Cite the source where this metric was introduced.*
 
 ## Further References
- *Add any useful further references.*
 
 # Metric Card for Phone Distance
 
 ## Metric Description
+ Measures of distance in terms of articulatory phonological features can help quantify differences between strings in the International Phonetic Alphabet (IPA) in a linguistically motivated way.
+ This is useful when evaluating speech recognition or orthographic-to-IPA conversion tasks. These are Levenshtein distances for comparing strings where the smallest unit of measurement is a phone or an articulatory phonological feature, rather than a Unicode character.
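To make the phone-versus-character distinction concrete, here is a minimal illustrative sketch (not this module's implementation, which relies on panphon for segmentation and feature vectors): a plain Levenshtein distance whose units are pre-segmented phones, so a multi-character phone such as `pʰ` counts as one unit. The example words and segmentations below are hypothetical.

```python
# Illustrative sketch only: edit distance over phone segments instead of characters.
def levenshtein(a: list[str], b: list[str]) -> int:
    """Standard Levenshtein distance over arbitrary token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# Hand-segmented phones; the real module segments IPA strings for you.
print(levenshtein(list("pʰat"), list("bat")))          # 2 edits at the character level
print(levenshtein(["pʰ", "a", "t"], ["b", "a", "t"]))  # 1 edit at the phone level
```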
 ## How to Use
 
+ ```python
+ import evaluate
+ phone_distance = evaluate.load("ginic/phone_distance")
+ phone_distance.compute(predictions=["bob", "ði"], references=["pop", "ðə"])
+ ```
 
 ### Inputs
+ - **predictions** (`list` of `str`): Transcriptions to score.
+ - **references** (`list` of `str`): Reference strings serving as ground truth.
+ - **feature_model** (`str`): Sets which panphon.distance.Distance feature parsing model is used; choose from `"strict"`, `"permissive"`, `"segment"`. Defaults to `"segment"`.
+ - **is_normalize_pfer** (`bool`): Set to `True` to normalize PFER by the largest number of phones in the prediction-reference pair. Defaults to `False`. When this is used, PFER will no longer obey the triangle inequality.
 
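A minimal sketch of passing the optional arguments above to `compute` (assuming `phone_distance` was loaded as in How to Use; the returned dictionary has the same shape as in the Examples section, though the exact scores depend on the feature model chosen):

```python
import evaluate

phone_distance = evaluate.load("ginic/phone_distance")
# Use panphon's "strict" parsing model and leave PFER un-normalized
# (the defaults are feature_model="segment" and is_normalize_pfer=False).
results = phone_distance.compute(
    predictions=["bob", "ði"],
    references=["pop", "ðə"],
    feature_model="strict",
    is_normalize_pfer=False,
)
```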
+ ### Output Values
+ The computation returns a dictionary with the following keys and values:
+ - **phone_error_rates** (`list` of `float`): Phone error rate (PER) gives the edit distance in terms of phones, rather than Unicode characters, for each prediction-reference pair, since phones can consist of multiple characters. It is normalized by the number of phones in the reference string. The result will have the same length as the input prediction and reference lists.
+ - **mean_phone_error_rate** (`float`): Overall mean of PER.
+ - **phone_feature_error_rates** (`list` of `float`): Phone feature error rate (PFER) is the Levenshtein distance between strings where the distance between individual phones is computed as the Hamming distance between their phonetic features, given for each prediction-reference pair. By default it is a metric that obeys the triangle inequality, but it can also be normalized by the number of phones.
+ - **mean_phone_feature_error_rates** (`float`): Overall mean of PFER.
+ - **feature_error_rates** (`list` of `float`): Feature error rate (FER) is the edit distance in terms of articulatory features normalized by the number of phones in the reference, computed for each prediction-reference pair.
+ - **mean_feature_error_rates** (`float`): Overall mean of FER.
 
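Since the result is an ordinary dictionary, the per-pair lists and their means can be read off directly; a short sketch using the keys listed above:

```python
results = phone_distance.compute(predictions=["bob", "ði"], references=["pop", "ðə"])
print(results["mean_phone_error_rate"])  # one float summarizing PER
for pred, per, pfer in zip(["bob", "ði"],
                           results["phone_error_rates"],
                           results["phone_feature_error_rates"]):
    # One entry per prediction-reference pair, in input order.
    print(pred, per, pfer)
```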
 #### Values from Popular Papers
+ [Universal Automatic Phonetic Transcription into the International Phonetic Alphabet (Taguchi et al.)](https://www.isca-archive.org/interspeech_2023/taguchi23_interspeech.html) reported an overall PER of 0.21 and PFER of 0.057 on supervised phonetic transcription of in-domain languages, and a PER of 0.632 and PFER of 0.213 on zero-shot phonetic transcription of languages not seen in the training data. On the zero-shot languages they also reported inter-annotator scores between human annotators of PER 0.533 and PFER 0.196.
 
 ### Examples
+
+ Simplest use case, computing the distance measures between prediction and reference IPA strings:
+ ```python
+ >>> phone_distance.compute(predictions=["bob", "ði", "spin"], references=["pop", "ðə", "spʰin"])
+ {'phone_error_rates': [0.6666666666666666, 0.5, 0.25], 'mean_phone_error_rate': 0.47222222222222215, 'phone_feature_error_rates': [0.08333333333333333, 0.125, 0.041666666666666664], 'mean_phone_feature_error_rates': 0.08333333333333333, 'feature_error_rates': [0.027777777777777776, 0.0625, 0.30208333333333337], 'mean_feature_error_rates': 0.13078703703703706}
+ ```
+
+ Normalize the phone feature error rate by the largest number of phones in each prediction-reference pair:
+ ```python
+ >>> phone_distance.compute(predictions=["bob", "ði"], references=["pop", "ðə"], is_normalize_pfer=True)
+ {'phone_error_rates': [0.6666666666666666, 0.5], 'mean_phone_error_rate': 0.5833333333333333, 'phone_feature_error_rates': [0.027777777777777776, 0.0625], 'mean_phone_feature_error_rates': 0.04513888888888889, 'feature_error_rates': [0.027777777777777776, 0.0625], 'mean_feature_error_rates': 0.04513888888888889}
+ ```
+
+ Error rates may be greater than 1.0 if the reference string is shorter than the prediction string:
+ ```python
+ >>> phone_distance.compute(predictions=["bob"], references=["po"])
+ {'phone_error_rates': [1.0], 'mean_phone_error_rate': 1.0, 'phone_feature_error_rates': [1.0416666666666667], 'mean_phone_feature_error_rates': 1.0416666666666667, 'feature_error_rates': [0.020833333333333332], 'mean_feature_error_rates': 0.020833333333333332}
+ ```
+
+ Empty reference strings will cause a ValueError, so you should handle them separately:
+ ```python
+ >>> phone_distance.compute(predictions=["bob"], references=[""])
+ Traceback (most recent call last):
+ ...
+ raise ValueError("one or more references are empty strings")
+ ValueError: one or more references are empty strings
+ ```
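One way to handle them separately (a minimal sketch, assuming `phone_distance` was loaded as above and that you simply want to skip pairs with empty references rather than score them):

```python
# Drop prediction-reference pairs whose reference is empty before scoring,
# since empty references raise a ValueError.
pairs = [("bob", "pop"), ("ði", "")]  # hypothetical data
kept = [(p, r) for p, r in pairs if r.strip()]
if kept:
    predictions, references = (list(t) for t in zip(*kept))
    results = phone_distance.compute(predictions=predictions, references=references)
```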
 
 ## Limitations and Bias
+ - Phone error rate and feature error rate can be greater than 1.0 if the reference string is shorter than the prediction string.
+ - Since these are error rates, not edit distances, the reference strings cannot be empty.
 
 ## Citation
+ ```bibtex
+ @inproceedings{Mortensen-et-al:2016,
+     author = {David R. Mortensen and
+               Patrick Littell and
+               Akash Bharadwaj and
+               Kartik Goyal and
+               Chris Dyer and
+               Lori S. Levin},
+     title = {PanPhon: {A} Resource for Mapping {IPA} Segments to Articulatory Feature Vectors},
+     booktitle = {Proceedings of {COLING} 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
+     pages = {3475--3484},
+     publisher = {{ACL}},
+     year = {2016}
+ }
+ ```
 
 ## Further References
+ - PER and PFER are used as evaluation metrics in [Universal Automatic Phonetic Transcription into the International Phonetic Alphabet (Taguchi et al.)](https://www.isca-archive.org/interspeech_2023/taguchi23_interspeech.html).
+ - Pierce Darragh's blog post [Introduction to Phonology, Part 3: Phonetic Features](https://pdarragh.github.io/blog/2018/04/26/intro-to-phonology-pt-3/) gives an overview of phonetic features for speech sounds.
+ - [panphon GitHub repository](https://github.com/dmort27/panphon)
phone_distance.py CHANGED
@@ -57,10 +57,9 @@ equality, but can also be normalized by number of phones.
  Each measure is given for each prediction, reference pair along with the mean value across all pairs.

  Args:
- predictions: list of predictions to score. Each predictions
-     should be a string of unicode characters.
- references: list of reference for each prediction. Each
-     reference should be a string with of unicode characters.
  is_normalize_pfer: bool, set to True to normalize PFER by the largest number of phones in the prediction, reference pair
  Returns:
      phone_error_rates: list of floats giving PER for each prediction, reference pair
@@ -137,12 +136,12 @@ class PhoneDistance(evaluate.Metric):
  reference_urls=["https://pypi.org/project/panphon/", "https://arxiv.org/abs/2308.03917"]
  )

- def _compute(self, predictions:list[str]|None=None, references:list[str]|None=None, feature_model:str="segment", is_normalize_pfer:bool=False):
  """Computes phoneme error rates, phone feature error rate (Hamming feature edit distance) and feature error rates between prediction and reference strings

  Args:
- predictions (list[str], optional): Predicted transcriptions. Defaults to None.
- references (list[str], optional): Reference transcriptions. Defaults to None.
  feature_model (str, optional): panphon.distance.Distance feature parsing model to be used, choose from "strict", "permissive", "segment". Defaults to "segment".
  is_normalize_pfer (bool, optional): Set to true to normalize phone feature error rates by maximum length (measure won't be a true metric). Defaults to False.
  Each measure is given for each prediction, reference pair along with the mean value across all pairs.

  Args:
+ predictions: list of predictions to score. Each prediction should be a string of Unicode characters.
+ references: list of references, one for each prediction. Each reference should be a string of Unicode characters.
+ feature_model: string to set which panphon.distance.Distance feature parsing model is used, choose from "strict", "permissive", "segment". Defaults to "segment".
  is_normalize_pfer: bool, set to True to normalize PFER by the largest number of phones in the prediction, reference pair
  Returns:
      phone_error_rates: list of floats giving PER for each prediction, reference pair
  reference_urls=["https://pypi.org/project/panphon/", "https://arxiv.org/abs/2308.03917"]
  )

+ def _compute(self, predictions:list[str], references:list[str], feature_model:str="segment", is_normalize_pfer:bool=False):
  """Computes phoneme error rates, phone feature error rate (Hamming feature edit distance) and feature error rates between prediction and reference strings

  Args:
+ predictions (list[str]): Predicted transcriptions.
+ references (list[str]): Reference transcriptions.
  feature_model (str, optional): panphon.distance.Distance feature parsing model to be used, choose from "strict", "permissive", "segment". Defaults to "segment".
  is_normalize_pfer (bool, optional): Set to true to normalize phone feature error rates by maximum length (measure won't be a true metric). Defaults to False.