Commit 2b5effd by zouharvi · 1 Parent(s): 86964fc

init commit

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ rnn_metric_learning_token_ipa_all.pt filter=lfs diff=lfs merge=lfs -text
+ rnn_metric_learning_token_ort_all.pt filter=lfs diff=lfs merge=lfs -text
+ rnn_metric_learning_panphon_all.pt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,118 @@
+ ---
+ pipeline_tag: other
+ language:
+ - multilingual
+ - en
+ - de
+ - am
+ - fr
+ - bn
+ - uz
+ - pl
+ - es
+ - sw
+ license: apache-2.0
+ ---
+
+ # PWESuite-metric_learner
+
+ This is a phonetic word embedding model based on PWESuite, as described in [PWESuite: Phonetic Word Embeddings and Tasks They Facilitate](https://aclanthology.org/2024.lrec-main.1168/).
+ The metric learner is trained so that distances in the embedding vector space mimic Panphon's phonetic distances.
+ The input representation is based on either orthography (token_ort), IPA (token_ipa), or Panphon pronunciation vectors (panphon), which yields three models.
+ These models have been trained on all languages jointly.
+
+ ## Instructions
+
+ To run any of the three metric learner models, first set up the repository and download the checkpoints:
+ ```bash
+ git clone https://github.com/zouharvi/pwesuite.git
+ cd pwesuite
+ mkdir -p computed/models
+ pip3 install -e .
+
+ # download the three models into computed/models/ (-P sets the target directory)
+ wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ort_all.pt -P computed/models/
+ wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ipa_all.pt -P computed/models/
+ wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_panphon_all.pt -P computed/models/
+ ```
+
+ Then, in Python, you can run [this example script](https://github.com/zouharvi/pwesuite/blob/master/scripts/50-use_metric_learner.py):
+ ```python
+ from models.metric_learning.model import RNNMetricLearner
+ from models.metric_learning.preprocessor import preprocess_dataset_foreign
+ from main.utils import load_multi_data
+ import torch
+ import tqdm
+ import math
+
+ # load the data and featurize the first 10 entries as IPA tokens
+ data = load_multi_data(purpose_key="all")
+ data = preprocess_dataset_foreign(data[:10], features="token_ipa")
+
+ # the feature size is read off the first preprocessed example
+ model = RNNMetricLearner(
+     dimension=300,
+     feature_size=data[0][0].shape[1],
+ )
+ model.load_state_dict(torch.load("computed/models/rnn_metric_learning_token_ipa_all.pt"))
+
+ # some cheap parallelization: embed the data in batches
+ BATCH_SIZE = 32
+ data_out = []
+ for i in tqdm.tqdm(range(math.ceil(len(data) / BATCH_SIZE))):
+     batch = [f for f, _ in data[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]]
+     data_out += list(
+         model.forward(batch).detach().cpu().numpy()
+     )
+
+ # one 300-dimensional embedding per input word
+ assert len(data) == len(data_out)
+ assert all([len(x) == 300 for x in data_out])
+ ```
+
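Each entry of `data_out` above is one 300-dimensional vector, and the model is trained so that distances between such vectors track phonetic distances. Which distance the training objective uses is not stated here, so the following minimal sketch shows two common choices. The helper names `l2_distance` and `cosine_similarity` and the toy vectors are illustrative assumptions and not part of PWESuite:

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.dist(a, b)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for two rows of data_out (real embeddings have 300 dimensions).
emb_a = [1.0, 0.0, 0.0]
emb_b = [0.0, 1.0, 0.0]

print(l2_distance(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_b))
```

Smaller Euclidean distance, or larger cosine similarity, would then indicate words that the model places closer together.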
+ You can also run inference on all the data and evaluate the resulting embeddings:
+ ```bash
+ mkdir -p computed/embd/
+ python3 ./models/metric_learning/apply.py -l all -mp computed/models/rnn_metric_learning_token_ipa_all.pt -o computed/embd/rnn_metric_learning_token_ipa_all.pkl --features token_ipa
+ python3 ./suite_evaluation/eval_all.py --embd computed/embd/rnn_metric_learning_token_ipa_all.pkl
+ ```
+
+ This produces output like:
+ ```
+ human_similarity: 0.6054
+ correlation: 0.8995
+ retrieval: 0.9158
+ analogy: 0.1128
+ rhyme: 0.6375
+ cognate: 0.6513
+ JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033}
+ Score (overall): 0.6370
+ ```
+
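The `JSON1!` line in the sample output is convenient for scripting. The sketch below parses that exact line and checks how the overall score relates to the six task scores. In this sample it equals their unweighted mean; that this holds in general is an assumption drawn from these numbers, not from the evaluation script itself:

```python
import json

# The machine-readable line copied from the sample output above.
line = (
    'JSON1!{"human_similarity": 0.6053864496119294, '
    '"correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, '
    '"analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, '
    '"cognate": 0.6512651265126512, "overall": 0.6370356233370033}'
)

# Drop the "JSON1!" prefix and decode the remainder as JSON.
scores = json.loads(line.split("!", 1)[1])

# Average every task score except the reported overall value.
tasks = [name for name in scores if name != "overall"]
mean = sum(scores[name] for name in tasks) / len(tasks)

print(f"{mean:.4f}")  # 0.6370
```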
+ ## Training
+
+ Training this model takes about an hour on a mid-tier GPU.
+ See [scripts/03-train_metric_learning.sh](https://github.com/zouharvi/pwesuite/blob/master/scripts/03-train_metric_learning.sh) for the specific training command.
+ Further description TODO.
+
+ ## Other
+
+ Cite as:
+ ```
+ @inproceedings{zouhar-etal-2024-pwesuite,
+     title = "{PWES}uite: Phonetic Word Embeddings and Tasks They Facilitate",
+     author = "Zouhar, Vil{\'e}m and
+       Chang, Kalvin and
+       Cui, Chenxuan and
+       Carlson, Nate B. and
+       Robinson, Nathaniel Romney and
+       Sachan, Mrinmaya and
+       Mortensen, David R.",
+     booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
+     month = may,
+     year = "2024",
+     address = "Torino, Italia",
+     publisher = "ELRA and ICCL",
+     url = "https://aclanthology.org/2024.lrec-main.1168/",
+     pages = "13344--13355",
+ }
+ ```
rnn_metric_learning_panphon_all.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6d837807141508a59563264e866bd6727f32ff5ec26309eb10bd29d94e301b2c
+ size 3017300
rnn_metric_learning_token_ipa_all.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c9b6ebdd2c5b994ee7c9f258da410099f78ed16163ee7707cc9676e7da9ae93b
+ size 3804510
rnn_metric_learning_token_ort_all.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e747f19ddfe55f445bffd36a00938c01195ea1558d50f399c1499dc4e9ab6621
+ size 5748510
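The three `.pt` entries above are Git LFS pointer files: each records the SHA-256 digest and byte size of the real checkpoint. A downloaded checkpoint can therefore be checked against its pointer. The following is a minimal sketch using only the standard library; the function names `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, not part of Git LFS or PWESuite:

```python
import hashlib
import os

def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

def matches_pointer(path, pointer):
    """Return True if the file at `path` has the size and digest in `pointer`."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large checkpoints are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]

# Pointer contents for rnn_metric_learning_panphon_all.pt, as listed above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:6d837807141508a59563264e866bd6727f32ff5ec26309eb10bd29d94e301b2c\n"
    "size 3017300\n"
)
print(pointer["algo"], pointer["size"])
```

After downloading, `matches_pointer("computed/models/rnn_metric_learning_panphon_all.pt", pointer)` would report whether the local file is the one this commit refers to.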