init commit
- .gitattributes +3 -0
- README.md +118 -0
- rnn_metric_learning_panphon_all.pt +3 -0
- rnn_metric_learning_token_ipa_all.pt +3 -0
- rnn_metric_learning_token_ort_all.pt +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+rnn_metric_learning_token_ipa_all.pt filter=lfs diff=lfs merge=lfs -text
+rnn_metric_learning_token_ort_all.pt filter=lfs diff=lfs merge=lfs -text
+rnn_metric_learning_panphon_all.pt filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,118 @@
---
pipeline_tag: other
language:
- multilingual
- en
- de
- am
- fr
- bn
- uz
- pl
- es
- sw
license: apache-2.0
---

# PWESuite-metric_learner

This is a phonetic word embedding model based on PWESuite, as described in [PWESuite: Phonetic Word Embeddings and Tasks They Facilitate](https://aclanthology.org/2024.lrec-main.1168/).
The metric learner is trained so that distances in the embedding space mimic Panphon's phonetic distances.
The input representation is based on orthography (token_ort), IPA (token_ipa), or Panphon pronunciation vectors (panphon), which yields three models.
These models have been trained on all languages jointly.

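To make the objective concrete, here is a minimal sketch of matching embedding distances to phonetic distances. It is not the PWESuite training code: the Euclidean distance, the squared-error loss, the function names, and the toy numbers are all assumptions chosen for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matching_loss(embeddings, target_distances):
    """Mean squared gap between embedding distances and target distances.

    embeddings: list of vectors, one per word.
    target_distances: dict mapping index pairs (i, j) to the phonetic
    distance the embedding space should reproduce (hypothetical values).
    """
    gaps = [
        (euclidean(embeddings[i], embeddings[j]) - d) ** 2
        for (i, j), d in target_distances.items()
    ]
    return sum(gaps) / len(gaps)


# Toy 2-dimensional embeddings for three words (illustrative numbers only).
toy_embeddings = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
# Hypothetical target phonetic distances for two word pairs.
toy_targets = {(0, 1): 5.0, (0, 2): 2.0}

loss = distance_matching_loss(toy_embeddings, toy_targets)
print(f"loss = {loss:.2f}")
```

The first pair is already at its target distance and contributes nothing; the second pair is too close in the embedding space, so it alone drives the loss.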
## Instructions

To run any of the three metric learner models, first set up the repository and download the checkpoints:
```bash
git clone https://github.com/zouharvi/pwesuite.git
cd pwesuite
mkdir -p computed/models
pip3 install -e .

# download the three models
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ort_all.pt -P computed/models/
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ipa_all.pt -P computed/models/
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_panphon_all.pt -P computed/models/
```

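If `wget` is not available, the same download can be sketched in Python with the standard library only. The helper names (`checkpoint_name`, `checkpoint_url`, `download_all`) are hypothetical; the URL pattern and file names are the ones used in this commit.

```python
from pathlib import Path
from urllib.request import urlretrieve

REPO_URL = "https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main"
FEATURES = ("token_ort", "token_ipa", "panphon")


def checkpoint_name(features):
    """File name of the checkpoint for one feature type."""
    return f"rnn_metric_learning_{features}_all.pt"


def checkpoint_url(features):
    """Direct download URL of the checkpoint for one feature type."""
    return f"{REPO_URL}/{checkpoint_name(features)}"


def download_all(target_dir="computed/models"):
    """Download the three checkpoints into target_dir (needs network access)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for features in FEATURES:
        urlretrieve(checkpoint_url(features), str(target / checkpoint_name(features)))


# Only print the URLs here; call download_all() to actually fetch the files.
for features in FEATURES:
    print(checkpoint_url(features))
```

Calling `download_all()` from the `pwesuite` directory should place the files where the later commands expect them.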
Then, in Python, you can run [this example script](https://github.com/zouharvi/pwesuite/blob/master/scripts/50-use_metric_learner.py):
```python
from models.metric_learning.model import RNNMetricLearner
from models.metric_learning.preprocessor import preprocess_dataset_foreign
from main.utils import load_multi_data
import torch
import tqdm
import math

# load the multilingual data and featurize the first 10 entries as IPA tokens
data = load_multi_data(purpose_key="all")
data = preprocess_dataset_foreign(data[:10], features="token_ipa")

model = RNNMetricLearner(
    dimension=300,
    feature_size=data[0][0].shape[1],
)
model.load_state_dict(torch.load("computed/models/rnn_metric_learning_token_ipa_all.pt"))

# some cheap parallelization
BATCH_SIZE = 32
data_out = []
for i in tqdm.tqdm(range(math.ceil(len(data) / BATCH_SIZE))):
    batch = [f for f, _ in data[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]]
    data_out += list(
        model.forward(batch).detach().cpu().numpy()
    )

# one 300-dimensional embedding per input entry
assert len(data) == len(data_out)
assert all([len(x) == 300 for x in data_out])
```

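Once `data_out` holds one vector per word, a natural next step is to look up phonetically similar words. The sketch below ranks neighbours by cosine similarity; that choice, the helper names, and the toy vectors standing in for real embeddings are assumptions, not something this repository prescribes.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query_index, vectors, k=2):
    """Indices of the k vectors most similar to vectors[query_index]."""
    scored = [
        (cosine_similarity(vectors[query_index], vec), idx)
        for idx, vec in enumerate(vectors)
        if idx != query_index
    ]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 3-dimensional vectors standing in for the 300-dimensional embeddings.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
]
neighbors = nearest_neighbors(0, toy_vectors, k=2)
print(neighbors)
```

With real embeddings, `nearest_neighbors(i, data_out)` would return positions in `data_out`, which map back to the input entries in the same order.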
You can also run inference on all the data and evaluate the resulting embeddings:
```bash
mkdir -p computed/embd/
python3 ./models/metric_learning/apply.py -l all -mp computed/models/rnn_metric_learning_token_ipa_all.pt -o computed/embd/rnn_metric_learning_token_ipa_all.pkl --features token_ipa
python3 ./suite_evaluation/eval_all.py --embd computed/embd/rnn_metric_learning_token_ipa_all.pkl
```

This produces output like:
```
human_similarity: 0.6054
correlation: 0.8995
retrieval: 0.9158
analogy: 0.1128
rhyme: 0.6375
cognate: 0.6513
JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033}
Score (overall): 0.6370
```

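The `JSON1!` line carries the full-precision scores, which makes it convenient to collect results from several runs. The following sketch parses it from captured output; the `parse_scores` helper is hypothetical, and it assumes the prefix appears exactly as in the sample above.

```python
import json

PREFIX = "JSON1!"


def parse_scores(output_text):
    """Return the score dict from the first line starting with the JSON1! prefix."""
    for line in output_text.splitlines():
        if line.startswith(PREFIX):
            return json.loads(line[len(PREFIX):])
    raise ValueError("no JSON1! line found")


# Tail of the sample evaluation output shown above.
sample_output = """cognate: 0.6513
JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033}
Score (overall): 0.6370"""

scores = parse_scores(sample_output)
# Task with the highest score, ignoring the aggregate entry.
best_task = max((name for name in scores if name != "overall"), key=scores.get)
print(best_task, round(scores["overall"], 4))
```

The sketch reads the `overall` value as reported and does not try to recompute it from the per-task scores.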
## Training

Training this model takes about an hour on a mid-tier GPU.
See [scripts/03-train_metric_learning.sh](https://github.com/zouharvi/pwesuite/blob/master/scripts/03-train_metric_learning.sh) for the specific training command.
Further description TODO.

## Other

Cite as:
```
@inproceedings{zouhar-etal-2024-pwesuite,
    title = "{PWES}uite: Phonetic Word Embeddings and Tasks They Facilitate",
    author = "Zouhar, Vil{\'e}m and
      Chang, Kalvin and
      Cui, Chenxuan and
      Carlson, Nate B. and
      Robinson, Nathaniel Romney and
      Sachan, Mrinmaya and
      Mortensen, David R.",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1168/",
    pages = "13344--13355",
}
```
rnn_metric_learning_panphon_all.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6d837807141508a59563264e866bd6727f32ff5ec26309eb10bd29d94e301b2c
size 3017300
rnn_metric_learning_token_ipa_all.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9b6ebdd2c5b994ee7c9f258da410099f78ed16163ee7707cc9676e7da9ae93b
size 3804510
rnn_metric_learning_token_ort_all.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e747f19ddfe55f445bffd36a00938c01195ea1558d50f399c1499dc4e9ab6621
size 5748510
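The Git LFS pointers above record the SHA-256 digest and byte size of each checkpoint, so a downloaded file can be checked against them. The sketch below shows one way to do that; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the pointer text is copied from the panphon entry in this commit.

```python
import hashlib


def parse_lfs_pointer(pointer_text):
    """Parse the 'key value' lines of a Git LFS pointer into oid and size."""
    fields = dict(line.split(" ", 1) for line in pointer_text.strip().splitlines())
    return {
        "oid": fields["oid"].split(":", 1)[1],  # drop the "sha256:" prefix
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer_text):
    """True if the file at path has the size and SHA-256 digest of the pointer."""
    expected = parse_lfs_pointer(pointer_text)
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large checkpoints need not fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return size == expected["size"] and digest.hexdigest() == expected["oid"]


PANPHON_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:6d837807141508a59563264e866bd6727f32ff5ec26309eb10bd29d94e301b2c
size 3017300
"""

pointer = parse_lfs_pointer(PANPHON_POINTER)
print(pointer["size"])
```

After downloading, `matches_pointer("computed/models/rnn_metric_learning_panphon_all.pt", PANPHON_POINTER)` should return `True` for an intact file.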