init commit

Files changed (5)

.gitattributes +3 -0
README.md +118 -0
rnn_metric_learning_panphon_all.pt +3 -0
rnn_metric_learning_token_ipa_all.pt +3 -0
rnn_metric_learning_token_ort_all.pt +3 -0

.gitattributes CHANGED

@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+rnn_metric_learning_token_ipa_all.pt filter=lfs diff=lfs merge=lfs -text
+rnn_metric_learning_token_ort_all.pt filter=lfs diff=lfs merge=lfs -text
+rnn_metric_learning_panphon_all.pt filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,118 @@

+---
+pipeline_tag: other
+language:
+- multilingual
+- en
+- de
+- am
+- fr
+- bn
+- uz
+- pl
+- es
+- sw
+license: apache-2.0
+---
+# PWESuite-metric_learner
+This is a phonetic word embedding model based on PWESuite, as described in [PWESuite: Phonetic Word Embeddings and Tasks They Facilitate](https://aclanthology.org/2024.lrec-main.1168/).
+The metric learner is trained so that distances between embeddings in the vector space mimic Panphon's phonetic distances.
+The representation is either based on orthography (token_ort), IPA (token_ipa), or Panphon pronunciation vectors (panphon), which yields three models.
+These models have been trained on all languages jointly.
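The idea of matching embedding distances to phonetic distances can be sketched as a regression objective. The following is a minimal illustration, not PWESuite's training code: the Euclidean distance, the squared-error loss, the function names, and the toy vectors and target distances are all assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def metric_loss(embeddings, pairs):
    """Mean squared gap between embedding distance and target distance.

    `pairs` holds (i, j, target) triples, where `target` stands in for a
    phonetic distance between words i and j (hypothetical values here).
    """
    gaps = [
        (euclidean(embeddings[i], embeddings[j]) - target) ** 2
        for i, j, target in pairs
    ]
    return sum(gaps) / len(gaps)


# Toy 2-d embeddings for three words (made-up numbers).
embeddings = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
# Hypothetical target distances for two word pairs.
pairs = [(0, 1, 5.0), (0, 2, 2.0)]

loss = metric_loss(embeddings, pairs)
print(loss)
```

A loss of zero would mean the embedding distances reproduce the target distances exactly for the listed pairs; training would adjust the embeddings to push the loss down.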
+## Instructions
+To use any of the three metric learner models, first set up the repository and download the checkpoints:
+```bash
+git clone https://github.com/zouharvi/pwesuite.git
+cd pwesuite
+mkdir -p computed/models
+pip3 install -e .
+# download the three models
+wget -P computed/models/ https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ort_all.pt
+wget -P computed/models/ https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ipa_all.pt
+wget -P computed/models/ https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_panphon_all.pt
+```
+Then, in Python, you can run [this example script](https://github.com/zouharvi/pwesuite/blob/master/scripts/50-use_metric_learner.py):
+```python
+from models.metric_learning.model import RNNMetricLearner
+from models.metric_learning.preprocessor import preprocess_dataset_foreign
+from main.utils import load_multi_data
+import torch
+import tqdm
+import math
+data = load_multi_data(purpose_key="all")
+data = preprocess_dataset_foreign(data[:10], features="token_ipa")
+model = RNNMetricLearner(
+    dimension=300,
+    feature_size=data[0][0].shape[1],
+)
+model.load_state_dict(torch.load("computed/models/rnn_metric_learning_token_ipa_all.pt"))
+# some cheap parallelization: embed the data in batches
+BATCH_SIZE = 32
+data_out = []
+for i in tqdm.tqdm(range(math.ceil(len(data) / BATCH_SIZE))):
+    batch = [f for f, _ in data[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]]
+    data_out += list(
+        model.forward(batch).detach().cpu().numpy()
+    )
+assert len(data) == len(data_out)
+assert all([len(x) == 300 for x in data_out])
+```
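Once the embeddings are computed, they can be compared directly. The sketch below is only an illustration with toy 3-dimensional vectors standing in for the 300-dimensional rows of `data_out`; the choice of cosine similarity and the helper names are assumptions, and another distance may suit these embeddings better.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query_index, vectors):
    """Index of the vector most similar to vectors[query_index], excluding itself."""
    scores = [
        (cosine_similarity(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]


# Made-up vectors; in practice these would be the rows of data_out.
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
neighbor = nearest(0, toy)
print(neighbor)
```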
+You can also run the inference on all the data and evaluate it:
+```bash
+mkdir -p computed/embd/
+python3 ./models/metric_learning/apply.py -l all -mp computed/models/rnn_metric_learning_token_ipa_all.pt -o computed/embd/rnn_metric_learning_token_ipa_all.pkl --features token_ipa
+python3 ./suite_evaluation/eval_all.py --embd computed/embd/rnn_metric_learning_token_ipa_all.pkl
+```
+This produces output like:
+```
+human_similarity: 0.6054
+correlation: 0.8995
+retrieval: 0.9158
+analogy: 0.1128
+rhyme: 0.6375
+cognate: 0.6513
+JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033}
+Score (overall): 0.6370
+```
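The `JSON1!` line carries the same scores in machine-readable form. A small sketch of reading it, assuming the prefix is exactly `JSON1!` and is followed by a single JSON object, as in the sample above. In this sample, the overall score matches the plain mean of the six task scores; whether that holds for every run is not confirmed here.

```python
import json

# The JSON1! line copied from the sample output above.
line = (
    'JSON1!{"human_similarity": 0.6053864496119294, '
    '"correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, '
    '"analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, '
    '"cognate": 0.6512651265126512, "overall": 0.6370356233370033}'
)

PREFIX = "JSON1!"  # assumed to be a fixed marker before the JSON payload
scores = json.loads(line[len(PREFIX):])

# Separate the aggregate from the per-task scores and recompute the mean.
overall = scores.pop("overall")
mean = sum(scores.values()) / len(scores)
print(f"reported overall: {overall:.4f}, mean of tasks: {mean:.4f}")
```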
+## Training
+Training this model takes about an hour on a mid-tier GPU.
+See [scripts/03-train_metric_learning.sh](https://github.com/zouharvi/pwesuite/blob/master/scripts/03-train_metric_learning.sh) for the specific training command.
+Further description TODO.
+## Other
+Cite as:
+```
+@inproceedings{zouhar-etal-2024-pwesuite,
+    title = "{PWES}uite: Phonetic Word Embeddings and Tasks They Facilitate",
+    author = "Zouhar, Vil{\'e}m  and
+      Chang, Kalvin  and
+      Cui, Chenxuan  and
+      Carlson, Nate B.  and
+      Robinson, Nathaniel Romney  and
+      Sachan, Mrinmaya  and
+      Mortensen, David R.",
+    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
+    month = may,
+    year = "2024",
+    address = "Torino, Italia",
+    publisher = "ELRA and ICCL",
+    url = "https://aclanthology.org/2024.lrec-main.1168/",
+    pages = "13344--13355",
+}
+```

rnn_metric_learning_panphon_all.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6d837807141508a59563264e866bd6727f32ff5ec26309eb10bd29d94e301b2c
+size 3017300

rnn_metric_learning_token_ipa_all.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c9b6ebdd2c5b994ee7c9f258da410099f78ed16163ee7707cc9676e7da9ae93b
+size 3804510

rnn_metric_learning_token_ort_all.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e747f19ddfe55f445bffd36a00938c01195ea1558d50f399c1499dc4e9ab6621
+size 5748510
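
In a Git LFS pointer, the `oid sha256:` value is the SHA-256 digest of the actual file contents, so a downloaded checkpoint can be checked against the pointers above. A minimal sketch, assuming the files were saved under `computed/models/`; the helper names `sha256_of` and `verify` are made up for this example.

```python
import hashlib
from pathlib import Path

# Expected digests, copied from the LFS pointers in this commit.
EXPECTED = {
    "rnn_metric_learning_panphon_all.pt":
        "6d837807141508a59563264e866bd6727f32ff5ec26309eb10bd29d94e301b2c",
    "rnn_metric_learning_token_ipa_all.pt":
        "c9b6ebdd2c5b994ee7c9f258da410099f78ed16163ee7707cc9676e7da9ae93b",
    "rnn_metric_learning_token_ort_all.pt":
        "e747f19ddfe55f445bffd36a00938c01195ea1558d50f399c1499dc4e9ab6621",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory):
    """Map each checkpoint name to True/False, or None if the file is missing."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = (sha256_of(path) == expected) if path.exists() else None
    return results


print(verify("computed/models"))
```

A `False` entry would suggest a truncated or corrupted download, and `None` that the file was not found in the given directory.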