billingsmoore committed
Commit 7045d51 · verified · 1 Parent(s): 667f0e8

Update README.md

Files changed (1)
  1. README.md +137 -3
README.md CHANGED
@@ -1,3 +1,137 @@
- ---
- license: cc-by-nc-4.0
- ---
---
license: cc-by-nc-4.0
language:
- bo
base_model: google-t5/t5-small
tags:
- nlp
- transliteration
- tibetan
- buddhism
---

# Model Card for tibetan-phonetic-transliteration

This model is a text2text generation model for phonetic transliteration of Tibetan script.

## Model Details

### Model Description

- **Developed by:** billingsmoore
- **Model type:** text2text generation
- **Language(s) (NLP):** Tibetan
- **License:** [Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)
- **Finetuned from model:** [google-t5/t5-small](https://huggingface.co/google-t5/t5-small)

### Model Sources

- **Repository:** [https://github.com/billingsmoore/MLotsawa](https://github.com/billingsmoore/MLotsawa)

## Uses

The intended use of this model is to provide phonetic transliteration of Tibetan script, typically as part of a larger Tibetan translation ecosystem.
37
+
38
+ ### Direct Use
39
+
40
+ To use the model for transliteration in a python script, you can use the transformers library like so:
41
+
42
+ ```python
43
+ from transformers import pipeline
44
+
45
+ transliterator = pipeline('translation',model='billingsmoore/tibetan-phonetic-transliteration')
46
+
47
+ transliterated_text = transliterator(<string of unicode Tibetan script>)
48
+
49
+ ```
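
Note that the translation pipeline returns a list of dictionaries rather than a bare string; the transliteration itself is stored under the `translation_text` key. For example:

```python
# The pipeline output looks like [{'translation_text': '...'}]
results = transliterator("<string of unicode Tibetan script>")
print(results[0]['translation_text'])
```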

### Downstream Use

The model can be finetuned for a specific use case using the following code.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, Adafactor
from accelerate import Accelerator

# Load your dataset and hold out 10% of it for evaluation
dataset = load_dataset("<your dataset>")
dataset = dataset['train'].train_test_split(.1)

# Load the tokenizer, model, and data collator from this checkpoint
checkpoint = "billingsmoore/tibetan-phonetic-transliteration"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Column names: 'bo' holds the Tibetan script, 'phon' holds the phonetic target
source_lang = 'bo'
target_lang = 'phon'

def preprocess_function(examples):
    # Tokenize the Tibetan inputs and phonetic targets together
    inputs = [example for example in examples[source_lang]]
    targets = [example for example in examples[target_lang]]

    model_inputs = tokenizer(inputs, text_target=targets, max_length=256, truncation=True, padding="max_length")

    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Adafactor optimizer with a fixed learning rate
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    num_train_epochs=5
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator
)

trainer.train()
```
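
After training, the finetuned weights can be saved and reloaded like any other transformers checkpoint. A minimal sketch (the output directory name below is just an example):

```python
# Save the finetuned model and tokenizer to a local directory (name is arbitrary)
trainer.save_model("my-transliteration-model")
tokenizer.save_pretrained("my-transliteration-model")

# Reload it with the same pipeline shown in the Direct Use section
from transformers import pipeline
transliterator = pipeline('translation', model="my-transliteration-model")
```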

## Bias, Risks, and Limitations

This model was trained exclusively on material from the Tibetan Buddhist canon and thus on Literary Tibetan.
It may not perform satisfactorily on texts from other corpora or on other dialects of Tibetan.

### Recommendations

If you wish to use the model on other kinds of texts, I recommend further finetuning it on your own dataset using the instructions above.

## Training Details

This model was trained on 98,597 pairs of text, in which the first member of each pair is a line of Unicode Tibetan script and the second (the target) is the phonetic transliteration of the first.
This dataset was scraped from Lotsawa House and is released on Kaggle under the same license as the texts from which it is sourced.
[You can find this dataset and more information by clicking here.](https://www.kaggle.com/datasets/billingsmoore/tibetan-phonetic-transliteration-pairs)

This model was trained for five epochs. Further information regarding training can be found in the documentation of the [MLotsawa repository](https://github.com/billingsmoore/MLotsawa).
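
If you assemble a similar dataset of your own, the finetuning code in the Downstream Use section expects a `datasets` object with a 'bo' column (Tibetan script) and a 'phon' column (phonetic target). A minimal sketch, assuming the pairs are stored in a local CSV file (the filename and column names here are illustrative assumptions, not part of the released dataset):

```python
from datasets import load_dataset

# Hypothetical CSV with one pair per row and columns named 'bo' and 'phon'
dataset = load_dataset("csv", data_files="transliteration_pairs.csv")
dataset = dataset['train'].train_test_split(.1)
```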

## Model Card Contact

billingsmoore [at] gmail [dot] com