|
--- |
|
license: cc-by-nc-4.0 |
|
language: |
|
- bo |
|
base_model: google-t5/t5-small |
|
tags: |
|
- nlp |
|
- transliteration |
|
- tibetan |
|
- buddhism |
|
datasets: |
|
- billingsmoore/tibetan-phonetic-transliteration-dataset |
|
--- |
|
# Model Card for tibetan-phonetic-transliteration |
|
|
|
This model is a text2text generation model for phonetic transliteration of Tibetan script. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model takes a line of unicode Tibetan script as input and produces its phonetic transliteration. It was created by finetuning [google-t5/t5-small](https://huggingface.co/google-t5/t5-small) on 98,597 transliteration pairs scraped from Lotsawa House (see the Training Details section below).

|
- **Developed by:** billingsmoore |
|
- **Model type:** text2text generation |
|
- **Language(s) (NLP):** Tibetan |
|
- **License:** [Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)

- **Finetuned from model:** [google-t5/t5-small](https://huggingface.co/google-t5/t5-small)
|
|
|
### Model Sources |
|
|
|
- **Repository:** [https://github.com/billingsmoore/MLotsawa](https://github.com/billingsmoore/MLotsawa) |
|
|
|
## Uses |
|
|
|
The intended use of this model is to provide phonetic transliteration of Tibetan script, typically as part of a larger Tibetan translation ecosystem. |
|
|
|
### Direct Use |
|
|
|
To use the model for transliteration in a Python script, you can use the `transformers` library like so:
|
|
|
```python
from transformers import pipeline

transliterator = pipeline('translation', model='billingsmoore/tibetan-phonetic-transliteration')

transliterated_text = transliterator(<string of unicode Tibetan script>)
```
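
The `translation` pipeline returns a list with one dictionary per input, and the output text itself sits under the `translation_text` key. A minimal usage sketch (the example input here is the mani mantra in unicode Tibetan script; any unicode Tibetan string works):

```python
# Example input: the mani mantra in unicode Tibetan script
result = transliterator('ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ།')

# The pipeline returns a list of dicts; the transliteration is under 'translation_text'
print(result[0]['translation_text'])
```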
|
|
|
### Downstream Use |
|
|
|
The model can be finetuned for a specific use case using the following code. |
|
|
|
```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    Adafactor,
)
from accelerate import Accelerator

# Load your dataset and hold out 10% of it for evaluation
dataset = load_dataset(<your dataset>)
dataset = dataset['train'].train_test_split(test_size=0.1)

checkpoint = "billingsmoore/tibetan-phonetic-transliteration"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")
# Pass the model object so the collator can prepare decoder input ids
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

source_lang = 'bo'    # unicode Tibetan script
target_lang = 'phon'  # phonetic transliteration

def preprocess_function(examples):
    inputs = [example for example in examples[source_lang]]
    targets = [example for example in examples[target_lang]]

    model_inputs = tokenizer(inputs, text_target=targets, max_length=256, truncation=True, padding="max_length")

    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    num_train_epochs=5
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator
)

trainer.train()
```
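
After training completes, you will likely want to save the finetuned model and tokenizer so they can be loaded with the same `pipeline` call shown under Direct Use. A minimal sketch, assuming an arbitrary local directory name:

```python
# Save the finetuned weights and tokenizer to a local directory (name is arbitrary)
save_dir = './tibetan-transliteration-finetuned'
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

# The saved directory can then stand in for the hub checkpoint:
# transliterator = pipeline('translation', model=save_dir)
```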
|
|
|
## Bias, Risks, and Limitations |
|
|
|
This model was trained exclusively on material from the Tibetan Buddhist canon and thus on Literary Tibetan.
It may not perform satisfactorily on texts from other corpora or in other dialects of Tibetan.
|
|
|
### Recommendations |
|
|
|
If you wish to use the model on other kinds of texts, I recommend further finetuning it on your own dataset using the instructions above.
|
|
|
## Training Details |
|
|
|
This model was trained on 98,597 pairs of text, where the first member of each pair is a line of unicode Tibetan script and the second (the target) is the phonetic transliteration of the first.
This dataset was scraped from Lotsawa House and is released on Kaggle and Hugging Face under the same license as the texts from which it is sourced.
|
[You can find this dataset and more information on Kaggle by clicking here.](https://www.kaggle.com/datasets/billingsmoore/tibetan-phonetic-transliteration-pairs) |
|
[You can find this dataset and more information on Huggingface by clicking here.](https://huggingface.co/datasets/billingsmoore/tibetan-phonetic-transliteration-dataset) |
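
To inspect the data before finetuning or evaluation, you can load the Hugging Face version and look at a pair. A minimal sketch, assuming the default `train` split and the same `bo`/`phon` columns used in the finetuning example above:

```python
from datasets import load_dataset

# Load the transliteration pairs from the Hugging Face Hub
dataset = load_dataset('billingsmoore/tibetan-phonetic-transliteration-dataset')

# Each example pairs a line of unicode Tibetan script ('bo')
# with its phonetic transliteration ('phon')
print(dataset['train'][0])
```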
|
|
|
This model was trained for five epochs. Further information regarding training can be found in the documentation of the [MLotsawa repository](https://github.com/billingsmoore/MLotsawa). |
|
|
|
## Model Card Contact |
|
|
|
billingsmoore [at] gmail [dot] com |