|
--- |
|
license: cc-by-nc-4.0 |
|
language: |
|
- bo |
|
base_model: google-t5/t5-small |
|
tags: |
|
- nlp |
|
- transliteration |
|
- tibetan |
|
- buddhism |
|
datasets: |
|
- billingsmoore/tibetan-phonetic-transliteration-dataset |
|
--- |
|
# Model Card for tibetan-phonetic-transliteration |
|
|
|
This model is a text2text generation model for phonetic transliteration of Tibetan script. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model takes a line of unicode Tibetan script as input and produces its phonetic transliteration. It was created by finetuning [google-t5/t5-small](https://huggingface.co/google-t5/t5-small) on 98,597 transliteration pairs scraped from Lotsawa House (see the Training Details section below).

|
- **Developed by:** billingsmoore |
|
- **Model type:** text2text generation |
|
- **Language(s) (NLP):** Tibetan |
|
- **License:** [Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)

- **Finetuned from model:** [google-t5/t5-small](https://huggingface.co/google-t5/t5-small)
|
|
|
### Model Sources |
|
|
|
- **Repository:** [https://github.com/billingsmoore/MLotsawa](https://github.com/billingsmoore/MLotsawa) |
|
|
|
## Uses |
|
|
|
The intended use of this model is to provide phonetic transliteration of Tibetan script, typically as part of a larger Tibetan translation ecosystem. |
|
|
|
### Direct Use |
|
|
|
To use the model for transliteration in a Python script, you can use the `transformers` library like so:
|
|
|
```python
from transformers import pipeline

transliterator = pipeline('translation', model='billingsmoore/tibetan-phonetic-transliteration')

transliterated_text = transliterator(<string of unicode Tibetan script>)
```
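
The `translation` pipeline returns a list with one dictionary per input, and the output text itself sits under the `translation_text` key. A minimal usage sketch (the example input here is the mani mantra in unicode Tibetan script; any unicode Tibetan string works):

```python
# Example input: the mani mantra in unicode Tibetan script
result = transliterator('ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ།')

# The pipeline returns a list of dicts; the transliteration is under 'translation_text'
print(result[0]['translation_text'])
```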
|
|
|
### Downstream Use |
|
|
|
The model can be finetuned for a specific use case using the following code. |
|
|
|
```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    Adafactor,
)
from accelerate import Accelerator

# Load your dataset and hold out 10% of it for evaluation
dataset = load_dataset(<your dataset>)
dataset = dataset['train'].train_test_split(test_size=0.1)

checkpoint = "billingsmoore/tibetan-phonetic-transliteration"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")
# Pass the model object so the collator can prepare decoder input ids
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

source_lang = 'bo'    # unicode Tibetan script
target_lang = 'phon'  # phonetic transliteration

def preprocess_function(examples):
    inputs = [example for example in examples[source_lang]]
    targets = [example for example in examples[target_lang]]

    model_inputs = tokenizer(inputs, text_target=targets, max_length=256, truncation=True, padding="max_length")

    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    num_train_epochs=5
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator
)

trainer.train()
```
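
After training completes, you will likely want to save the finetuned model and tokenizer so they can be loaded with the same `pipeline` call shown under Direct Use. A minimal sketch, assuming an arbitrary local directory name:

```python
# Save the finetuned weights and tokenizer to a local directory (name is arbitrary)
save_dir = './tibetan-transliteration-finetuned'
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

# The saved directory can then stand in for the hub checkpoint:
# transliterator = pipeline('translation', model=save_dir)
```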
|
|
|
## Bias, Risks, and Limitations |
|
|
|
This model was trained exclusively on material from the Tibetan Buddhist canon and thus on Literary Tibetan.
It may not perform satisfactorily on texts from other corpora or in other dialects of Tibetan.
|
|
|
### Recommendations |
|
|
|
If you wish to use the model on other kinds of texts, I recommend further finetuning it on your own dataset using the instructions above.
|
|
|
## Training Details |
|
|
|
This model was trained on 98,597 pairs of text, where the first member of each pair is a line of unicode Tibetan script and the second (the target) is the phonetic transliteration of the first.
This dataset was scraped from Lotsawa House and is released on Kaggle and Hugging Face under the same license as the texts from which it is sourced.
|
[You can find this dataset and more information on Kaggle by clicking here.](https://www.kaggle.com/datasets/billingsmoore/tibetan-phonetic-transliteration-pairs) |
|
[You can find this dataset and more information on Huggingface by clicking here.](https://huggingface.co/datasets/billingsmoore/tibetan-phonetic-transliteration-dataset) |
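
To inspect the data before finetuning or evaluation, you can load the Hugging Face version and look at a pair. A minimal sketch, assuming the default `train` split and the same `bo`/`phon` columns used in the finetuning example above:

```python
from datasets import load_dataset

# Load the transliteration pairs from the Hugging Face Hub
dataset = load_dataset('billingsmoore/tibetan-phonetic-transliteration-dataset')

# Each example pairs a line of unicode Tibetan script ('bo')
# with its phonetic transliteration ('phon')
print(dataset['train'][0])
```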
|
|
|
This model was trained for five epochs. Further information regarding training can be found in the documentation of the [MLotsawa repository](https://github.com/billingsmoore/MLotsawa). |
|
|
|
## Model Card Contact |
|
|
|
billingsmoore [at] gmail [dot] com |