---
language: el
tags:
- legal
library_name: transformers
pipeline_tag: fill-mask
widget:
- text: Ο Δικηγόρος κατέθεσε ένα <mask> .
---
# GreekLegalRoBERTa_v3
A Greek legal version of the RoBERTa pre-trained language model.
## Pre-training corpora
The pre-training corpora of `GreekLegalRoBERTa_v3` include:
* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr).
* The Greek Parliament Proceedings corpus [GreekParl](https://proceedings.neurips.cc/paper_files/paper/2022/file/b96ce67b2f2d45e4ab315e13a6b5b9c5-Paper-Datasets_and_Benchmarks.pdf).
* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).
* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων).
* The Greek part of [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/).
* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org).
* The [Raptarchis](https://raptarchis.gov.gr/) Permanent Greek Legislation Code.
## Pre-training details
* We developed the code with [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) library. We publish our code in the [AI-team-UoA GitHub repository](https://github.com/AI-team-UoA/GreekLegalRoBERTa).
* We released a model similar to the English `FacebookAI/roberta-base` model (12 layers, 768 hidden units, 12 attention heads, 125M parameters), targeted at Greek legal applications.
* We trained for 100k steps with a batch size of 4,096 sequences of length 512 and an initial learning rate of 6e-4 (see the sketch below).
* We pretrained our models on 4 V100 GPUs of the Cyclone cluster provided by the Cyprus Research Institute. We would like to express our sincere gratitude to the Cyprus Research Institute for providing us with access to Cyclone; without their support, this work would not have been possible.
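
For illustration, the hyperparameters above map onto the Transformers training API roughly as follows. This is a minimal sketch rather than our exact pretraining script: the corpus file, tokenizer reuse, masking probability, warmup and weight-decay values, and the per-device batch size / gradient-accumulation split (chosen here so that 32 sequences × 4 GPUs × 32 accumulation steps = 4,096 sequences per step) are all illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Hypothetical corpus file; the real pre-training data is the corpora listed above.
dataset = load_dataset("text", data_files={"train": "greek_legal_corpus.txt"})

tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")

def tokenize(batch):
    # Fixed-length 512-token sequences, as in pre-training.
    return tokenizer(batch["text"], truncation=True, max_length=512, padding="max_length")

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# roberta-base-sized architecture: 12 layers, 768 hidden units, 12 heads (~125M parameters).
config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config)

# RoBERTa-style dynamic masking; the 15% masking probability is an assumption.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# 100k steps, effective batch of 4,096 sequences, initial learning rate 6e-4.
args = TrainingArguments(
    output_dir="greek-legal-roberta",
    max_steps=100_000,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=32,
    learning_rate=6e-4,
    warmup_steps=10_000,   # assumed warmup schedule
    weight_decay=0.01,     # assumed
    save_steps=10_000,
    logging_steps=500,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```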
## Requirements
```
pip install torch
pip install tokenizers
pip install transformers[torch]
pip install datasets
```
## Load Pretrained Model
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
model = AutoModel.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
```
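
`AutoModel` returns the encoder's hidden states without any task-specific head, so a common use of the loaded model is extracting contextual embeddings. Below is a minimal sketch that reuses the `tokenizer` and `model` objects from the snippet above; the example sentence and the mean-pooling step are illustrative choices, not part of the model card. For masked-word prediction, use the fill-mask pipeline shown in the next section.

```python
import torch

# Tokenize a sentence and obtain contextual embeddings from the encoder.
text = "Ο Δικηγόρος κατέθεσε ένα έγγραφο ."  # EN: "The lawyer submitted a document."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Last-layer hidden states: (batch_size, sequence_length, hidden_size=768).
token_embeddings = outputs.last_hidden_state

# One simple (illustrative) sentence vector: mean over non-padding tokens.
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```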
## Use Pretrained Model as a Language Model
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the tokenizer, the masked-language-modelling model, and a fill-mask pipeline.
tokenizer_greek = AutoTokenizer.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
unmasker = pipeline('fill-mask', model=lm_model_greek, tokenizer=tokenizer_greek)

# ================ EXAMPLE 1 ================
print("================ EXAMPLE 1 ================")
text_1 = ' O Δικηγορος κατεθεσε ένα <mask> .'
# EN: 'The lawyer submitted a <mask>.'
predictions_1 = unmasker(text_1, top_k=5)
for i, prediction in enumerate(predictions_1):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 1 ================
# Model's answer 1 : letter
# Model's answer 2 : copy
# Model's answer 3 : record
# Model's answer 4 : memorandum
# Model's answer 5 : diagram

# ================ EXAMPLE 2 ================
print("================ EXAMPLE 2 ================")
text_2 = 'Είναι ένας <mask> άνθρωπος.'
# EN: 'He is a <mask> person.'
predictions_2 = unmasker(text_2, top_k=5)
for i, prediction in enumerate(predictions_2):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 2 ================
# Model's answer 1 : new
# Model's answer 2 : capable
# Model's answer 3 : simple
# Model's answer 4 : serious
# Model's answer 5 : small

# ================ EXAMPLE 3 ================
print("================ EXAMPLE 3 ================")
text_3 = 'Είναι ένας <mask> άνθρωπος και κάνει συχνά <mask>.'
# EN: 'He is a <mask> person and he frequently does <mask>.'
# With two masks, the pipeline returns one list of predictions per mask.
predictions_3 = unmasker(text_3, top_k=5)
for i in range(5):
    print("Model's answer " + str(i + 1) + " : "
          + predictions_3[0][i]['token_str'] + " , " + predictions_3[1][i]['token_str'])
# ================ EXAMPLE 3 ================
# Model's answer 1 : simple, trips
# Model's answer 2 : new, vacations
# Model's answer 3 : small, visits
# Model's answer 4 : good, mistakes
# Model's answer 5 : serious, actions
# The most plausible prediction for the second <mask> is "trips".

# ================ EXAMPLE 4 ================
print("================ EXAMPLE 4 ================")
text_4 = ' Kαθορισμός τρόπου αξιολόγησης της επιμελείς των υπαλλήλων που παρακολουθούν προγράμματα επιμόρφωσης και <mask> .'
# EN: 'Determining how to evaluate the diligence of employees attending further training and <mask> programmes.'
predictions_4 = unmasker(text_4, top_k=5)
for i, prediction in enumerate(predictions_4):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 4 ================
# Model's answer 1 : retraining
# Model's answer 2 : specialization
# Model's answer 3 : training
# Model's answer 4 : education
# Model's answer 5 : Retraining
```
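
Instead of going through the fill-mask pipeline, the same predictions can be read off the raw logits at the `<mask>` position. A minimal sketch, reusing `tokenizer_greek` and `lm_model_greek` from the block above (the variable names and the top-5 cutoff are illustrative):

```python
import torch

# Score the <mask> position directly from the model's logits.
text = ' O Δικηγορος κατεθεσε ένα <mask> .'
inputs = tokenizer_greek(text, return_tensors='pt')

with torch.no_grad():
    logits = lm_model_greek(**inputs).logits  # (1, sequence_length, vocab_size)

# Locate the <mask> token and take its top-5 candidate tokens.
mask_index = (inputs['input_ids'][0] == tokenizer_greek.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_index].topk(5, dim=-1).indices[0]
print([tokenizer_greek.decode(token_id).strip() for token_id in top_ids])
```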
## Evaluation on downstream tasks
For detailed results, read the article:
TODO
## Author