|
--- |
|
language: el |
|
tags: |
|
- legal |
|
|
|
library_name: transformers |
|
pipeline_tag: fill-mask |
|
widget: |
|
- text: Ο Δικηγόρος κατέθεσε ένα <mask> . |
|
--- |
|
|
|
# GreekLegalRoBERTa_v3 |
|
|
|
A Greek legal version of the RoBERTa pre-trained language model.
|
|
|
|
|
|
|
## Pre-training corpora |
|
|
|
The pre-training corpora of `GreekLegalRoBERTa_v3` include: |
|
|
|
* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr). |
|
* The Greek Parliament Proceedings, [GreekParl](https://proceedings.neurips.cc/paper_files/paper/2022/file/b96ce67b2f2d45e4ab315e13a6b5b9c5-Paper-Datasets_and_Benchmarks.pdf).
|
* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en). |
|
|
* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων). |
|
* The Greek part of [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/). |
|
* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org). |
|
* The [Raptarchis](https://raptarchis.gov.gr/) collection of permanent Greek legislation.
|
|
|
|
|
## Pre-training details |
|
|
|
* We developed the code using [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) library. We publish our code in the [AI-team-UoA GitHub repository](https://github.com/AI-team-UoA/GreekLegalRoBERTa).
|
* We release a model similar to the English `FacebookAI/roberta-base` model (12-layer, 768-hidden, 12-heads, 125M parameters), adapted to Greek legislative applications.
|
* We train for 100k training steps with a batch size of 4,096 sequences of length 512 and an initial learning rate of 6e-4; a configuration sketch follows this list.
|
* We pre-trained our models on 4 V100 GPUs provided by the Cyprus Research Institute. We would like to express our sincere gratitude to the Cyprus Research Institute for providing us with access to Cyclone; without their support, this work would not have been possible.
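For illustration, here is a minimal sketch of how the architecture and hyperparameters above map onto a `Trainer`-style configuration in Transformers. This is a hypothetical reconstruction, not our actual training script (which lives in the GitHub repository above); the per-device batch size, gradient-accumulation steps, and warmup schedule are assumptions.

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")

# RoBERTa-base architecture: 12 layers, 768 hidden, 12 heads (~125M parameters).
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Dynamic masking with the standard 15% masking probability.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# 100k steps, an effective batch of 4,096 sequences, and a peak learning rate of 6e-4.
# per_device_train_batch_size, gradient_accumulation_steps, warmup_steps, and
# weight_decay are illustrative placeholders, not our exact settings.
training_args = TrainingArguments(
    output_dir="greek-legal-roberta",
    max_steps=100_000,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=32,  # 32 x 32 x 4 GPUs = 4,096 sequences per step
    learning_rate=6e-4,
    warmup_steps=10_000,
    weight_decay=0.01,
)
```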
|
|
|
|
|
## Requirements |
|
|
|
|
|
```bash
|
pip install torch |
|
pip install tokenizers |
|
pip install transformers[torch] |
|
pip install datasets |
|
``` |
|
|
|
## Load Pretrained Model |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3") |
|
model = AutoModel.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3") |
|
``` |
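Continuing from the block above, you can verify that the model loads correctly by encoding a sentence and inspecting the contextual embeddings it produces (a minimal usage sketch; the example sentence is arbitrary):

```python
import torch

# Tokenize a sample sentence and run it through the encoder.
inputs = tokenizer("Ο Δικηγόρος κατέθεσε ένα έγγραφο.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per token in the input.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 12, 768])
```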
|
|
|
## Use Pretrained Model as a Language Model |
|
|
|
```python |
|
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the model and tokenizer, and build a fill-mask pipeline.
tokenizer_greek = AutoTokenizer.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
unmasker = pipeline("fill-mask", model=lm_model_greek, tokenizer=tokenizer_greek)

# ================ EXAMPLE 1 ================
print("================ EXAMPLE 1 ================")

text_1 = 'Ο Δικηγόρος κατέθεσε ένα <mask>.'
# EN: 'The lawyer submitted a <mask>.'

predictions = unmasker(text_1, top_k=5)
for i in range(5):
    print("Model's answer " + str(i + 1) + " : " + predictions[i]['token_str'])
|
#================ EXAMPLE 1 ================ |
|
#Model's answer 1 : letter |
|
#Model's answer 2 : copy |
|
#Model's answer 3 : record |
|
#Model's answer 4 : memorandum |
|
#Model's answer 5 : diagram |
|
|
|
|
|
# ================ EXAMPLE 2 ================ |
|
print("================ EXAMPLE 2 ================") |
|
|
|
text_2 = 'Είναι ένας <mask> άνθρωπος.' |
|
# EN: 'He is a <mask> person.' |
|
predictions = unmasker(text_2, top_k=5)
for i in range(5):
    print("Model's answer " + str(i + 1) + " : " + predictions[i]['token_str'])
|
|
|
#================ EXAMPLE 2 ================ |
|
#Model's answer 1 : new |
|
#Model's answer 2 : capable |
|
#Model's answer 3 : simple |
|
#Model's answer 4 : serious |
|
#Model's answer 5 : small |
|
|
|
|
|
# ================ EXAMPLE 3 ================ |
|
print("================ EXAMPLE 3 ================") |
|
|
|
text_3 = 'Είναι ένας <mask> άνθρωπος και κάνει συχνά <mask>.' |
|
# EN: 'He is a <mask> person and he often does <mask>.'

# With two masks, the pipeline returns one list of predictions per mask.
predictions = unmasker(text_3, top_k=5)
for i in range(5):
    print("Model's answer " + str(i + 1) + " : " + predictions[0][i]['token_str'] + " , " + predictions[1][i]['token_str'])
|
|
|
#================ EXAMPLE 3 ================ |
|
#Model's answer 1 : simple, trips |
|
#Model's answer 2 : new, vacations |
|
#Model's answer 3 : small, visits |
|
#Model's answer 4 : good, mistakes |
|
#Model's answer 5 : serious, actions |
|
|
|
# the most plausible prediction for the second <mask> is "trips" |
|
# ================ EXAMPLE 4 ================ |
|
print("================ EXAMPLE 4 ================") |
|
|
|
text_4 = 'Καθορισμός τρόπου αξιολόγησης της επιμέλειας των υπαλλήλων που παρακολουθούν προγράμματα επιμόρφωσης και <mask>.'
# EN: 'Determining how to evaluate the diligence of employees who attend further-training and <mask> programs.'

predictions = unmasker(text_4, top_k=5)
for i in range(5):
    print("Model's answer " + str(i + 1) + " : " + predictions[i]['token_str'])
|
|
|
#================ EXAMPLE 4 ================ |
|
#Model's answer 1 : retraining |
|
#Model's answer 2 : specialization |
|
#Model's answer 3 : training |
|
#Model's answer 4 : education |
|
#Model's answer 5 : Retraining |
|
|
|
``` |
|
|
|
## Evaluation on downstream tasks |
|
|
|
For detailed results, read the article:
|
|
|
TODO |
|
|
|
|
|
## Author |