---
language: el
tags:
- legal
library_name: transformers
pipeline_tag: fill-mask
widget:
- text: Ο Δικηγόρος κατέθεσε ένα <mask> .
---
# GreekLegalRoBERTa_v3
A Greek legal version of the RoBERTa pre-trained language model.
## Pre-training corpora
The pre-training corpora of `GreekLegalRoBERTa_v3` include:
* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr).
* The Greek Parliament Proceedings corpus [GreekParl](https://proceedings.neurips.cc/paper_files/paper/2022/file/b96ce67b2f2d45e4ab315e13a6b5b9c5-Paper-Datasets_and_Benchmarks.pdf).
* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).
* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων).
* The Greek part of [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/).
* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org).
* The [Raptarchis](https://raptarchis.gov.gr/) Permanent Greek Legislation Code.
## Pre-training details
* We developed the code with [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) library. We publish our code in the [AI-team-UoA GitHub repository](https://github.com/AI-team-UoA/GreekLegalRoBERTa).
* We released a model similar to the English `FacebookAI/roberta-base` model (12 layers, 768 hidden units, 12 attention heads, 125M parameters), targeted at Greek legal applications.
* We trained for 100k steps with a batch size of 4,096 sequences of length 512 and an initial learning rate of 6e-4 (see the sketch below).
* We pretrained our models on 4 V100 GPUs of the Cyclone cluster provided by the Cyprus Research Institute. We would like to express our sincere gratitude to the Cyprus Research Institute for providing us with access to Cyclone; without their support, this work would not have been possible.
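
For illustration, the hyperparameters above map onto the Transformers training API roughly as follows. This is a minimal sketch rather than our exact pretraining script: the corpus file, tokenizer reuse, masking probability, warmup and weight-decay values, and the per-device batch size / gradient-accumulation split (chosen here so that 32 sequences × 4 GPUs × 32 accumulation steps = 4,096 sequences per step) are all illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Hypothetical corpus file; the real pre-training data is the corpora listed above.
dataset = load_dataset("text", data_files={"train": "greek_legal_corpus.txt"})

tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")

def tokenize(batch):
    # Fixed-length 512-token sequences, as in pre-training.
    return tokenizer(batch["text"], truncation=True, max_length=512, padding="max_length")

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# roberta-base-sized architecture: 12 layers, 768 hidden units, 12 heads (~125M parameters).
config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config)

# RoBERTa-style dynamic masking; the 15% masking probability is an assumption.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# 100k steps, effective batch of 4,096 sequences, initial learning rate 6e-4.
args = TrainingArguments(
    output_dir="greek-legal-roberta",
    max_steps=100_000,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=32,
    learning_rate=6e-4,
    warmup_steps=10_000,   # assumed warmup schedule
    weight_decay=0.01,     # assumed
    save_steps=10_000,
    logging_steps=500,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```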
## Requirements
```
pip install torch
pip install tokenizers
pip install transformers[torch]
pip install datasets
```
## Load Pretrained Model
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
model = AutoModel.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
```
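
`AutoModel` returns the encoder's hidden states without any task-specific head, so a common use of the loaded model is extracting contextual embeddings. Below is a minimal sketch that reuses the `tokenizer` and `model` objects from the snippet above; the example sentence and the mean-pooling step are illustrative choices, not part of the model card. For masked-word prediction, use the fill-mask pipeline shown in the next section.

```python
import torch

# Tokenize a sentence and obtain contextual embeddings from the encoder.
text = "Ο Δικηγόρος κατέθεσε ένα έγγραφο ."  # EN: "The lawyer submitted a document."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Last-layer hidden states: (batch_size, sequence_length, hidden_size=768).
token_embeddings = outputs.last_hidden_state

# One simple (illustrative) sentence vector: mean over non-padding tokens.
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```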
## Use Pretrained Model as a Language Model
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the tokenizer, the masked-language-modelling model, and a fill-mask pipeline.
tokenizer_greek = AutoTokenizer.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
unmasker = pipeline('fill-mask', model=lm_model_greek, tokenizer=tokenizer_greek)

# ================ EXAMPLE 1 ================
print("================ EXAMPLE 1 ================")
text_1 = ' O Δικηγορος κατεθεσε ένα <mask> .'
# EN: 'The lawyer submitted a <mask>.'
predictions_1 = unmasker(text_1, top_k=5)
for i, prediction in enumerate(predictions_1):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 1 ================
# Model's answer 1 : letter
# Model's answer 2 : copy
# Model's answer 3 : record
# Model's answer 4 : memorandum
# Model's answer 5 : diagram

# ================ EXAMPLE 2 ================
print("================ EXAMPLE 2 ================")
text_2 = 'Είναι ένας <mask> άνθρωπος.'
# EN: 'He is a <mask> person.'
predictions_2 = unmasker(text_2, top_k=5)
for i, prediction in enumerate(predictions_2):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 2 ================
# Model's answer 1 : new
# Model's answer 2 : capable
# Model's answer 3 : simple
# Model's answer 4 : serious
# Model's answer 5 : small

# ================ EXAMPLE 3 ================
print("================ EXAMPLE 3 ================")
text_3 = 'Είναι ένας <mask> άνθρωπος και κάνει συχνά <mask>.'
# EN: 'He is a <mask> person and he frequently does <mask>.'
# With two masks, the pipeline returns one list of predictions per mask.
predictions_3 = unmasker(text_3, top_k=5)
for i in range(5):
    print("Model's answer " + str(i + 1) + " : "
          + predictions_3[0][i]['token_str'] + " , " + predictions_3[1][i]['token_str'])
# ================ EXAMPLE 3 ================
# Model's answer 1 : simple, trips
# Model's answer 2 : new, vacations
# Model's answer 3 : small, visits
# Model's answer 4 : good, mistakes
# Model's answer 5 : serious, actions
# The most plausible prediction for the second <mask> is "trips".

# ================ EXAMPLE 4 ================
print("================ EXAMPLE 4 ================")
text_4 = ' Kαθορισμός τρόπου αξιολόγησης της επιμελείς των υπαλλήλων που παρακολουθούν προγράμματα επιμόρφωσης και <mask> .'
# EN: 'Determining how to evaluate the diligence of employees attending further training and <mask> programmes.'
predictions_4 = unmasker(text_4, top_k=5)
for i, prediction in enumerate(predictions_4):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 4 ================
# Model's answer 1 : retraining
# Model's answer 2 : specialization
# Model's answer 3 : training
# Model's answer 4 : education
# Model's answer 5 : Retraining
```
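
Instead of going through the fill-mask pipeline, the same predictions can be read off the raw logits at the `<mask>` position. A minimal sketch, reusing `tokenizer_greek` and `lm_model_greek` from the block above (the variable names and the top-5 cutoff are illustrative):

```python
import torch

# Score the <mask> position directly from the model's logits.
text = ' O Δικηγορος κατεθεσε ένα <mask> .'
inputs = tokenizer_greek(text, return_tensors='pt')

with torch.no_grad():
    logits = lm_model_greek(**inputs).logits  # (1, sequence_length, vocab_size)

# Locate the <mask> token and take its top-5 candidate tokens.
mask_index = (inputs['input_ids'][0] == tokenizer_greek.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_index].topk(5, dim=-1).indices[0]
print([tokenizer_greek.decode(token_id).strip() for token_id in top_ids])
```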
## Evaluation on downstream tasks
For detailed results, read the article:
TODO
## Author