Update README.md
README.md CHANGED

@@ -1,12 +1,17 @@
# ***LegalNLP*** - Natural Language Processing Methods for the Brazilian Legal Language ⚖️

- ### The library of Natural Language Processing for Brazilian legal language, *LegalNLP*, was born in a partnership between Brazilian researchers and the legal tech Tikal Tech based in São Paulo, Brazil. Besides containing pre-trained language models for the Brazilian legal language, ***LegalNLP*** provides functions that can facilitate the manipulation of legal texts in Portuguese and demonstration/tutorials to help people in their own work.

You can access our paper by clicking [**here**](https://arxiv.org/abs/2110.15709).

If you use our library in your academic work, please cite us in the following way:

-

--------------

@@ -22,21 +27,21 @@ If you use our library in your academic work, please cite us in the following way
    2. [ Word2Vec/Doc2Vec ](#3.2)
    3. [ FastText ](#3.3)
    4. [ BERTikal ](#3.4)
- 4. [ Demonstrations/Tutorials](#4)
5. [ References](#5)

--------------

<a name="0"></a>
- ## 0\.

All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).

- Some models can be download directly using our function `get_premodel

- Please contact *[email protected]* if you have

--------------

@@ -53,7 +58,8 @@ $ pip install git+https://github.com/felipemaiapolo/legalnlp
You can load all our functions by running the following command:

```python
- from legalnlp import *
```

@@ -109,7 +115,7 @@ Function to download a pre-trained model in the same folder as the file that is
- **model = "wdoc"**: Download the Word2Vec and Doc2Vec pre-trained models in a .zip file and unzip it. It contains two files: one with a size-100 Doc2Vec Distributed Memory / Word2Vec Continuous Bag-of-Words (CBOW) embeddings model, and the other with a size-100 Doc2Vec Distributed Bag-of-Words (DBOW) / Word2Vec Skip-Gram (SG) embeddings model.
- **model = "fasttext"**: Download a .zip file containing the size-100 FastText CBOW/SG models and unzip it.
- **model = "phraser"**: Download the Phraser pre-trained models in a .zip file and unzip it. It contains two files, phraser1 and phraser2. We explain how to use them in Section [ Phraser ](#3.1).
- - **model = "w2vnilc"**: Download size 100 Word2Vec CBOW model trained by "Núcleo Interinstitucional de Linguística Computacional" embeddings model in a .zip file and unzip it. [Click here for more details](http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc).
- **model = "neuralmindbase"**: Download a .zip file containing the base BERT model (PyTorch) trained by NeuralMind and unzip it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
- **model = "neuralmindlarge"**: Download a .zip file containing the large BERT model (PyTorch) trained by NeuralMind and unzip it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).

@@ -121,7 +127,7 @@ Function to download a pre-trained model in the same folder as the file that is

#### 2.2.1\. `extract_features_bert(path_model, path_tokenizer, data, gpu=True)`

- Function for extracting features with the BERT model (This function is not accessed through the package installation, but you can find it [here](https://github.com/

**Input:**

@@ -339,7 +345,7 @@ Below we have a summary table with some important information about the trained models

| Filenames | Architecture | Sizes | Windows |
-
| ```fasttext_cbow*``` | Continuous Bag-of-Words (CBOW) | 100, 200, 300 | 15 |
| ```fasttext_sg*``` | Skip-Gram (SG) | 100, 200, 300 | 15 |

@@ -432,12 +438,12 @@ bert_model = BertModel.from_pretrained('model_bertikal/')

For a better understanding of the application of these models, below are the links to notebooks where we apply them to a legal dataset using various classification models such as Logistic Regression and CatBoost:

- - **BERT notebook** :
- [
- - **Word2Vec notebook** :
- [
- - **Doc2Vec notebook** :
- [

# ***LegalNLP*** - Natural Language Processing Methods for the Brazilian Legal Language ⚖️

+ ### The library of Natural Language Processing for Brazilian legal language, *LegalNLP*, was born in a partnership between Brazilian researchers and the legal tech [Tikal Tech](https://www.tikal.tech) based in São Paulo, Brazil. Besides containing pre-trained language models for the Brazilian legal language, ***LegalNLP*** provides functions that can facilitate the manipulation of legal texts in Portuguese and demonstrations/tutorials to help people in their own work.

You can access our paper by clicking [**here**](https://arxiv.org/abs/2110.15709).

If you use our library in your academic work, please cite us in the following way:

+ @article{polo2021legalnlp,
+   title={LegalNLP--Natural Language Processing methods for the Brazilian Legal Language},
+   author={Polo, Felipe Maia and Mendon{\c{c}}a, Gabriel Caiaffa Floriano and Parreira, Kau{\^e} Capellato J and Gianvechio, Lucka and Cordeiro, Peterson and Ferreira, Jonathan Batista and de Lima, Leticia Maria Paz and Maia, Ant{\^o}nio Carlos do Amaral and Vicente, Renato},
+   journal={arXiv preprint arXiv:2110.15709},
+   year={2021}
+ }

--------------

    2. [ Word2Vec/Doc2Vec ](#3.2)
    3. [ FastText ](#3.3)
    4. [ BERTikal ](#3.4)
+ 4. [ Demonstrations / Tutorials](#4)
5. [ References](#5)

--------------

<a name="0"></a>
+ ## 0\. Accessing the Language Models

All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).

+ Some models can be downloaded directly using our function `get_premodel` (more details in Section [Other Functions](#2.2)).

+ Please contact *[email protected]* if you have any problem accessing the language models.

--------------

You can load all our functions by running the following command:

```python
+ from legalnlp.clean_functions import *
+ from legalnlp.get_premodel import *
```
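
As a quick illustration, a minimal sketch of these imports in use; the `clean` helper and the sample string are assumptions of this sketch, not guarantees of the API:

```python
from legalnlp.clean_functions import *

# Assumption: `clean` is among the exported cleaning helpers
raw_text = "EMENTA: AGRAVO DE INSTRUMENTO. Fls. 12/14."
print(clean(raw_text))
```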

- **model = "wdoc"**: Download the Word2Vec and Doc2Vec pre-trained models in a .zip file and unzip it. It contains two files: one with a size-100 Doc2Vec Distributed Memory / Word2Vec Continuous Bag-of-Words (CBOW) embeddings model, and the other with a size-100 Doc2Vec Distributed Bag-of-Words (DBOW) / Word2Vec Skip-Gram (SG) embeddings model.
- **model = "fasttext"**: Download a .zip file containing the size-100 FastText CBOW/SG models and unzip it.
- **model = "phraser"**: Download the Phraser pre-trained models in a .zip file and unzip it. It contains two files, phraser1 and phraser2. We explain how to use them in Section [ Phraser ](#3.1).
+ - **model = "w2vnilc"**: Download the size-100 Word2Vec CBOW embeddings model trained by the "Núcleo Interinstitucional de Linguística Computacional - USP" in a .zip file and unzip it. [Click here for more details](http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc).
- **model = "neuralmindbase"**: Download a .zip file containing the base BERT model (PyTorch) trained by NeuralMind and unzip it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
- **model = "neuralmindlarge"**: Download a .zip file containing the large BERT model (PyTorch) trained by NeuralMind and unzip it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).

#### 2.2.1\. `extract_features_bert(path_model, path_tokenizer, data, gpu=True)`

+ Function for extracting features with the BERT model (this function is not accessed through the package installation, but you can find it [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/extract_features_bert.ipynb)).

**Input:**
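
As a rough illustration of what such an extractor can look like, a hedged sketch using Hugging Face transformers; the [CLS] pooling and the per-text loop are assumptions of this sketch, not necessarily the notebook's implementation:

```python
import torch
from transformers import BertModel, BertTokenizer

def extract_features_bert_sketch(path_model, path_tokenizer, data, gpu=True):
    """Illustrative only: encode each text and keep its [CLS] vector."""
    device = torch.device("cuda" if gpu and torch.cuda.is_available() else "cpu")
    tokenizer = BertTokenizer.from_pretrained(path_tokenizer)
    model = BertModel.from_pretrained(path_model).to(device).eval()

    features = []
    with torch.no_grad():
        for text in data:
            enc = tokenizer(text, truncation=True, max_length=512,
                            return_tensors="pt").to(device)
            out = model(**enc)
            # Keep the [CLS] token embedding as the document feature vector
            features.append(out.last_hidden_state[:, 0, :].squeeze(0).cpu())
    return torch.stack(features)
```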

| Filenames | Architecture | Sizes | Windows |
+ |:-------------------:|:------------------------------:|:-------------:|:-------:|
| ```fasttext_cbow*``` | Continuous Bag-of-Words (CBOW) | 100, 200, 300 | 15 |
| ```fasttext_sg*``` | Skip-Gram (SG) | 100, 200, 300 | 15 |
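
To give a sense of how these files are used, a hedged sketch loading one of the FastText models with gensim; the filename, and the assumption that the files are gensim-native saves rather than Facebook-format binaries (which would need `gensim.models.fasttext.load_facebook_model`), are guesses based on the pattern above:

```python
from gensim.models import FastText

# Assumed filename following the fasttext_cbow* pattern, size 100
ft = FastText.load("fasttext_cbow_s100.model")

# Nearest neighbours of a legal term in the embedding space
print(ft.wv.most_similar("direito", topn=5))
```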

For a better understanding of the application of these models, below are the links to notebooks where we apply them to a legal dataset using various classification models such as Logistic Regression and CatBoost:

+ - **BERT notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/BERT_TUTORIAL.ipynb)
+ - **Word2Vec notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/Word2Vec/Word2Vec_TUTORIAL.ipynb)
+ - **Doc2Vec notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/Doc2Vec/Doc2Vec_TUTORIAL.ipynb)
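
The notebooks above start from models loaded the same way as earlier in this README; for BERTikal that is the `bert_model = BertModel.from_pretrained('model_bertikal/')` line quoted above, restated here with an assumed tokenizer path:

```python
from transformers import BertModel, BertTokenizer

# Paths assume BERTikal was downloaded and unzipped as described above
bert_tokenizer = BertTokenizer.from_pretrained('model_bertikal/')
bert_model = BertModel.from_pretrained('model_bertikal/')
```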