Update README.md
README.md CHANGED

@@ -1,12 +1,17 @@
# ***LegalNLP*** - Natural Language Processing Methods for the Brazilian Legal Language ⚖️

- ### The library of Natural Language Processing for Brazilian legal language, *LegalNLP*, was born in a partnership between Brazilian researchers and the legal tech Tikal Tech based in São Paulo, Brazil. Besides containing pre-trained language models for the Brazilian legal language, ***LegalNLP*** provides functions that can facilitate the manipulation of legal texts in Portuguese and demonstration/tutorials to help people in their own work.

You can access our paper by clicking [**here**](https://arxiv.org/abs/2110.15709).

If you use our library in your academic work, please cite us in the following way:

-

--------------

@@ -22,21 +27,21 @@ If you use our library in your academic work, please cite us in the following way
    2. [ Word2Vec/Doc2Vec ](#3.2)
    3. [ FastText ](#3.3)
    4. [ BERTikal ](#3.4)
- 4. [ Demonstrations/Tutorials](#4)
5. [ References](#5)

--------------

<a name="0"></a>
- ## 0\.

All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).

- Some models can be download directly using our function `get_premodel

- Please contact *[email protected]* if you have

--------------

@@ -53,7 +58,8 @@ $ pip install git+https://github.com/felipemaiapolo/legalnlp
You can load all our functions by running the following command:

```python
- from legalnlp import *
```

@@ -109,7 +115,7 @@ Function to download a pre-trained model in the same folder as the file that is
- **model = "wdoc"**: Download the Word2Vec and Doc2Vec pre-trained models in a .zip file and unzip it. It contains two files: one with a size-100 Doc2Vec Distributed Memory / Word2Vec Continuous Bag-of-Words (CBOW) embeddings model, and the other with a size-100 Doc2Vec Distributed Bag-of-Words (DBOW) / Word2Vec Skip-Gram (SG) embeddings model.
- **model = "fasttext"**: Download a .zip file containing the size-100 FastText CBOW/SG models and unzip it.
- **model = "phraser"**: Download the Phraser pre-trained models in a .zip file and unzip it. It contains two files, phraser1 and phraser2. We explain how to use them in Section [ Phraser ](#3.1).
- - **model = "w2vnilc"**: Download size 100 Word2Vec CBOW model trained by "Núcleo Interinstitucional de Linguística Computacional" embeddings model in a .zip file and unzip it. [Click here for more details](http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc).
- **model = "neuralmindbase"**: Download a .zip file containing the base BERT model (PyTorch) trained by NeuralMind and unzip it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
- **model = "neuralmindlarge"**: Download a .zip file containing the large BERT model (PyTorch) trained by NeuralMind and unzip it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).

@@ -121,7 +127,7 @@ Function to download a pre-trained model in the same folder as the file that is

#### 2.2.1\. `extract_features_bert(path_model, path_tokenizer, data, gpu=True)`

- Function for extracting features with the BERT model (This function is not accessed through the package installation, but you can find it [here](https://github.com/

**Input:**

@@ -339,7 +345,7 @@ Below we have a summary table with some important information about the trained models

| Filenames | Architecture | Sizes | Windows |
-
| ```fasttext_cbow*``` | Continuous Bag-of-Words (CBOW) | 100, 200, 300 | 15 |
| ```fasttext_sg*``` | Skip-Gram (SG) | 100, 200, 300 | 15 |

@@ -432,12 +438,12 @@ bert_model = BertModel.from_pretrained('model_bertikal/')

For a better understanding of the application of these models, below are the links to notebooks where we apply them to a legal dataset using various classification models such as Logistic Regression and CatBoost:

- - **BERT notebook** :
- [
- - **Word2Vec notebook** :
- [
- - **Doc2Vec notebook** :
- [

# ***LegalNLP*** - Natural Language Processing Methods for the Brazilian Legal Language ⚖️

+ ### The library of Natural Language Processing for Brazilian legal language, *LegalNLP*, was born in a partnership between Brazilian researchers and the legal tech [Tikal Tech](https://www.tikal.tech) based in São Paulo, Brazil. Besides containing pre-trained language models for the Brazilian legal language, ***LegalNLP*** provides functions that can facilitate the manipulation of legal texts in Portuguese and demonstrations/tutorials to help people in their own work.

You can access our paper by clicking [**here**](https://arxiv.org/abs/2110.15709).

If you use our library in your academic work, please cite us in the following way:

+ @article{polo2021legalnlp,
+   title={LegalNLP--Natural Language Processing methods for the Brazilian Legal Language},
+   author={Polo, Felipe Maia and Mendon{\c{c}}a, Gabriel Caiaffa Floriano and Parreira, Kau{\^e} Capellato J and Gianvechio, Lucka and Cordeiro, Peterson and Ferreira, Jonathan Batista and de Lima, Leticia Maria Paz and Maia, Ant{\^o}nio Carlos do Amaral and Vicente, Renato},
+   journal={arXiv preprint arXiv:2110.15709},
+   year={2021}
+ }

--------------

    2. [ Word2Vec/Doc2Vec ](#3.2)
    3. [ FastText ](#3.3)
    4. [ BERTikal ](#3.4)
+ 4. [ Demonstrations / Tutorials](#4)
5. [ References](#5)

--------------

<a name="0"></a>
+ ## 0\. Accessing the Language Models

All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).

+ Some models can be downloaded directly using our function `get_premodel` (more details in Section [Other Functions](#2.2)).

+ Please contact *[email protected]* if you have any problem accessing the language models.

--------------

You can load all our functions by running the following command:

```python
+ from legalnlp.clean_functions import *
+ from legalnlp.get_premodel import *
```
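
As a quick illustration, a minimal sketch of these imports in use; the `clean` helper and the sample string are assumptions of this sketch, not guarantees of the API:

```python
from legalnlp.clean_functions import *

# Assumption: `clean` is among the exported cleaning helpers
raw_text = "EMENTA: AGRAVO DE INSTRUMENTO. Fls. 12/14."
print(clean(raw_text))
```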

- **model = "wdoc"**: Download the Word2Vec and Doc2Vec pre-trained models in a .zip file and unzip it. It contains two files: one with a size-100 Doc2Vec Distributed Memory / Word2Vec Continuous Bag-of-Words (CBOW) embeddings model, and the other with a size-100 Doc2Vec Distributed Bag-of-Words (DBOW) / Word2Vec Skip-Gram (SG) embeddings model.
- **model = "fasttext"**: Download a .zip file containing the size-100 FastText CBOW/SG models and unzip it.
- **model = "phraser"**: Download the Phraser pre-trained models in a .zip file and unzip it. It contains two files, phraser1 and phraser2. We explain how to use them in Section [ Phraser ](#3.1).
+ - **model = "w2vnilc"**: Download the size-100 Word2Vec CBOW embeddings model trained by the "Núcleo Interinstitucional de Linguística Computacional - USP" in a .zip file and unzip it. [Click here for more details](http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc).
- **model = "neuralmindbase"**: Download a .zip file containing the base BERT model (PyTorch) trained by NeuralMind and unzip it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
- **model = "neuralmindlarge"**: Download a .zip file containing the large BERT model (PyTorch) trained by NeuralMind and unzip it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).

#### 2.2.1\. `extract_features_bert(path_model, path_tokenizer, data, gpu=True)`

+ Function for extracting features with the BERT model (this function is not accessed through the package installation, but you can find it [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/extract_features_bert.ipynb)).

**Input:**
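
As a rough illustration of what such an extractor can look like, a hedged sketch using Hugging Face transformers; the [CLS] pooling and the per-text loop are assumptions of this sketch, not necessarily the notebook's implementation:

```python
import torch
from transformers import BertModel, BertTokenizer

def extract_features_bert_sketch(path_model, path_tokenizer, data, gpu=True):
    """Illustrative only: encode each text and keep its [CLS] vector."""
    device = torch.device("cuda" if gpu and torch.cuda.is_available() else "cpu")
    tokenizer = BertTokenizer.from_pretrained(path_tokenizer)
    model = BertModel.from_pretrained(path_model).to(device).eval()

    features = []
    with torch.no_grad():
        for text in data:
            enc = tokenizer(text, truncation=True, max_length=512,
                            return_tensors="pt").to(device)
            out = model(**enc)
            # Keep the [CLS] token embedding as the document feature vector
            features.append(out.last_hidden_state[:, 0, :].squeeze(0).cpu())
    return torch.stack(features)
```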

| Filenames | Architecture | Sizes | Windows |
+ |:-------------------:|:------------------------------:|:-------------:|:-------:|
| ```fasttext_cbow*``` | Continuous Bag-of-Words (CBOW) | 100, 200, 300 | 15 |
| ```fasttext_sg*``` | Skip-Gram (SG) | 100, 200, 300 | 15 |
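
To give a sense of how these files are used, a hedged sketch loading one of the FastText models with gensim; the filename, and the assumption that the files are gensim-native saves rather than Facebook-format binaries (which would need `gensim.models.fasttext.load_facebook_model`), are guesses based on the pattern above:

```python
from gensim.models import FastText

# Assumed filename following the fasttext_cbow* pattern, size 100
ft = FastText.load("fasttext_cbow_s100.model")

# Nearest neighbours of a legal term in the embedding space
print(ft.wv.most_similar("direito", topn=5))
```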

For a better understanding of the application of these models, below are the links to notebooks where we apply them to a legal dataset using various classification models such as Logistic Regression and CatBoost:

+ - **BERT notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/BERT_TUTORIAL.ipynb)
+ - **Word2Vec notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/Word2Vec/Word2Vec_TUTORIAL.ipynb)
+ - **Doc2Vec notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/Doc2Vec/Doc2Vec_TUTORIAL.ipynb)
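
The notebooks above start from models loaded the same way as earlier in this README; for BERTikal that is the `bert_model = BertModel.from_pretrained('model_bertikal/')` line quoted above, restated here with an assumed tokenizer path:

```python
from transformers import BertModel, BertTokenizer

# Paths assume BERTikal was downloaded and unzipped as described above
bert_tokenizer = BertTokenizer.from_pretrained('model_bertikal/')
bert_model = BertModel.from_pretrained('model_bertikal/')
```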