Update README.md
README.md CHANGED
@@ -63,10 +63,8 @@ With the motivation to increase accuracy obtained with baseline implementation,
 strategy under the assumption that small data available for training was insufficient for adequate embedding training.
 In this context, two approaches were considered:

-
-
-
-Other 24 smaller models are released afterward.
+- Pre-training word embeddings using similar datasets for text classification;
+- Using transformers and attention mechanisms (Longformer) to create contextualized embeddings.

 The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti) on GitHub.

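The first approach added in the hunk above (pre-training word embeddings on similar datasets) could be exercised with gensim. The sketch below is only illustrative: the auxiliary corpus, vector size, and other hyperparameters are assumptions, not the project's actual settings.

```python
from gensim.models import Word2Vec

# similar_corpus: tokenized documents from a related text-classification dataset
# (hypothetical placeholder; the actual auxiliary datasets are not listed here).
similar_corpus = [["research", "funding", "opportunity"], ["call", "for", "proposals"]]

# Pre-train 100-dimensional word embeddings on the auxiliary corpus.
w2v = Word2Vec(sentences=similar_corpus, vector_size=100, window=5, min_count=1, epochs=10)
w2v.wv.save("pretrained_embeddings.kv")  # reuse later to initialize a classifier's embedding layer
```

The second approach (Longformer contextualized embeddings) could be tried with the Hugging Face transformers library. The snippet below is a minimal sketch assuming the public allenai/longformer-base-4096 checkpoint and mean pooling over token states, which may differ from the configuration actually used in this project.

```python
import torch
from transformers import LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "Example funding opportunity description ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Give the first token global attention, as is common for classification-style use.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)

# Mean-pool the token states into a single document embedding (shape: [1, 768]).
doc_embedding = outputs.last_hidden_state.mean(dim=1)
```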
@@ -194,7 +192,7 @@ Pre-processing was used to standardize the texts for the English language, reduc
 optimize the training of the models.

 The following assumptions were considered:
-- The Data Entry base is obtained from the result of
+- The Data Entry base is obtained from the result of Goal 4.
 - Labeling (Goal 4) is considered true for accuracy measurement purposes;
 - Preprocessing experiments compare accuracy in a shallow neural network (SNN);
 - Pre-processing was investigated for the classification goal.
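For reference, a shallow neural network of the kind mentioned in the assumptions above could look like the PyTorch module below. The architecture and sizes are illustrative assumptions, since the README does not specify the project's actual SNN.

```python
import torch
import torch.nn as nn

class ShallowTextClassifier(nn.Module):
    """Illustrative shallow classifier: an embedding bag followed by one linear layer."""

    def __init__(self, vocab_size: int = 20000, embed_dim: int = 100, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)  # averages token embeddings
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        return self.fc(self.embedding(token_ids, offsets))

# Each preprocessing experiment (e.g. XP8) would feed its own tokenization of the texts
# into the same architecture, and the resulting accuracies would be compared.
```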
@@ -262,7 +260,8 @@ less number of unique tokens. XP8: It has smaller maximum sizes. In this case, t
 was the computational cost required to train the vector representation models (word-embedding, sentence-embedding,
 document-embedding). The training times are so close that this factor did not carry much weight in the analysis.

-As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the
+As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the
+preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made
 available on the project's GitHub with the inclusion of columns opo_pre (text) and opo_pre_tkn (tokenized).

 ### Pretraining
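Since the hunk above links the preprocessed spreadsheet, a minimal sketch of reading it with pandas follows. It assumes the .xlsx file has been downloaded locally and that the openpyxl engine is installed; the column names come from the text above.

```python
import pandas as pd

# Assumes oportunidades_final_pre_processado.xlsx (linked above) was downloaded locally;
# pandas needs the openpyxl engine to read .xlsx files.
df = pd.read_excel("oportunidades_final_pre_processado.xlsx")

texts = df["opo_pre"]       # preprocessed text in sentence format
tokens = df["opo_pre_tkn"]  # the same text as tokens
print(df.shape, list(df.columns))
```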