MarcosDib committed on
Commit 8565c27 · 1 Parent(s): 42b2aa7

Update README.md

Files changed (1)
  1. README.md +5 -6
README.md CHANGED
@@ -63,10 +63,8 @@ With the motivation to increase accuracy obtained with baseline implementation,
 strategy under the assumption that small data available for training was insufficient for adequate embedding training.
 In this context, was considered two approaches:
 
-i) pre-training word embeddings using similar datasets for text classification;
-ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
-
-Other 24 smaller models are released afterward.
+- Pre-training word embeddings using similar datasets for text classification;
+- Using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
 
 The detailed release history can be found on the [here](https://huggingface.co/unb-lamfo-nlp-mcti) on github.
 
@@ -194,7 +192,7 @@ Pre-processing was used to standardize the texts for the English language, reduc
 optimize the training of the models.
 
 The following assumptions were considered:
-- The Data Entry base is obtained from the result of goal 4.
+- The Data Entry base is obtained from the result of Goal 4.
 - Labeling (Goal 4) is considered true for accuracy measurement purposes;
 - Preprocessing experiments compare accuracy in a shallow neural network (SNN);
 - Pre-processing was investigated for the classification goal.
@@ -262,7 +260,8 @@ less number of unique tokens. XP8: It has smaller maximum sizes. In this case, t
 was the computational cost required to train the vector representation models (word-embedding, sentence-embeddings,
 document-embedding). The training time is so close that it did not have such a large weight for the analysis.
 
-As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made
+As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the
+preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made
 available on the project's GitHub with the inclusion of columns opo_pre (text) and opo_pre_tkn (tokenized).
 
 ### Pretraining
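
For reference, the second bullet added in the first hunk refers to deriving contextualized embeddings from a Longformer. A minimal sketch of what that can look like with the Hugging Face transformers library is below; the allenai/longformer-base-4096 checkpoint and the mean-pooling step are assumptions for illustration, not details taken from this commit.

```python
# Rough sketch of approach (ii): contextualized embeddings from a Longformer.
# Assumptions: the allenai/longformer-base-4096 checkpoint and mean pooling;
# the commit does not specify which weights or pooling the project used.
import torch
from transformers import LongformerModel, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
model.eval()

text = "Example text of a research funding opportunity..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token states into one document vector, shape (1, 768).
doc_embedding = outputs.last_hidden_state.mean(dim=1)
```

Mean pooling is only one plausible choice; the pooled CLS output or task-specific fine-tuning would serve the same purpose.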
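
The last hunk points to the preprocessed spreadsheet with the opo_pre and opo_pre_tkn columns. A minimal sketch of loading it with pandas, assuming the /raw/ form of the linked GitHub URL and that openpyxl is installed; downloading the file manually and passing a local path works the same way.

```python
# Rough sketch of loading the preprocessed spreadsheet referenced in the last hunk.
# Assumption: the /raw/ URL for the linked GitHub blob; openpyxl is required
# for pandas to read .xlsx files.
import pandas as pd

URL = (
    "https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/raw/pre-processamento/"
    "Pre_Processamento/oportunidades_final_pre_processado.xlsx"
)

df = pd.read_excel(URL)

texts = df["opo_pre"]       # preprocessed text in sentence format, one row per record
tokens = df["opo_pre_tkn"]  # tokenized version of the same text
```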