Update README.md
README.md CHANGED
@@ -63,10 +63,8 @@ With the motivation to increase accuracy obtained with baseline implementation,
 strategy under the assumption that small data available for training was insufficient for adequate embedding training.
 In this context, two approaches were considered:

-
-
-
-Other 24 smaller models are released afterward.
+- Pre-training word embeddings using similar datasets for text classification;
+- Using transformers and attention mechanisms (Longformer) to create contextualized embeddings.

 The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti) on GitHub.

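The first approach added in the hunk above (pre-training word embeddings on similar datasets) could be exercised with gensim. The sketch below is only illustrative: the auxiliary corpus, vector size, and other hyperparameters are assumptions, not the project's actual settings.

```python
from gensim.models import Word2Vec

# similar_corpus: tokenized documents from a related text-classification dataset
# (hypothetical placeholder; the actual auxiliary datasets are not listed here).
similar_corpus = [["research", "funding", "opportunity"], ["call", "for", "proposals"]]

# Pre-train 100-dimensional word embeddings on the auxiliary corpus.
w2v = Word2Vec(sentences=similar_corpus, vector_size=100, window=5, min_count=1, epochs=10)
w2v.wv.save("pretrained_embeddings.kv")  # reuse later to initialize a classifier's embedding layer
```

The second approach (Longformer contextualized embeddings) could be tried with the Hugging Face transformers library. The snippet below is a minimal sketch assuming the public allenai/longformer-base-4096 checkpoint and mean pooling over token states, which may differ from the configuration actually used in this project.

```python
import torch
from transformers import LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "Example funding opportunity description ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Give the first token global attention, as is common for classification-style use.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)

# Mean-pool the token states into a single document embedding (shape: [1, 768]).
doc_embedding = outputs.last_hidden_state.mean(dim=1)
```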
@@ -194,7 +192,7 @@ Pre-processing was used to standardize the texts for the English language, reduc
 optimize the training of the models.

 The following assumptions were considered:
-- The Data Entry base is obtained from the result of
+- The Data Entry base is obtained from the result of Goal 4.
 - Labeling (Goal 4) is considered true for accuracy measurement purposes;
 - Preprocessing experiments compare accuracy in a shallow neural network (SNN);
 - Pre-processing was investigated for the classification goal.
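For reference, a shallow neural network of the kind mentioned in the assumptions above could look like the PyTorch module below. The architecture and sizes are illustrative assumptions, since the README does not specify the project's actual SNN.

```python
import torch
import torch.nn as nn

class ShallowTextClassifier(nn.Module):
    """Illustrative shallow classifier: an embedding bag followed by one linear layer."""

    def __init__(self, vocab_size: int = 20000, embed_dim: int = 100, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)  # averages token embeddings
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        return self.fc(self.embedding(token_ids, offsets))

# Each preprocessing experiment (e.g. XP8) would feed its own tokenization of the texts
# into the same architecture, and the resulting accuracies would be compared.
```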
@@ -262,7 +260,8 @@ less number of unique tokens. XP8: It has smaller maximum sizes. In this case, t
 was the computational cost required to train the vector representation models (word-embedding, sentence-embedding,
 document-embedding). The training times are so close that this factor did not carry much weight in the analysis.

-As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the
+As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the
+preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made
 available on the project's GitHub with the inclusion of columns opo_pre (text) and opo_pre_tkn (tokenized).

 ### Pretraining
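Since the hunk above links the preprocessed spreadsheet, a minimal sketch of reading it with pandas follows. It assumes the .xlsx file has been downloaded locally and that the openpyxl engine is installed; the column names come from the text above.

```python
import pandas as pd

# Assumes oportunidades_final_pre_processado.xlsx (linked above) was downloaded locally;
# pandas needs the openpyxl engine to read .xlsx files.
df = pd.read_excel("oportunidades_final_pre_processado.xlsx")

texts = df["opo_pre"]       # preprocessed text in sentence format
tokens = df["opo_pre_tkn"]  # the same text as tokens
print(df.shape, list(df.columns))
```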