## Model variations

With the motivation to increase the accuracy obtained with the baseline implementation, a transfer learning
strategy was implemented, under the assumption that the small amount of data available for training was insufficient for adequate embedding training.
In this context, two approaches were considered:

i) pre-training word embeddings using similar datasets for text classification;
ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
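A minimal sketch of approach (i), using a toy 3-dimensional vector table in place of the actual pre-trained Word2Vec weights (all names here are illustrative, not the project's code):

```python
# Toy stand-in for pre-trained word vectors; the real approach loads
# Word2Vec weights trained on similar text-classification corpora.
PRETRAINED = {
    "research": [0.1, 0.3, 0.5],
    "funding":  [0.2, 0.0, 0.4],
}

def doc_vector(tokens, table, dim=3):
    """Mean-pool the vectors of known tokens into one document vector."""
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Approach (ii) replaces this static lookup with contextualized embeddings produced by a Longformer encoder.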

Another 24 smaller models were released afterward.

The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti).
The models that use Word2Vec and Longformer also need to be loaded, and their weights are the following:

Longformer: 10.88 GB

Word2Vec: 56.1 MB

Table 1:

| Model                        | #params | Language |
|------------------------------|:-------:|:--------:|
| [`mcti-base-uncased`]        | 110M    | English  |
| [`mcti-large-cased`]         | 110M    | Chinese  |
| [`-base-multilingual-cased`] | 110M    | Multiple |

Table 2: Compatibility results (*base = labeled MCTI dataset entries)

| Dataset      | Compatibility |
|--------------|:-------------:|
| Labeled MCTI |     100%      |
| Full MCTI    |     100%      |

### Limitations and bias

This model is uncased: it does not make a difference between english and English.
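As a minimal illustration of what "uncased" means in practice (a sketch, not the model's actual tokenizer code):

```python
def normalize_uncased(text: str) -> str:
    """Lowercase input text, as uncased models do before tokenization."""
    return text.lower()

# After normalization, "english" and "English" are indistinguishable
# to the model.
```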

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:

## Training data

The [inputted training](https://github.com/chap0lin/PPF-MCTI/tree/master/Datasets) data was obtained with scraping techniques over 30 different platforms, e.g. The Royal Society and the Annenberg Foundation, and contained 928 labeled entries (928 rows x 21 columns). Of the data gathered, only the main text content (column u) was used. Text content averages 800 tokens in length, but with high variance, reaching up to 5,000 tokens.
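The length statistics above can be reproduced with a simple whitespace token count. This is a hedged sketch (the project may count tokens with its own tokenizer); `texts` stands in for the main text column (column u):

```python
def token_lengths(texts):
    """Number of whitespace-separated tokens per document."""
    return [len(t.split()) for t in texts]

# Stand-in documents; the real corpus has 928 entries.
texts = ["a short scraped opportunity text", "another entry"]
lengths = token_lengths(texts)
average = sum(lengths) / len(lengths)
longest = max(lengths)
```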

## Training procedure

Several Python packages were used to develop the preprocessing code:

Table 3: Python packages used

| Objective                                    | Package      |
|----------------------------------------------|--------------|
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
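For illustration, a tiny stand-in for what the `contractions` package does: expand contractions through a lookup table. The real package ships a far larger mapping and also handles slang; this dictionary is only a sketch.

```python
# Illustrative only: the real `contractions` package covers many more forms.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "you're": "you are"}

def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""
    return " ".join(CONTRACTIONS.get(word.lower(), word)
                    for word in text.split())
```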

As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing step, code was created to build and evaluate 8 (eight) different
bases, derived from the base of goal 4, by applying the methods shown in Figure 2.
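A hypothetical sketch of how such derived bases can be built by composing preprocessing steps over the original texts (step and base names here are illustrative, not the notebook's):

```python
def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    return "".join(c for c in text if c.isalnum() or c.isspace())

# Each derived base applies a different stack of steps to the original texts.
PIPELINES = {
    "Base": [],                            # original texts, unchanged
    "xp1":  [lowercase],
    "xp2":  [lowercase, strip_punctuation],
}

def build_base(texts, steps):
    """Apply the given preprocessing steps, in order, to every text."""
    for step in steps:
        texts = [step(t) for t in texts]
    return texts
```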

Table 4: Preprocessing methods evaluated

| id   | Experiments    |
|------|----------------|
| Base | Original Texts |

All eight bases were evaluated to classify the eligibility of the opportunity with a shallow
neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated, and the results are
shown in Table 5.

Table 5: Results obtained in preprocessing

| id   | Experiment     | accuracy | f1-score | recall | precision | Mean (s) | N_tokens | max_length |
|------|----------------|----------|----------|--------|-----------|----------|----------|------------|
| Base | Original Texts | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772  | 23788    | 5636       |

The obtained results with related metrics are shown in Table 6. With this implementation, we achieved new levels of accuracy: 86% for the CNN
architecture and 88% for the LSTM architecture.

Table 6: Results from pre-trained WE + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
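As a sanity check on the table above, the F1 score is the harmonic mean of precision and recall; for the NN row this recovers the reported value up to rounding:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# NN row: precision 0.8392, recall 0.8712 -> F1 of roughly 0.855
```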

At this point, only computational power was needed to fine-tune the weights. The results with related metrics can be viewed in Table 7.
This approach achieved adequate accuracy scores, above 82% in all implementation architectures.

Table 7: Results from pre-trained Longformer + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8754   | 0.7950    | 0.9773 |