Update README.md
README.md
# MCTI Text Classification Task (uncased) DRAFT

Disclaimer:

## According to the abstract

Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations
require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification
and other NLP tasks, but they can only take a limited number of symbols as input. This is a real case study that explores
different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method
with good performance. The collected data includes texts of financing opportunities that the international R&D funding organizations
provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by
the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship
of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the
available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence.
Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a
Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as
a successful case of artificial intelligence in a federal government application.

This model focuses on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of the Union budget,
supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was introduced in ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318) and first released in
[this repository](https://huggingface.co/unb-lamfo-nlp-mcti). This model is uncased: it does not make a difference between english
and English.

## Model description

### Model training with Word2Vec embeddings

Now we have a pre-trained model of word2vec embeddings that has already learned relevant meanings for our classification problem.
We can couple it to our classification models (Fig. 4), realizing transfer learning, and then train the model with the labeled
data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training. Table 1 shows the
obtained results with related metrics. With this implementation, we achieved new levels of accuracy: 86% for the CNN
architecture and 88% for the LSTM architecture.
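
The coupling step is not spelled out in code in this card, so here is a minimal sketch of the idea, assuming a gensim Word2Vec model and a Keras LSTM head; the toy corpus, dimensions, and layer sizes are placeholders rather than the exact MCTI configuration.

```python
# Minimal sketch (not the exact MCTI code): couple pre-trained Word2Vec embeddings
# to a Keras LSTM classifier via a frozen embedding layer (transfer learning).
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import initializers, layers, models

# Stand-in for the Word2Vec model pre-trained on larger, similar corpora.
corpus = [["funding", "opportunity", "for", "research"],
          ["call", "for", "proposals", "in", "science"]]
w2v = Word2Vec(sentences=corpus, vector_size=100, min_count=1)

vocab_size = len(w2v.wv) + 1          # index 0 reserved for padding
embedding_dim = w2v.wv.vector_size
max_len = 512                         # assumed maximum sequence length

# Copy the learned vectors into an embedding matrix aligned with the token indices.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for i, word in enumerate(w2v.wv.index_to_key, start=1):
    embedding_matrix[i] = w2v.wv[word]

# The embedding layer is initialized from Word2Vec and kept frozen, while the
# LSTM head on top is trained on the labeled MCTI data in a supervised manner.
model = models.Sequential([
    layers.Input(shape=(max_len,), dtype="int32"),
    layers.Embedding(vocab_size, embedding_dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=False),
    layers.LSTM(128),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=10)   # labeled MCTI data, prepared elsewhere
```

The CNN variant follows the same pattern, only swapping the LSTM layer for convolution and pooling layers.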

Table 1: Results from Pre-trained WE + ML models.

### Transformer-based implementation

Another way we used pre-trained vector representations was by use of a Longformer (Beltagy et al., 2020). We chose it because
of the limitation of the first generation of transformers and BERT-based architectures involving the size of the sentences:
a maximum of 512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the
input sequence length, O(n²) (Beltagy et al., 2020). The Longformer allowed the processing of sequences of a thousand characters
without facing the memory bottleneck of BERT-like architectures and achieved SOTA results in several benchmarks.

For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences
would have to be truncated and would probably miss some critical information. By comparison, with the Longformer, with a maximum
length of 4096, only eight sentences will have their information shortened.

To apply the Longformer, we used the pre-trained base (available on the link) that was previously trained with a combination
of vast datasets as input to the model, as shown in Figure 5 under Longformer model training. After coupling it to our classification
models, we performed supervised training of the whole model. At this point, only transfer learning was applied, since more
computational power would be needed to fine-tune the weights. The results with related metrics can be viewed in Table 2.
This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
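
The card does not name the exact checkpoint, so the sketch below assumes the publicly available `allenai/longformer-base-4096` base and shows how a long opportunity text can be encoded up to 4096 tokens before a classification head is attached; the names and the pooling choice are illustrative.

```python
# Illustrative sketch: encode a long text with a pre-trained Longformer base.
# The checkpoint name and the mean-pooling step are assumptions, not the exact MCTI pipeline.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

text = "Call for proposals: international R&D funding opportunity ..."  # long, unstructured text
inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One simple fixed-size representation of the whole document, on which a downstream
# NN/DNN/CNN/LSTM classifier can then be trained (transfer learning only, no fine-tuning).
doc_embedding = outputs.last_hidden_state.mean(dim=1)
print(doc_embedding.shape)  # torch.Size([1, 768])
```

Mean-pooling the last hidden state is just one way to obtain a document representation; the results in Table 2 come from the coupled classification heads described above.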

Table 2: Results from Pre-trained Longformer + ML models.

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8754   | 0.7950    | 0.9773 |
| DNN      | 0.8462   | 0.8776   | 0.8474    | 0.9123 |
| CNN      | 0.8462   | 0.8776   | 0.8474    | 0.9123 |
| LSTM     | 0.8269   | 0.8801   | 0.8571    | 0.9091 |

## Checkpoints
- Examples
- Implementation Notes
- Usage Example
- >>> ...

## Config

## Tokenizer

## Training data

## Training procedure

## Preprocessing

## Pretraining

## Evaluation results

## Benchmarks

### BibTeX entry and citation info