MarcosDib committed on
Commit fe3713e · 1 Parent(s): 6996277

Update README.md

Files changed (1)
  1. README.md +53 -72
README.md CHANGED
@@ -13,18 +13,26 @@ thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_mode
  # MCTI Text Classification Task (uncased) DRAFT

  Disclaimer:
- sentences in the original corpus, and in the other cases, it's another random sentence in
- ## According to the abstract,
-
- With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in

- Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations require high-quality, voluminous, labeled data. Pre- trained models on large corpora have shown beneficial for text classification and other NLP tasks, but they can only take a limited amount of symbols as input. This is a real case study that explores different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method with good performance. The collected data includes texts of financing opportunities the international R&D funding organizations provided on theirwebsites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence. Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as asuccessful case of artificial intelligence in a federal government application.

- This model focus on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of
- the Union budget, supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was
- introduced in ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318) and first released in
- [this repository](https://huggingface.co/unb-lamfo-nlp-mcti). This model is uncased: it does not make a difference
- between english and English.

  ## Model description
 
@@ -241,23 +249,12 @@ learning rate warmup for 10,000 steps and linear decay of the learning rate afte

  ### Model training with Word2Vec embeddings

- Now we have a pre-trained model of word2vec embeddings that has already learned relevant meaningsfor our classification problem. We can couple it to
- our classification models (Fig. 4), realizing transferlearning
- and then training the model with the labeled
- data in a supervised manner. The new coupled model
- can be seen in Figure 5 under word2vec model training.
- The Table 3 shows the obtained results with related
- metrics. With this implementation, we achieved
- new levels of accuracy with 86% for the CNN architecture
- and 88% for the LSTM architecture.

- When fine-tuned on downstream tasks, this model achieves the following results:
-
- Glue test results:
-
- | Task | MNLI-(m/mm) | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Average |
- |:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
- |      | 84.6/83.4   | 71.2 | 90.5 | 93.5  | 52.1 | 85.8  | 88.9 | 66.4 | 79.6    |

  Table 1: Results from Pre-trained WE + ML models.
 
@@ -270,49 +267,34 @@ Table 1: Results from Pre-trained WE + ML models.

  ### Transformer-based implementation

- Another way we used pre-trained vector representations
- was by use of a Longformer (Beltagy et al.,
- 2020). We chose it because of the limitation of the
- first generation of transformers and BERT-based architectures
- involving the size of the sentences: the
- maximum of 512 tokens. The reason behind that
- limitation is that the self-attention mechanism scale
- quadratically with the input sequence length O(n2)
- (Beltagy et al., 2020). The Longformer allowed the
- processing sequences of a thousand characters without
- facing the memory bottleneck of BERT-like architectures
- and achieved SOTA in several benchmarks.
- For our text length distribution in Figure 3, if
- we used a Bert-based architecture with a maximum
- length of 512, 99 sentences would have to be truncated
- and probably miss some critical information.
- By comparison, with the Longformer, with a maximum
- length of 4096, only eight sentences will have
- their information shortened.
- To apply the Longformer, we used the pre-trained
- base (available on the link) that was previously trained
- with a combination of vast datasets as input to the
- model, as shown in figure 5 under Longformer model
- training. After coupling to our classification models,
- we realized supervised training of the whole model.
- At this point, only transfer learning was applied since
- more computational power was needed to realize the
- fine-tuning of the weights. The results with related
- metrics can be viewed in table 4. This approach
- achieved adequate accuracy scores, above 82% in all
- implementation architectures.


  Table 2: Results from Pre-trained Longformer + ML models.

- ML Model Accuracy F1 Score Precision Recall
- NN 0.8269 0.8754 0.7950 0.9773
- DNN 0.8462 0.8776 0.8474 0.9123
- CNN 0.8462 0.8776 0.8474 0.9123
- LSTM 0.8269 0.8801 0.8571 0.9091


- ##< Checkpoints >
  - Examples
  - Implementation Notes
  - Usage Example
@@ -320,21 +302,20 @@ LSTM 0.8269 0.8801 0.8571 0.9091
  - >>> ...


- ##< Config >
-
- ##< Tokenizer >

- ##< Training data >

- ##< Training procedure >

- ##< Preprocessing >

- ##< Pretraining >

- ##< Evaluation results >

- ##< Benchmarks >


  ### BibTeX entry and citation info
 
 
  # MCTI Text Classification Task (uncased) DRAFT

  Disclaimer:

+ ## According to the abstract

+ Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations
+ require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification
+ and other NLP tasks, but they can only take a limited amount of symbols as input. This is a real case study that explores
+ different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method
+ with good performance. The collected data includes texts of financing opportunities that the international R&D funding organizations
+ provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by
+ the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship
+ of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the
+ available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence.
+ Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a
+ Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as
+ a successful case of artificial intelligence in a federal government application.
+
+ This model focuses on a more specific problem: creating a Research Financing Products Portfolio (FPP) outside of the Union budget,
+ supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was introduced in ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318) and first released in
+ [this repository](https://huggingface.co/unb-lamfo-nlp-mcti). This model is uncased: it does not make a difference between english
+ and English.

  ## Model description
 
 
  ### Model training with Word2Vec embeddings

+ Now we have a pre-trained model of word2vec embeddings that has already learned relevant meanings for our classification problem.
+ We can couple it to our classification models (Fig. 4), realizing transfer learning, and then train the model with the labeled
+ data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training. Table 1 shows the
+ results obtained with related metrics. With this implementation, we achieved new levels of accuracy: 86% for the CNN
+ architecture and 88% for the LSTM architecture.
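A minimal sketch of this coupling, assuming a gensim `KeyedVectors` file named `word2vec.kv`, padded integer sequences `X_train`, binary labels `y_train`, and illustrative layer sizes (none of these names or values come from the original project), might look like this in Keras:

```python
# Sketch: couple pre-trained Word2Vec vectors to an LSTM classifier (TensorFlow/Keras).
# Assumed inputs: a gensim KeyedVectors file "word2vec.kv", padded integer sequences
# X_train, and binary labels y_train (eligible vs. not eligible).
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras import layers, models

kv = KeyedVectors.load("word2vec.kv")
embedding_dim = kv.vector_size

# Build the embedding matrix; row 0 is reserved for the padding index.
embedding_matrix = np.zeros((len(kv.index_to_key) + 1, embedding_dim))
for i, word in enumerate(kv.index_to_key, start=1):
    embedding_matrix[i] = kv[word]

model = models.Sequential([
    layers.Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=embedding_dim,
        weights=[embedding_matrix],
        trainable=False,          # transfer learning: keep the pre-trained vectors frozen
        mask_zero=True,
    ),
    layers.LSTM(128),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_split=0.2, epochs=10)
```

Freezing the embedding layer is what realizes the transfer-learning step; a CNN head can be swapped in for the LSTM in the same way.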
  Table 1: Results from Pre-trained WE + ML models.
 
 
  ### Transformer-based implementation

+ Another way we used pre-trained vector representations was through a Longformer (Beltagy et al., 2020). We chose it because of
+ the limitation of the first generation of transformers and BERT-based architectures regarding the size of the sentences: a
+ maximum of 512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the
+ input sequence length, O(n²) (Beltagy et al., 2020). The Longformer allowed the processing of sequences of thousands of characters
+ without facing the memory bottleneck of BERT-like architectures, and it achieved SOTA results in several benchmarks.
+
+ For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences
+ would have to be truncated and would probably lose some critical information. By comparison, with the Longformer and a maximum
+ length of 4096, only eight sentences would have their information shortened.
+
+ To apply the Longformer, we used the pre-trained base model (available at the link), which was previously trained on a combination
+ of vast datasets, as input to the model, as shown in Figure 5 under Longformer model training. After coupling it to our
+ classification models, we performed supervised training of the whole model. At this point, only transfer learning was applied,
+ since more computational power would be needed to fine-tune the Longformer weights. The results with related metrics can be
+ viewed in Table 2. This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
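The transfer-learning setup described above can be sketched as follows; the checkpoint name `allenai/longformer-base-4096`, the head architecture, and the `score` helper are illustrative assumptions, not the project's released code:

```python
# Sketch: frozen Longformer as a feature extractor with a small trainable head (PyTorch).
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

checkpoint = "allenai/longformer-base-4096"   # assumed checkpoint; the card does not name one
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

# Transfer learning only: freeze the encoder and train just a small classification head.
for param in encoder.parameters():
    param.requires_grad = False

head = nn.Sequential(
    nn.Linear(encoder.config.hidden_size, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

def score(texts):
    """Return the probability that each text is an eligible funding opportunity."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():                      # no gradients through the frozen encoder
        hidden = encoder(**batch).last_hidden_state
    features = hidden[:, 0, :]                 # first (<s>) token embedding as the document feature
    return torch.sigmoid(head(features))

print(score(["Call for proposals: international R&D cooperation grants."]))
```

Because the encoder stays frozen, only the small head is trained, which matches the note above that full fine-tuning of the Longformer weights was not performed.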
  Table 2: Results from Pre-trained Longformer + ML models.

+ | ML Model | Accuracy | F1 Score | Precision | Recall |
+ |:--------:|:--------:|:--------:|:---------:|:------:|
+ | NN       | 0.8269   | 0.8754   | 0.7950    | 0.9773 |
+ | DNN      | 0.8462   | 0.8776   | 0.8474    | 0.9123 |
+ | CNN      | 0.8462   | 0.8776   | 0.8474    | 0.9123 |
+ | LSTM     | 0.8269   | 0.8801   | 0.8571    | 0.9091 |
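For reference, metrics like the ones in Tables 1 and 2 can be computed from a model's predictions with scikit-learn; the `y_true` and `y_pred` arrays below are placeholders rather than the project's actual outputs:

```python
# Sketch: compute the reported metrics with scikit-learn.
# y_true / y_pred are placeholder labels, not the project's predictions.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 1, 1, 0, 0]

print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
```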

+ ## Checkpoints
  - Examples
  - Implementation Notes
  - Usage Example
  - >>> ...


+ ## Config

+ ## Tokenizer

+ ## Training data

+ ## Training procedure

+ ## Preprocessing

+ ## Pretraining

+ ## Evaluation results

+ ## Benchmarks


  ### BibTeX entry and citation info