Update README.md
README.md
CHANGED
# MCTI Text Classification Task (uncased) DRAFT

Disclaimer:
## According to the abstract

Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification and other NLP tasks, but they can only take a limited amount of symbols as input. This is a real case study that explores different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method with good performance. The collected data includes texts of financing opportunities the international R&D funding organizations provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence. Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as a successful case of artificial intelligence in a federal government application.
This model focuses on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of the Union budget, supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was

[...]
strategy under the assumption that the small amount of data available for training was insufficient for adequate embedding training. In this context, we considered two approaches:

i) pre-training word embeddings using similar datasets for text classification (see the sketch below);
ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
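As a rough illustration of approach i), the sketch below pre-trains Word2Vec embeddings with gensim on larger, similar corpora (e.g. the full/unlabeled MCTI texts and the BBC News articles). The file name and hyperparameters are assumptions for illustration, not the exact setup used here.

```python
# Illustrative sketch only -- corpus file name and hyperparameters are assumed.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# One document per line, drawn from the similar datasets (full MCTI, BBC News, ...).
with open("similar_corpora.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

w2v = Word2Vec(
    sentences=sentences,
    vector_size=300,   # assumed embedding dimension
    window=5,
    min_count=2,
    workers=4,
    epochs=10,
)
w2v.save("word2vec_mcti.model")  # reused later when coupling to the classification models
```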
XXXX was originally released in base and large variations, for cased and uncased input text. The uncased models

[...]
| [`mcti-large-cased`]         | 110M | Chinese  |
| [`-base-multilingual-cased`] | 110M | Multiple |

| Dataset            | Compatibility to base* |
|--------------------|------------------------|
| Labeled MCTI       | 100%                   |
| Full MCTI          | 100%                   |
| BBC News Articles  | 56.77%                 |
| New unlabeled MCTI | 75.26%                 |
## Intended uses

[...]

## Evaluation results

### Model training with Word2Vec embeddings

Now we have a pre-trained model of Word2Vec embeddings that has already learned relevant meanings for our classification problem. We can couple it to our classification models (Fig. 4), realizing transfer learning, and then train the model with the labeled data in a supervised manner. The new coupled model can be seen in Figure 5 under Word2Vec model training. Table 1 below shows the obtained results with related metrics. With this implementation, we achieved new levels of accuracy: 86% for the CNN architecture and 88% for the LSTM architecture.
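A minimal sketch of this coupling is shown below, assuming the hypothetical `word2vec_mcti.model` file from the pre-training step and binary labels (eligible vs. not eligible); it is not the authors' exact implementation.

```python
# Illustrative sketch only -- model file, dimensions, and label scheme are assumed.
import torch
from gensim.models import Word2Vec

w2v = Word2Vec.load("word2vec_mcti.model")                        # pre-trained embeddings
embedding_matrix = torch.tensor(w2v.wv.vectors, dtype=torch.float)

class LSTMClassifier(torch.nn.Module):
    def __init__(self, embeddings, hidden_size=128):
        super().__init__()
        # Transfer learning: initialize from Word2Vec and keep the embeddings frozen.
        self.embedding = torch.nn.Embedding.from_pretrained(embeddings, freeze=True)
        self.lstm = torch.nn.LSTM(embeddings.size(1), hidden_size, batch_first=True)
        self.head = torch.nn.Linear(hidden_size, 1)               # binary output

    def forward(self, token_ids):                                 # indices into w2v.wv.key_to_index
        _, (hidden, _) = self.lstm(self.embedding(token_ids))
        return self.head(hidden[-1])

model = LSTMClassifier(embedding_matrix)
# Train with torch.nn.BCEWithLogitsLoss() on the labeled MCTI data.
```

A CNN variant would simply swap the LSTM for 1D convolutions over the same frozen embeddings.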

When fine-tuned on downstream tasks, this model achieves the following results:

Glue test results:

| Task | MNLI-(m/mm) | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
|      | 84.6/83.4   | 71.2 | 90.5 | 93.5  | 52.1 | 85.8  | 88.9 | 66.4 | 79.6    |

Table 1: Results from Pre-trained WE + ML models.

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
| DNN      | 0.7115   | 0.7794   | 0.7255    | 0.8485 |
| CNN      | 0.8654   | 0.9083   | 0.8486    | 0.9773 |
| LSTM     | 0.8846   | 0.9139   | 0.9056    | 0.9318 |

### Transformer-based implementation

Another way we used pre-trained vector representations was by using a Longformer (Beltagy et al., 2020). We chose it because of the limitation of the first generation of transformers and BERT-based architectures involving the size of the sentences: a maximum of 512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the input sequence length, O(n²) (Beltagy et al., 2020). The Longformer allowed the processing of sequences of thousands of characters without facing the memory bottleneck of BERT-like architectures and achieved SOTA in several benchmarks. For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences would have to be truncated and would probably miss some critical information. By comparison, with the Longformer and a maximum length of 4096, only eight sentences would have their information shortened.

To apply the Longformer, we used the pre-trained base (available on the link) that was previously trained with a combination of vast datasets as input to the model, as shown in Figure 5 under Longformer model training. After coupling it to our classification models, we performed supervised training of the whole model. At this point, only transfer learning was applied, since more computational power would be needed to fine-tune the weights. The results with related metrics can be viewed in Table 2 below. This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
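The sketch below illustrates this transfer-learning setup, assuming the publicly available `allenai/longformer-base-4096` checkpoint as the frozen encoder and a small trainable head on top; it is an assumed reconstruction, not the exact code used.

```python
# Illustrative sketch only -- checkpoint choice and head size are assumptions.
import torch
from transformers import AutoTokenizer, LongformerModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = LongformerModel.from_pretrained("allenai/longformer-base-4096")
encoder.eval()                                     # transfer learning only: encoder stays frozen

head = torch.nn.Sequential(                        # small trainable classification head
    torch.nn.Linear(encoder.config.hidden_size, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

def embed(texts):
    """Pool each document (up to 4096 tokens) into a single Longformer feature vector."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():                          # no gradients through the frozen encoder
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]             # representation of the leading <s> token

logits = head(embed(["Call for proposals: international R&D funding opportunity ..."]))
```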

Table 2: Results from Pre-trained Longformer + ML models.

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8754   | 0.7950    | 0.9773 |
| DNN      | 0.8462   | 0.8776   | 0.8474    | 0.9123 |
| CNN      | 0.8462   | 0.8776   | 0.8474    | 0.9123 |
| LSTM     | 0.8269   | 0.8801   | 0.8571    | 0.9091 |


## < Checkpoints >
- Examples
- Implementation Notes
- Usage Example (see the sketch below)
- >>>
- >>> ...
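Until the checkpoints above are documented, the snippet below is only a hypothetical usage sketch: `<mcti-checkpoint>` is a placeholder for the released model id, and the calls follow the standard Hugging Face Transformers API.

```python
# Hypothetical usage sketch -- "<mcti-checkpoint>" is a placeholder, not a released model id.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("<mcti-checkpoint>")
model = AutoModel.from_pretrained("<mcti-checkpoint>")

text = "Call for proposals: international R&D funding opportunity ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```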

## < Config >

## < Tokenizer >

## < Training data >

## < Training procedure >

## < Preprocessing >

## < Pretraining >

## < Evaluation results >

## < Benchmarks >

### BibTeX entry and citation info

```bibtex
@conference{webist22,
  author = {Carlos Rocha and Marcos Dib and Li Weigang and Andrea Nunes and Allan Faria and Daniel Cajueiro and Maísa {Kely de Melo} and Victor Celestino},
  title = {Using Transfer Learning To Classify Long Unstructured Texts with Small Amounts of Labeled Data},
  booktitle = {Proceedings of the 18th International Conference on Web Information Systems and Technologies - WEBIST},
  year = {2022},
  pages = {201-213},
  publisher = {SciTePress},
  organization = {INSTICC},
  doi = {10.5220/0011527700003318},
  isbn = {978-989-758-613-2},
  issn = {2184-3252},
}
```

<a href="https://huggingface.co/exbert/?model=bert-base-uncased">