Update README.md
README.md
CHANGED
@@ -18,14 +18,7 @@ sentences in the original corpus, and in the other cases, it's another random se
 
 With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
 
-Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification and other NLP tasks, but they can only take a limited number of symbols as input. This is a real case
-study that explores different machine learning strategies to classify a small amount of long, unstructured, and uneven data to
-find a proper method with good performance. The
-collected data includes texts of financing opportunities the international R&D funding organizations provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by the Ministry of Science, Technology and
-Innovation. We use pre-training and word embedding solutions to learn the relationship of the words from other datasets with considerable similarity and larger scale. Then, using
-the acquired features, based on the available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence. Compared to the
-baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a Transformer-based approach, the Word2Vec-based approach improved the
-accuracy rate to 88%. The research results serve as a successful case of artificial intelligence in a federal government application.
+Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification and other NLP tasks, but they can only take a limited number of symbols as input. This is a real case study that explores different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method with good performance. The collected data includes texts of financing opportunities the international R&D funding organizations provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence. Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as a successful case of artificial intelligence in a federal government application.
 
 This model focuses on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of
 the Union budget, supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was