MarcosDib committed on
Commit
6996277
·
1 Parent(s): 2165371

Update README.md

Files changed (1)
  1. README.md +1 -8
README.md CHANGED
@@ -18,14 +18,7 @@ sentences in the original corpus, and in the other cases, it's another random se
 
  With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
 
- Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification and other NLP tasks, but they can only take a limited amount of symbols as input. This is a real case
- study that explores different machine learning strategies to classify a small amount of long, unstructured, and uneven data to
- find a proper method with good performance. The
- collected data includes texts of financing opportunities the international R&D funding organizations provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by the Ministry of Science, Technology and
- Innovation. We use pre-training and word embedding solutions to learn the relationship of the words from other datasets with considerable similarity and larger scale. Then, using
- the acquired features, based on the available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence. Compared to the
- baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a Transformer-based approach, the Word2Vec-based approach improved the
- accuracy rate to 88%. The research results serve as a successful case of artificial intelligence in a federal government application.
 
  This model focuses on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of
  the Union budget, supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was
 
 
  With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
 
+ Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification and other NLP tasks, but they can only take a limited amount of symbols as input. This is a real case study that explores different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method with good performance. The collected data includes texts of financing opportunities the international R&D funding organizations provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence. Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as a successful case of artificial intelligence in a federal government application.
 
  This model focuses on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of
  the Union budget, supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was