unb-lamfo-nlp-mcti/NLP-Classification-MCTI

English

Clsssification

science

MarcosDib committed on Dec 6, 2022

Commit 6996277 (1 parent: 2165371)

Update README.md

Files changed (1)

README.md +1 -8

@@ -18,14 +18,7 @@ sentences in the original corpus, and in the other cases, it's another random se
 With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
-Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations require high-quality, voluminous, labeled data. Pre- trained models on large corpora have shown beneficial for text classification and other NLP tasks, but they can only take a limited amount of symbols as input. This is a real case
-study that explores different machine learning strategies to classify a small amount of  long, unstructured, and uneven data to
-find a proper method with good performance. The
-collected data includes texts of financing opportunities the international R&D funding  organizations provided on theirwebsites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by the Ministry of Science, Technology and
-Innovation. We use pre-training and word embedding solutions to learn the relationship of the words from other datasets with considerable similarity and larger scale. Then, using
-the acquired features, based on the available dataset from MCTI, we apply transfer learning  plus deep learning models to improve the comprehension of each sentence. Compared to the
-baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate  achieved through a Transformer-based approach, the Word2Vec-based approach improved the
-accuracy rate to 88%. The research results serve as asuccessful case of artificial  intelligence in a federal government application.
+Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations require high-quality, voluminous, labeled data. Pre- trained models on large corpora have shown beneficial for text classification and other NLP tasks, but they can only take a limited amount of symbols as input. This is a real case study that explores different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method with good performance. The collected data includes texts of financing opportunities the international R&D funding organizations provided on theirwebsites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence. Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as asuccessful case of artificial intelligence in a federal government application.
 This model focus on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of
 the Union budget, supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was