Update README.md
README.md
CHANGED
|
@@ -83,7 +83,7 @@ Other 24 smaller models are released afterward.
|
|
| 83 |
The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti) on Hugging Face.
|
| 84 |
|
| 85 |
| Model | #params | Language |
|
| 86 |
-
|
| 87 |
| [`mcti-base-uncased`] | 110M | English |
|
| 88 |
| [`mcti-large-uncased`] | 340M | English | sub
|
| 89 |
| [`mcti-base-cased`] | 110M | English |
|
|
@@ -91,7 +91,7 @@ The detailed release history can be found on the [here](https://huggingface.co/u
|
|
| 91 |
| [`-base-multilingual-cased`] | 110M | Multiple |
|
| 92 |
|
| 93 |
| Dataset | Compatibility to base* |
|
| 94 |
-
|
| 95 |
| Labeled MCTI | 100% |
|
| 96 |
| Full MCTI | 100% |
|
| 97 |
| BBC News Articles | 56.77% |
|
|
@@ -202,13 +202,13 @@ The following assumptions were considered:
|
|
| 202 |
- Preprocessing experiments compare accuracy in a shallow neural network (SNN);
|
| 203 |
- Pre-processing was investigated for the classification goal.
|
| 204 |
|
| 205 |
-
Using the database obtained in Goal 4, stored in the project's [GitHub](github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](colab.research.google.com)
|
| 206 |
-
to implement the [pre-processing code](github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.
|
| 207 |
|
| 208 |
Several Python packages were used to develop the preprocessing code:
|
| 209 |
|
| 210 |
| Objective | Package |
|
| 211 |
-
|
| 212 |
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
|
| 213 |
| Natural Language Processing | [nltk](https://pypi.org/project/nltk) |
|
| 214 |
| Other data manipulations and calculations included in Python 3.10: io, json, math, re (regular expressions), shutil, time, unicodedata | [numpy](https://pypi.org/project/numpy) |
|
|
@@ -224,7 +224,7 @@ As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip
|
|
| 224 |
bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.
|
| 225 |
|
| 226 |
| Base | Original texts |
|
| 227 |
-
|
| 228 |
| xp1 | Expand contractions |
|
| 229 |
| xp2 | Expand contractions + Convert text to lowercase |
|
| 230 |
| xp3 | Expand contractions + Remove punctuation |
|
|
@@ -233,7 +233,7 @@ bases, derived from the base of goal 4, with the application of the methods show
|
|
| 233 |
| xp6 | xp4 + Lemmatization |
|
| 234 |
| xp7 | xp4 + Stemming + Stop-word removal |
|
| 235 |
| xp8 | xp4 + Lemmatization + Stop-word removal |
|
| 236 |
-
|
| 237 |
|
| 238 |
### Pretraining
|
| 239 |
|
|
|
|
| 83 |
The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti) on Hugging Face.
|
| 84 |
|
| 85 |
| Model | #params | Language |
|
| 86 |
+
|------------------------------|:-------:|:--------:|
|
| 87 |
| [`mcti-base-uncased`] | 110M | English |
|
| 88 |
| [`mcti-large-uncased`] | 340M | English | sub
|
| 89 |
| [`mcti-base-cased`] | 110M | English |
|
|
|
|
| 91 |
| [`-base-multilingual-cased`] | 110M | Multiple |
|
| 92 |
|
| 93 |
| Dataset | Compatibility to base* |
|
| 94 |
+
|--------------------------------------|:----------------------:|
|
| 95 |
| Labeled MCTI | 100% |
|
| 96 |
| Full MCTI | 100% |
|
| 97 |
| BBC News Articles | 56.77% |
|
|
|
|
| 202 |
- Preprocessing experiments compare accuracy in a shallow neural network (SNN);
|
| 203 |
- Pre-processing was investigated for the classification goal.
|
| 204 |
|
| 205 |
+
Using the database obtained in Goal 4, stored in the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](https://colab.research.google.com)
|
| 206 |
+
to implement the [pre-processing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.
|
| 207 |
|
| 208 |
Several Python packages were used to develop the preprocessing code:
|
| 209 |
|
| 210 |
| Objective | Package |
|
| 211 |
+
|--------------------------------------------------------|--------------|
|
| 212 |
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
|
| 213 |
| Natural Language Processing | [nltk](https://pypi.org/project/nltk) |
|
| 214 |
| Other data manipulations and calculations included in Python 3.10: io, json, math, re (regular expressions), shutil, time, unicodedata | [numpy](https://pypi.org/project/numpy) |
|
|
|
|
| 224 |
bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.
|
| 225 |
|
| 226 |
| Base | Original texts |
|
| 227 |
+
|--------|--------------------------------------------------------------|
|
| 228 |
| xp1 | Expand contractions |
|
| 229 |
| xp2 | Expand contractions + Convert text to lowercase |
|
| 230 |
| xp3 | Expand contractions + Remove punctuation |
|
|
|
|
| 233 |
| xp6 | xp4 + Lemmatization |
|
| 234 |
| xp7 | xp4 + Stemming + Stop-word removal |
|
| 235 |
| xp8 | xp4 + Lemmatization + Stop-word removal |
|
| 236 |
+
Table 2 – Pre-processing methods evaluated
|
| 237 |
|
| 238 |
### Pretraining
|
| 239 |
|
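
For readers of this diff, the first three pre-processing variants in the README's table (xp1 to xp3) can be illustrated with a minimal sketch. It is not the project's notebook code: the README lists the `contractions` package for contraction expansion, while this sketch substitutes a tiny hand-written map (`CONTRACTIONS`, hypothetical) so that it runs with the standard library only. The function names `xp1`, `xp2`, and `xp3` simply mirror the table's row labels and are assumptions as well.

```python
import re
import string

# Hypothetical, deliberately tiny contraction map used as a stand-in for
# the `contractions` package named in the README.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

# One alternation pattern over all known contractions, matched case-insensitively.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded (lowercase) form."""
    return _CONTRACTION_RE.sub(
        lambda match: CONTRACTIONS[match.group(0).lower()], text
    )


def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def xp1(text: str) -> str:
    """xp1: expand contractions."""
    return expand_contractions(text)


def xp2(text: str) -> str:
    """xp2: expand contractions, then convert the text to lowercase."""
    return expand_contractions(text).lower()


def xp3(text: str) -> str:
    """xp3: expand contractions, then remove punctuation."""
    return remove_punctuation(expand_contractions(text))


if __name__ == "__main__":
    sample = "I can't stop, it's GREAT!"
    for variant in (xp1, xp2, xp3):
        print(variant.__name__, "->", variant(sample))
```

Contractions are expanded before punctuation is removed because stripping the apostrophe first would turn "can't" into "cant", which the map could no longer match.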