Update app.py
app.py
CHANGED
@@ -518,37 +518,83 @@ def predict_tags(test_sentence):
|
|
518 |
predict_tags(test_sentence)
|
519 |
|
520 |
def get_readme():
|
521 |
-
return
|
522 |
-
"""
|
523 |
|
524 |
-
|
525 |
|
526 |
-
|
527 |
|
528 |
-
|
529 |
|
530 |
<!-- Provide a longer summary of what this model is. -->
|
|
|
531 |
|
532 |
-
- **Developed by:** syke9p3, mnemoria, xenoxia, riakm
|
533 |
-
- **Shared by:** syke9p3
|
534 |
- **Model type:** BERT Tagalog Base Uncased
|
535 |
- **Languages (NLP):** Tagalog, Filipino
|
536 |
-
- **
|
537 |
-
|
538 |
|
539 |
-
##
|
540 |
-
[syke9p3
|
541 |
|
542 |
-
|
543 |
|
544 |
| Part of Speech | Tags |
|
545 |
|-----------------------------------------------|------|
|
546 |
-
| **Noun**
|
547 |
| Common Noun | NNC |
|
548 |
| Proper Noun | NNP |
|
549 |
| Proper Noun Abbreviation | NNPA |
|
550 |
| Common Noun Abbreviation | NNCA |
|
551 |
-
| **Pronoun** | PR
|
552 |
| as Subject (Palagyo)/Personal Pronouns Singular | PRS |
|
553 |
| Personal Pronouns | PRP |
|
554 |
| Possessive Subject (Paari) | PRSP |
|
@@ -559,15 +605,15 @@ def get_readme():
|
|
559 |
| Comparison (Panulad) | PRC |
|
560 |
| Found (Pahimaton) | PRF |
|
561 |
| Indefinite | PRI |
|
562 |
-
| **Determiner**
|
563 |
| Determiner (Pantukoy) for Common Noun Plural | DTC |
|
564 |
| Determiner (Pantukoy) for Proper Noun | DTP |
|
565 |
| Determiner (Pantukoy) for Proper Noun Plural | DTPP |
|
566 |
-
|
|
567 |
-
|
|
568 |
| Ligatures (Pang-angkop) | CCP |
|
569 |
| Preposition (Pang-ukol) | CCU |
|
570 |
-
| **Verb (Pandiwa)** | VB
|
571 |
| Neutral/Infinitive | VBW |
|
572 |
| Auxiliary, Modal/Pseudo-verbs | VBS |
|
573 |
| Existential | VBH |
|
@@ -582,14 +628,14 @@ def get_readme():
|
|
582 |
| Locative Focus | VBOL |
|
583 |
| Instrumental Focus | VBOI |
|
584 |
| Referential/Measurement Focus | VBRF |
|
585 |
-
| **Adjective (
|
586 |
| Describing (Panlarawan) | JJD |
|
587 |
| Used for Comparison (same level) (Pahambing Magkatulad) | JJC |
|
588 |
| Comparison Comparative (more) (Palamang) | JJCC |
|
589 |
| Comparison Superlative (most) (Pasukdol) | JJCS |
|
590 |
| Comparison Negation (not quite) (Di-Magkatulad) | JJCN |
|
591 |
| Describing Number (Pamilang) | JJN |
|
592 |
-
| **Adverb (Pang-Abay)** | RB
|
593 |
| Describing “How” (Pamaraan) | RBD |
|
594 |
| Number (Panggaano/Panukat) | RBN |
|
595 |
| Conditional (Kondisyunal) | RBK |
|
@@ -605,18 +651,22 @@ def get_readme():
|
|
605 |
| Enclitics (Paningit) | RBI |
|
606 |
| Interjections (Sambitla) | RBJ |
|
607 |
| Social Formula (Pormularyong Panlipunan) | RBS |
|
608 |
-
|**Cardinal Number (Bilang)** | CD
|
609 |
| Digit, Rank, Count | CDB |
|
610 |
-
| **Topicless (Walang Paksa)** | TS
|
611 |
-
| Foreign Words
|
612 |
-
| **Punctuation (Pananda)** | PM
|
613 |
| Period | PMP |
|
614 |
| Exclamation Point | PME |
|
615 |
| Question Mark | PMQ |
|
616 |
| Comma | PMC |
|
617 |
| Semi-colon | PMSC |
|
618 |
-
|
|
619 |
|
620 |
"""
|
621 |
|
622 |
|
|
|
518 |
predict_tags(test_sentence)
|
519 |
|
520 |
def get_readme():
|
521 |
+
return """
|
|
|
522 |
|
523 |
+
----
|
524 |
|
525 |
+
<!-- ---- -->
|
526 |
|
527 |
+
# BERT Tagalog Part of Speech Tagger (BERTTPOST)
|
528 |
+
|
529 |
+
## 📋 Model Details
|
530 |
|
531 |
<!-- Provide a longer summary of what this model is. -->
|
532 |
+
### Model Description
|
533 |
|
|
|
|
|
534 |
- **Model type:** BERT Tagalog Base Uncased
|
535 |
- **Languages (NLP):** Tagalog, Filipino
|
536 |
+
- **Finetuned from model:** [GKLMIP/bert-tagalog-base-uncased](https://huggingface.co/GKLMIP/bert-tagalog-base-uncased)
|
537 |
+
|
538 |
+
### Dataset
|
539 |
+
|
540 |
+
1000 annotated sentences from Sagum et al.'s Tagalog Corpora, tagged according to the MGNN tagset convention.
|
541 |
+
|
542 |
+
| Dataset | Number of Sentences | Percentage |
|
543 |
+
|----------------|-----------------|------------|
|
544 |
+
| Training Set | 800 | 80% |
|
545 |
+
| Testing Set | 200 | 20% |
|
546 |
+
|
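The 80/20 split in the table above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' script: the function name `split_dataset`, the shuffling step, and the fixed seed are assumptions, since the model card only states the resulting set sizes.

```python
import random


def split_dataset(sentences, train_fraction=0.8, seed=0):
    """Shuffle a copy of the data and cut it into train/test parts.

    Shuffling and the seed value are assumptions for reproducibility;
    the model card does not say how the split was drawn.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder data standing in for the 1000 annotated sentences.
sentences = [f"sentence-{i}" for i in range(1000)]
train_set, test_set = split_dataset(sentences)
print(len(train_set), len(test_set))
```

With 1000 input sentences this yields 800 training and 200 testing sentences, matching the table.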
547 |
+
### Preprocessing
|
548 |
+
The corpus contains tagged sentences in the Tagalog language, with each word annotated with its POS tag in the format ```<TAG word>```. To prepare the corpus for training, the following preprocessing steps were performed:
|
549 |
+
1. **Removal of Line Identifier**: the line identifier, such as ```SNT.108970.2066```, was removed from each tagged sentence.
|
550 |
+
2. **Symbol Conversion**: for the BERT model, certain special symbols like hyphens, quotes, commas, etc., were converted into special tokens (```PMP```, ```PMS```, ```PMC```) to preserve their meaning during tokenization.
|
551 |
+
3. **Alignment of Tokenization**: the BERT tokenized words and their corresponding POS tags were aligned to ensure that the tokenization and tagging are consistent.
|
552 |
+
|
553 |
+
|
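Steps 1 and 3 above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used for the model: it assumes the line identifier sits at the start of each line in the `SNT.<digits>.<digits>` shape of the example, the sample sentence is made up, and `toy_subword_tokenize` is a hypothetical stand-in for the real BERT WordPiece tokenizer. Symbol conversion (step 2) is not covered.

```python
import re

# Assumed identifier shape, modelled on the example "SNT.108970.2066".
LINE_ID = re.compile(r"^\s*SNT\.\d+\.\d+\s*")
# One "<TAG word>" pair: the tag has no spaces, the word runs up to ">".
PAIR = re.compile(r"<(\S+)\s+([^>]+)>")


def strip_line_id(line):
    """Step 1: drop the leading line identifier, if present."""
    return LINE_ID.sub("", line)


def parse_pairs(line):
    """Turn '<TAG word>' groups into (word, tag) tuples."""
    return [(word.strip(), tag) for tag, word in PAIR.findall(line)]


def toy_subword_tokenize(word, size=4):
    """Hypothetical tokenizer: fixed-size chunks with '##' continuations."""
    if not word:
        return []
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align(pairs, tokenize=toy_subword_tokenize):
    """Step 3: repeat each word's tag for every subword it produces."""
    tokens, tags = [], []
    for word, tag in pairs:
        pieces = tokenize(word)
        tokens.extend(pieces)
        tags.extend([tag] * len(pieces))
    return tokens, tags


# Made-up sample line in the documented format.
example = "SNT.108970.2066 <NNC bata> <VBW maglaro> <PMP .>"
pairs = parse_pairs(strip_line_id(example))
tokens, tags = align(pairs)
print(list(zip(tokens, tags)))
```

Copying the word's tag onto every subword is one common alignment choice. Another is to label only the first subword and mask the rest during loss computation. The model card does not say which was used.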
554 |
+
### Training
|
555 |
+
|
556 |
+
This model was trained using the PyTorch library with the following hyperparameters:
|
557 |
+
|
558 |
+
| **Hyperparameter** | **Value** |
|
559 |
+
|--------------------|-----------|
|
560 |
+
| Batch Size | 8 |
|
561 |
+
| Training Epochs | 5 |
|
562 |
+
| Learning Rate | 2e-5 |
|
563 |
+
| Optimizer | Adam |
|
564 |
+
|
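To make the scale of training concrete, the hyperparameters above can be collected in a small config object and combined with the 800-sentence training set. This is a rough sketch: `TrainConfig` and `total_steps` are hypothetical names, and the step count assumes one optimizer update per batch with no gradient accumulation, which the model card does not confirm.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from the hyperparameter table."""
    batch_size: int = 8
    epochs: int = 5
    learning_rate: float = 2e-5
    optimizer: str = "Adam"


def total_steps(num_examples, cfg):
    """Optimizer updates over the whole run, keeping any partial final batch."""
    steps_per_epoch = math.ceil(num_examples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = TrainConfig()
steps = total_steps(800, cfg)
print(steps)
```

Under these assumptions, 800 sentences at batch size 8 give 100 updates per epoch, or 500 updates over 5 epochs.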
565 |
|
566 |
+
## 👥 Developed by
|
567 |
+
- Saya-ang, Kenth G. ([@syke9p3](https://github.com/syke9p3))
|
568 |
+
- Gozum, Denise Julianne S. ([@Xenoxianne](https://github.com/Xenoxianne))
|
569 |
+
- Hamor, Mary Grizelle D. ([@mnemoria](https://github.com/mnemoria))
|
570 |
+
- Mabansag, Ria Karen B. ([@riavx](https://github.com/riavx))
|
571 |
|
572 |
+
## ⚒️ Languages and Technologies
|
573 |
+
|
574 |
+
[](https://jupyter.org/)
|
575 |
+
|
576 |
+
[](https://www.python.org/)
|
577 |
+
|
578 |
+
[](https://pytorch.org/)
|
579 |
+
|
580 |
+
[](https://jupyter.org/)
|
581 |
+
|
582 |
+
[](https://pytorch.org/)
|
583 |
+
|
584 |
+
|
585 |
+
|
586 |
+
|
587 |
+
|
588 |
+
## 🏷️ Tags
|
589 |
|
590 |
| Part of Speech | Tags |
|
591 |
|-----------------------------------------------|------|
|
592 |
+
| **Noun** |  |
|
593 |
| Common Noun | NNC |
|
594 |
| Proper Noun | NNP |
|
595 |
| Proper Noun Abbreviation | NNPA |
|
596 |
| Common Noun Abbreviation | NNCA |
|
597 |
+
| **Pronoun** |  |
|
598 |
| as Subject (Palagyo)/Personal Pronouns Singular | PRS |
|
599 |
| Personal Pronouns | PRP |
|
600 |
| Possessive Subject (Paari) | PRSP |
|
|
|
605 |
| Comparison (Panulad) | PRC |
|
606 |
| Found (Pahimaton) | PRF |
|
607 |
| Indefinite | PRI |
|
608 |
+
| **Determiner** |  |
|
609 |
| Determiner (Pantukoy) for Common Noun Plural | DTC |
|
610 |
| Determiner (Pantukoy) for Proper Noun | DTP |
|
611 |
| Determiner (Pantukoy) for Proper Noun Plural | DTPP |
|
612 |
+
| **Conjunctions (Pang-ugnay)** |  |
|
613 |
+
| **Lexical Marker** |  |
|
614 |
| Ligatures (Pang-angkop) | CCP |
|
615 |
| Preposition (Pang-ukol) | CCU |
|
616 |
+
| **Verb (Pandiwa)** |  |
|
617 |
| Neutral/Infinitive | VBW |
|
618 |
| Auxiliary, Modal/Pseudo-verbs | VBS |
|
619 |
| Existential | VBH |
|
|
|
628 |
| Locative Focus | VBOL |
|
629 |
| Instrumental Focus | VBOI |
|
630 |
| Referential/Measurement Focus | VBRF |
|
631 |
+
| **Adjective** |  |
|
632 |
| Describing (Panlarawan) | JJD |
|
633 |
| Used for Comparison (same level) (Pahambing Magkatulad) | JJC |
|
634 |
| Comparison Comparative (more) (Palamang) | JJCC |
|
635 |
| Comparison Superlative (most) (Pasukdol) | JJCS |
|
636 |
| Comparison Negation (not quite) (Di-Magkatulad) | JJCN |
|
637 |
| Describing Number (Pamilang) | JJN |
|
638 |
+
| **Adverb (Pang-Abay)** |  |
|
639 |
| Describing “How” (Pamaraan) | RBD |
|
640 |
| Number (Panggaano/Panukat) | RBN |
|
641 |
| Conditional (Kondisyunal) | RBK |
|
|
|
651 |
| Enclitics (Paningit) | RBI |
|
652 |
| Interjections (Sambitla) | RBJ |
|
653 |
| Social Formula (Pormularyong Panlipunan) | RBS |
|
654 |
+
| **Cardinal Number (Bilang)** |  |
|
655 |
| Digit, Rank, Count | CDB |
|
656 |
+
| **Topicless (Walang Paksa)** |  |
|
657 |
+
| **Foreign Words** |  |
|
658 |
+
| **Punctuation (Pananda)** |  |
|
659 |
| Period | PMP |
|
660 |
| Exclamation Point | PME |
|
661 |
| Question Mark | PMQ |
|
662 |
| Comma | PMC |
|
663 |
| Semi-colon | PMSC |
|
664 |
+
| Other Symbols | PMS |
|
665 |
|
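When reading model output, it can help to map predicted tags back to their descriptions. The lookup below is a small sketch covering only a handful of rows from the table above. The names `TAG_DESCRIPTIONS`, `describe`, and `describe_all` are hypothetical and not part of app.py.

```python
# Subset of the tag table; extend with the remaining rows as needed.
TAG_DESCRIPTIONS = {
    "NNC": "Common Noun",
    "NNP": "Proper Noun",
    "PMP": "Period",
    "PMC": "Comma",
    "PMS": "Other Symbols",
}


def describe(tag):
    """Return the human-readable label for a tag, or a fallback."""
    return TAG_DESCRIPTIONS.get(tag, "unknown tag")


def describe_all(tagged):
    """Attach a description to each (word, tag) pair."""
    return [(word, tag, describe(tag)) for word, tag in tagged]


# Made-up tagged output for illustration.
print(describe_all([("bata", "NNC"), (".", "PMP")]))
```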
666 |
+
## Bias, Risks, and Limitations
|
667 |
+
|
668 |
+
This model has not been fully tested, so please use it with caution.
|
669 |
+
|
670 |
"""
|
671 |
|
672 |
|