syke9p3 committed on
Commit cc0b92d · verified · 1 Parent(s): 0ad0d6d

Update app.py

Files changed (1)
  1. app.py +75 -25
app.py CHANGED
@@ -518,37 +518,83 @@ def predict_tags(test_sentence):
518
  predict_tags(test_sentence)
519
 
520
  def get_readme():
521
- return
522
- """
523
 
524
- This is a BERT Tagalog Base Uncased Part of Speech tagger fine-tuned model of [Jiang et. al.'s pre-trained bert-tagalog-base-uncased model](https://huggingface.co/GKLMIP/bert-tagalog-base-uncased).
525
 
526
- ## Model Details
527
 
528
- ### Model Description
 
 
529
 
530
  <!-- Provide a longer summary of what this model is. -->
 
531
 
532
- - **Developed by:** syke9p3, mnemoria, xenoxia, riakm
533
- - **Shared by:** syke9p3
534
  - **Model type:** BERT Tagalog Base Uncased
535
  - **Languages (NLP):** Tagalog, Filipino
536
- - **Dataset:** Sagum et. al.'s annotated Tagalog Corpora based on MGNN Tagset convention. This model was trained in 800 sentences and evaluated with 200 sentences.
537
- - **Finetuned from model**: [Jiang et. al.'s pre-trained bert-tagalog-base-uncased model](https://huggingface.co/GKLMIP/bert-tagalog-base-uncased)
538
 
539
- ## GitHub Link
540
- [syke9p3/bert-tagalog-pos-tagger](https://github.com/syke9p3/bert-tagalog-pos-tagger)
 
 
 
541
 
542
- ### Tags
543
 
544
  | Part of Speech | Tags |
545
  |-----------------------------------------------|------|
546
- | **Noun** | NNC |
547
  | Common Noun | NNC |
548
  | Proper Noun | NNP |
549
  | Proper Noun Abbreviation | NNPA |
550
  | Common Noun Abbreviation | NNCA |
551
- | **Pronoun** | PR |
552
  | as Subject (Palagyo)/Personal Pronouns Singular | PRS |
553
  | Personal Pronouns | PRP |
554
  | Possessive Subject (Paari) | PRSP |
@@ -559,15 +605,15 @@ def get_readme():
559
  | Comparison (Panulad) | PRC |
560
  | Found (Pahimaton) | PRF |
561
  | Indefinite | PRI |
562
- | **Determiner** | DT |
563
  | Determiner (Pantukoy) for Common Noun Plural | DTC |
564
  | Determiner (Pantukoy) for Proper Noun | DTP |
565
  | Determiner (Pantukoy) for Proper Noun Plural | DTPP |
566
- | Lexical Marker | LM |
567
- | Conjunctions (Pang-ugnay) | CC, CCT, CCR, CCB, CCA |
568
  | Ligatures (Pang-angkop) | CCP |
569
  | Preposition (Pang-ukol) | CCU |
570
- | **Verb (Pandiwa)** | VB |
571
  | Neutral/Infinitive | VBW |
572
  | Auxiliary, Modal/Pseudo-verbs | VBS |
573
  | Existential | VBH |
@@ -582,14 +628,14 @@ def get_readme():
582
  | Locative Focus | VBOL |
583
  | Instrumental Focus | VBOI |
584
  | Referential/Measurement Focus | VBRF |
585
- | **Adjective (Pang-uri)** | JJ |
586
  | Describing (Panlarawan) | JJD |
587
  | Used for Comparison (same level) (Pahambing Magkatulad) | JJC |
588
  | Comparison Comparative (more) (Palamang) | JJCC |
589
  | Comparison Superlative (most) (Pasukdol) | JJCS |
590
  | Comparison Negation (not quite) (Di-Magkatulad) | JJCN |
591
  | Describing Number (Pamilang) | JJN |
592
- | **Adverb (Pang-Abay)** | RB |
593
  | Describing “How” (Pamaraan) | RBD |
594
  | Number (Panggaano/Panukat) | RBN |
595
  | Conditional (Kondisyunal) | RBK |
@@ -605,18 +651,22 @@ def get_readme():
605
  | Enclitics (Paningit) | RBI |
606
  | Interjections (Sambitla) | RBJ |
607
  | Social Formula (Pormularyong Panlipunan) | RBS |
608
- |**Cardinal Number (Bilang)** | CD |
609
  | Digit, Rank, Count | CDB |
610
- | **Topicless (Walang Paksa)** | TS |
611
- | Foreign Words | FW |
612
- | **Punctuation (Pananda)** | PM |
613
  | Period | PMP |
614
  | Exclamation Point | PME |
615
  | Question Mark | PMQ |
616
  | Comma | PMC |
617
  | Semi-colon | PMSC |
618
- | **Symbols** | PMS |
619
 
 
 
 
 
620
  """
621
 
622
 
 
518
  predict_tags(test_sentence)
519
 
520
  def get_readme():
521
+ return """
 
522
 
523
+ ----
524
 
525
+ <!-- ---- -->
526
 
527
+ # BERT Tagalog Part of Speech Tagger (BERTTPOST)
528
+
529
+ ## 📋 Model Details
530
 
531
  <!-- Provide a longer summary of what this model is. -->
532
+ ### Model Description
533
 
 
 
534
  - **Model type:** BERT Tagalog Base Uncased
535
  - **Languages (NLP):** Tagalog, Filipino
536
+ - **Finetuned from model**: [GKLMIP/bert-tagalog-base-uncased](https://huggingface.co/GKLMIP/bert-tagalog-base-uncased)
537
+
538
+ ### Dataset
539
+
540
+ 1000 annotated sentences from Sagum et al.'s Tagalog Corpora, based on the MGNN Tagset convention.
541
+
542
+ | Dataset | Number of Sentences | Percentage |
543
+ |----------------|-----------------|------------|
544
+ | Training Set | 800 | 80% |
545
+ | Testing Set | 200 | 20% |
546
+
547
+ ### Preprocessing
548
+ The corpus contains tagged sentences in the Tagalog language, with each word annotated with its POS tag in the format ```<TAG word>```. To prepare the corpus for training, the following preprocessing steps were performed:
549
+ 1. **Removal of Line Identifier**: the line identifier, such as ```SNT.108970.2066```, was removed from each tagged sentence.
550
+ 2. **Symbol Conversion**: special symbols such as hyphens, quotes, and commas were converted into special tokens (```PMP```, ```PMS```, ```PMC```) so that their meaning is preserved during BERT tokenization.
551
+ 3. **Alignment of Tokenization**: the BERT tokenized words and their corresponding POS tags were aligned to ensure that the tokenization and tagging are consistent.
552
+
553
+
554
+ ### Training
555
+
556
+ This model was trained using the PyTorch library with the following hyperparameters:
557
+
558
+ | **Hyperparameter** | **Value** |
559
+ |--------------------|-----------|
560
+ | Batch Size | 8 |
561
+ | Training Epochs | 5 |
562
+ | Learning Rate | 2e-5 |
563
+ | Optimizer | Adam |
564
+
565
 
566
+ ## 👥 Developed by
567
+ - Saya-ang, Kenth G. ([@syke9p3](https://github.com/syke9p3))
568
+ - Gozum, Denise Julianne S. ([@Xenoxianne](https://github.com/Xenoxianne))
569
+ - Hamor, Mary Grizelle D. ([@mnemoria](https://github.com/mnemoria))
570
+ - Mabansag, Ria Karen B. ([@riavx](https://github.com/riavx))
571
 
572
+ ## ⚒️ Languages and Technologies
573
+
574
+ [![Hugging Face](https://img.shields.io/badge/Hugging%20Face%20-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/)
575
+
576
+ [![Python](https://img.shields.io/badge/Python-F7CC42?style=for-the-badge&logo=python&logoColor=black)](https://www.python.org/)
577
+
578
+ [![Gradio](https://img.shields.io/badge/Gradio-F08705?style=for-the-badge&logo=gradio&logoColor=white)](https://www.gradio.app/)
579
+
580
+ [![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-F37626?style=for-the-badge&logo=jupyter&logoColor=white)](https://jupyter.org/)
581
+
582
+ [![PyTorch](https://img.shields.io/badge/PyTorch-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)](https://pytorch.org/)
583
+
584
+
585
+
586
+
587
+
588
+ ## 🏷️ Tags
589
 
590
  | Part of Speech | Tags |
591
  |-----------------------------------------------|------|
592
+ | **Noun** | ![NNC](https://img.shields.io/badge/NNC-84CC16?style=for-the-badge&logoColor=white) |
593
  | Common Noun | NNC |
594
  | Proper Noun | NNP |
595
  | Proper Noun Abbreviation | NNPA |
596
  | Common Noun Abbreviation | NNCA |
597
+ | **Pronoun** | ![PR](https://img.shields.io/badge/PR-0D9488?style=for-the-badge&logoColor=white) |
598
  | as Subject (Palagyo)/Personal Pronouns Singular | PRS |
599
  | Personal Pronouns | PRP |
600
  | Possessive Subject (Paari) | PRSP |
 
605
  | Comparison (Panulad) | PRC |
606
  | Found (Pahimaton) | PRF |
607
  | Indefinite | PRI |
608
+ | **Determiner** | ![DT](https://img.shields.io/badge/DT-16A34A?style=for-the-badge&logoColor=white) |
609
  | Determiner (Pantukoy) for Common Noun Plural | DTC |
610
  | Determiner (Pantukoy) for Proper Noun | DTP |
611
  | Determiner (Pantukoy) for Proper Noun Plural | DTPP |
612
+ | **Lexical Marker** | ![LM](https://img.shields.io/badge/LM-EAB308?style=for-the-badge&logoColor=white) |
613
+ | **Conjunctions (Pang-ugnay)** | ![CC](https://img.shields.io/badge/CC-16A34ADB2777?style=for-the-badge&logoColor=white) |
614
  | Ligatures (Pang-angkop) | CCP |
615
  | Preposition (Pang-ukol) | CCU |
616
+ | **Verb (Pandiwa)** | ![VB](https://img.shields.io/badge/VB-2563EB?style=for-the-badge&logoColor=white) |
617
  | Neutral/Infinitive | VBW |
618
  | Auxiliary, Modal/Pseudo-verbs | VBS |
619
  | Existential | VBH |
 
628
  | Locative Focus | VBOL |
629
  | Instrumental Focus | VBOI |
630
  | Referential/Measurement Focus | VBRF |
631
+ | **Adjective** | ![JJ](https://img.shields.io/badge/JJ-0D9488?style=for-the-badge&logoColor=white) |
632
  | Describing (Panlarawan) | JJD |
633
  | Used for Comparison (same level) (Pahambing Magkatulad) | JJC |
634
  | Comparison Comparative (more) (Palamang) | JJCC |
635
  | Comparison Superlative (most) (Pasukdol) | JJCS |
636
  | Comparison Negation (not quite) (Di-Magkatulad) | JJCN |
637
  | Describing Number (Pamilang) | JJN |
638
+ | **Adverb (Pang-Abay)** | ![RB](https://img.shields.io/badge/RB-DB2777?style=for-the-badge&logoColor=white) |
639
  | Describing “How” (Pamaraan) | RBD |
640
  | Number (Panggaano/Panukat) | RBN |
641
  | Conditional (Kondisyunal) | RBK |
 
651
  | Enclitics (Paningit) | RBI |
652
  | Interjections (Sambitla) | RBJ |
653
  | Social Formula (Pormularyong Panlipunan) | RBS |
654
+ |**Cardinal Number (Bilang)** | ![CD](https://img.shields.io/badge/CD-2563EB?style=for-the-badge&logoColor=white) |
655
  | Digit, Rank, Count | CDB |
656
+ | **Topicless (Walang Paksa)** | ![TS](https://img.shields.io/badge/TS-0891B2?style=for-the-badge&logoColor=white) |
657
+ | **Foreign Words** | ![FW](https://img.shields.io/badge/FW-EA580C?style=for-the-badge&logoColor=white) |
658
+ | **Punctuation (Pananda)** | ![PM](https://img.shields.io/badge/PM-DB2777?style=for-the-badge&logoColor=white) |
659
  | Period | PMP |
660
  | Exclamation Point | PME |
661
  | Question Mark | PMQ |
662
  | Comma | PMC |
663
  | Semi-colon | PMSC |
664
+ | Other Symbols | PMS |
665
 
666
+ ## Bias, Risks, and Limitations
667
+
668
+ This model has not been fully tested, so please use it with caution.
669
+
670
  """
671
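The three preprocessing steps listed in the new README string (line-identifier removal, symbol conversion, tokenization alignment) can be sketched roughly as follows. This is only an illustration, not the code from `app.py` or the training notebook: the sample line, the helper names, the symbol-to-token mapping, and the `toy_wordpiece` stand-in for the real BERT tokenizer are all assumptions made for the example.

```python
import re

# Hypothetical corpus line in the "<TAG word>" format the README describes.
SAMPLE = "SNT.108970.2066\t<DTC Ang> <JJD maganda> <NNC bata> <PMP .>"

LINE_ID = re.compile(r"^SNT\.\d+\.\d+\s*")   # e.g. SNT.108970.2066
TAGGED = re.compile(r"<(\S+)\s+([^>]+)>")    # captures (TAG, word)

# Assumed mapping, based on the tag table: period -> PMP, comma -> PMC,
# other symbols -> PMS. The real conversion rules may differ.
SYMBOL_TOKENS = {".": "PMP", ",": "PMC", "-": "PMS", '"': "PMS", "'": "PMS"}


def parse_tagged_line(line):
    """Step 1: drop the line identifier and return (words, tags)."""
    body = LINE_ID.sub("", line.strip())
    pairs = TAGGED.findall(body)
    words = [word.strip() for _, word in pairs]
    tags = [tag for tag, _ in pairs]
    return words, tags


def convert_symbols(words):
    """Step 2: replace bare symbols with their special tokens."""
    return [SYMBOL_TOKENS.get(word, word) for word in words]


def toy_wordpiece(word, size=4):
    """Stand-in for a BERT tokenizer: split a word into fixed-size pieces."""
    if not word:
        return []
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align_tags(words, tags, tokenize):
    """Step 3: repeat each word's tag for every sub-token it produces."""
    tokens, aligned = [], []
    for word, tag in zip(words, tags):
        pieces = tokenize(word)
        tokens.extend(pieces)
        aligned.extend([tag] * len(pieces))
    return tokens, aligned


words, tags = parse_tagged_line(SAMPLE)
tokens, aligned_tags = align_tags(convert_symbols(words), tags, toy_wordpiece)
```

Repeating the word-level tag on every sub-token is one common alignment strategy; labelling only the first sub-token and masking the rest is another, and the README does not say which one was used.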
 
672
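The dataset and hyperparameter tables added in this commit imply a simple training budget, which the sketch below works out. It uses only the standard library; `TrainConfig`, `split_dataset`, and the placeholder sentences are hypothetical names for illustration, and whether the original split was shuffled is not stated in the README.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the hyperparameter table in the README string.
    batch_size: int = 8
    epochs: int = 5
    learning_rate: float = 2e-5
    optimizer: str = "Adam"


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


config = TrainConfig()
sentences = [f"sentence-{i}" for i in range(1000)]  # placeholder data
train_set, test_set = split_dataset(sentences)
total_steps = steps_per_epoch(len(train_set), config.batch_size) * config.epochs
```

With 800 training sentences and a batch size of 8, each epoch is 100 steps, so 5 epochs come to 500 optimizer steps. In PyTorch the optimizer row would typically correspond to `torch.optim.Adam(model.parameters(), lr=2e-5)`, though the commit does not show the training code.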