MBMMurad committed on
Commit
4e4157d
·
1 Parent(s): 0c49c85

Create README.md

  1. README.md +86 -0
README.md ADDED
@@ -0,0 +1,86 @@
+ ---
+ language:
+ - bn
+ metrics:
+ - f1
+ pipeline_tag: token-classification
+ ---
+ # Bangla-Person-Name-Extractor
+ This repository contains the implementation of a Bangla Person Name Extractor, a model that extracts person-name entities from a given sentence. We approached this as a token classification task, i.e. tagging each token as either part of a person's name or not. We built on the [BanglaBERT](https://github.com/csebuetnlp/banglabert) model, fine-tuning it for binary token classification on a custom-prepared dataset. The model is published on the Hugging Face Hub for easier access and use.
+
+
+ # Datasets
+ We used two datasets to train and evaluate our pipeline.
+ 1. [Bengali-NER/annotated data at master · Rifat1493/Bengali-NER](https://github.com/Rifat1493/Bengali-NER/tree/master/annotated%20data)
+ 2. [banglakit/bengali-ner-data](https://raw.githubusercontent.com/banglakit/bengali-ner-data/master/main.jsonl)
+
+ The two datasets use quite different annotation formats, so we preprocessed each of them before merging. Please refer to [this notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/prepare-dataset.ipynb) to prepare the dataset as required.
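
As a rough illustration of the preprocessing described above (the linked notebook remains the authoritative version), the sketch below maps NER tags from two differently formatted sources onto a binary scheme in which 1 marks a person-name token and 0 marks everything else, then merges them. The tag names (`B-PER`, `L-PER`, and so on), the toy rows, and the helper names `to_binary` and `merge_datasets` are assumptions made for this example; they are not taken from the datasets or from the notebook.

```python
def to_binary(tags):
    """Map NER tags to 1 (person) or 0 (anything else).

    Assumption for this sketch: person tags end in "PER"
    (e.g. "B-PER", "I-PER", "L-PER", "U-PER").
    """
    return [1 if tag.upper().endswith("PER") else 0 for tag in tags]


def merge_datasets(*datasets):
    """Concatenate datasets of (tokens, tags) pairs into one binary-labelled list."""
    merged = []
    for dataset in datasets:
        for tokens, tags in dataset:
            if len(tokens) != len(tags):
                continue  # skip malformed examples
            merged.append({"tokens": list(tokens), "labels": to_binary(tags)})
    return merged


# Toy rows standing in for the two sources (not real dataset entries).
source_a = [(["রহিম", "ঢাকায়", "থাকেন"], ["B-PER", "B-LOC", "O"])]
source_b = [(["আব্দুর", "রহিম", "এসেছেন"], ["B-PER", "L-PER", "O"])]

merged = merge_datasets(source_a, source_b)
print(merged[1]["labels"])
```

Collapsing every tagging scheme to a single 0/1 label before merging means that differences between the source schemes no longer matter once the data is merged.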
+
+ # Training and Evaluation
+ We treated this problem as a token classification task, so fine-tuning BanglaBERT was a natural fit. [BanglaBERT](https://huggingface.co/csebuetnlp/banglabert) is an [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) discriminator model pretrained with the Replaced Token Detection (RTD) objective. Models fine-tuned from this checkpoint achieve state-of-the-art results on many Bengali NLP tasks.
+ We fine-tuned two BanglaBERT checkpoints:
+ 1. [BanglaBERT](https://huggingface.co/csebuetnlp/banglabert)
+ 2. [BanglaBERT small](https://huggingface.co/csebuetnlp/banglabert_small)
+
+ BanglaBERT performed better than BanglaBERT small (83% vs. 79% F1 score on the test set).
+ Please refer to [this notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/Training%20Notebook%20%3A%20Person%20Name%20Extractor%20using%20BanglaBERT.ipynb) to see the training process.
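
For reference, the F1 score is the harmonic mean of precision and recall on the positive (person-name) class. The following is a minimal sketch of a token-level computation. Whether the reported scores were computed per token or per entity is not stated here, and the function name `binary_f1` and the toy labels are hypothetical.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Token-level F1 for the positive class (1 = person-name token)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = binary_f1(y_true, y_pred)
print(f"F1: {score:.3f}")
```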
+
+ **Quantitative results**
+ Please refer to [this notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/Inference%20and%20Evaluation%20Notebook.ipynb) to see the evaluation process.
+ <br></br>
+ ![Quantitative results](https://github.com/MBMMurad/asl-2d-to-3d/blob/master/Screenshot%20from%202023-07-13%2023-11-59.png)
+
+ # How to use it?
+ [This notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/Inference_template.ipynb) contains an inference template for running the model on a single sentence.
+ <br></br>
+ You can also run inference directly with the following code snippet. Just change the sentence.
+
+ ```
+ # pip install transformers==4.30.2 torch numpy==1.23.5
+ # pip install git+https://github.com/csebuetnlp/normalizer
+ import numpy as np
+ import torch
+ from normalizer import normalize
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+ model = AutoModelForTokenClassification.from_pretrained("MBMMurad/BanglaBERT_Person_Name_Extractor")
+ tokenizer = AutoTokenizer.from_pretrained("MBMMurad/BanglaBERT_Person_Name_Extractor")
+ model.eval()  # disable dropout for inference
+
+ def inference_fn(sentence):
+     # Normalize the text before tokenizing, as recommended for BanglaBERT
+     sentence = normalize(sentence)
+     tokens = tokenizer.tokenize(sentence)
+     inputs = tokenizer.encode(sentence, return_tensors="pt")
+     with torch.no_grad():
+         outputs = model(inputs).logits
+     # Drop the predictions for the special tokens at the start and end
+     predictions = torch.argmax(outputs[0], dim=1)[1:-1].numpy()
+     # Keep the tokens predicted as class 1 (person name)
+     idxs = np.where(predictions == 1)
+     return np.array(tokens)[idxs]
+
+ sentence = "আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম।"
+ pred = inference_fn(sentence)
+ print(f"Input Sentence : {sentence}")
+ print(f"Person Name Entities : {pred}")
+
+ sentence = "ইঞ্জিনিয়ার্স ইনস্টিটিউশন চট্টগ্রামের সাবেক সভাপতি প্রকৌশলী দেলোয়ার হোসেন মজুমদার প্রথম আলোকে বলেন, 'সংকট নিরসনে বর্তমান খালগুলোকে পূর্ণ প্রবাহে ফিরিয়ে আনার পাশাপাশি নতুন তিনটি খাল খনন জরুরি।'"
+ pred = inference_fn(sentence)
+ print(f"Input Sentence : {sentence}")
+ print(f"Person Name Entities : {pred}")
+
+ sentence = "দলীয় নেতারা তাঁর বাসভবনে যেতে চাইলে আটক হন।"
+ pred = inference_fn(sentence)
+ print(f"Input Sentence : {sentence}")
+ print(f"Person Name Entities : {pred}")
+ ```
+
+ **Output:**
+ ```
+ Input Sentence : আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম।
+ Person Name Entities : ['আব্দুর' 'রহিম']
+
+
+ Input Sentence : ইঞ্জিনিয়ার্স ইনস্টিটিউশন চট্টগ্রামের সাবেক সভাপতি প্রকৌশলী দেলোয়ার হোসেন মজুমদার প্রথম আলোকে বলেন, 'সংকট নিরসনে বর্তমান খালগুলোকে পূর্ণ প্রবাহে ফিরিয়ে আনার পাশাপাশি নতুন তিনটি খাল খনন জরুরি।'
+ Person Name Entities : ['দেলোয়ার' 'হোসেন' 'মজুমদার']
+
+
+ Input Sentence : দলীয় নেতারা তাঁর বাসভবনে যেতে চাইলে আটক হন।
+ Person Name Entities : []
+ ```
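
The snippet above returns individual tokens rather than complete names. One possible post-processing step, sketched here under assumptions, groups consecutive tokens predicted as person names into full names and re-attaches subword continuation pieces, assuming the tokenizer marks them with a `##` prefix (WordPiece style). The helper `group_person_names` is hypothetical and not part of the released code. It expects the token list together with the per-token 0/1 predictions that `inference_fn` computes internally, and the token list below is hand-written for illustration rather than real tokenizer output.

```python
def group_person_names(tokens, predictions):
    """Join consecutive person-name tokens (label 1) into full names."""
    names, current = [], []
    for token, label in zip(tokens, predictions):
        if label == 1:
            if token.startswith("##") and current:
                # Continuation piece: glue it onto the previous word
                current[-1] += token[2:]
            else:
                current.append(token[2:] if token.startswith("##") else token)
        elif current:
            # A non-person token closes the name collected so far
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Hand-written tokens and labels for illustration only.
tokens = ["আব্দুর", "রহিম", "নামের", "কাস্টমারকে", "একশ", "টাকা", "বাকি", "দিলাম", "।"]
predictions = [1, 1, 0, 0, 0, 0, 0, 0, 0]
names = group_person_names(tokens, predictions)
print(names)
```

Grouping by adjacency is a simple heuristic: two different names that appear back to back with no token between them would be merged into one.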