atsizelti committed
Commit d4d4df0 · verified · Parent: d0e7899

Update README.md

Files changed (1): README.md (+80 −49)
README.md CHANGED
@@ -1,49 +1,80 @@
- ### Model Description
- This model is a fine-tuned version of the dbmdz/bert-base-turkish-uncased architecture, designed for the binary classification task of identifying organizational accounts on Turkish Twitter. It leverages the pre-trained BERT model's understanding of Turkish language and context to distinguish organizational from non-organizational user accounts.
-
- ### Model Training and Optimization
- Base Model: dbmdz/bert-base-turkish-uncased
-
- Training Data: The model was trained and validated on a dataset of Twitter accounts (descriptions, names, screen names) annotated with labels indicating whether each account belongs to an organization.
-
- ### Fine-Tuning Process
- Data Preprocessing: Combined user descriptions, names, and screen names into a single text field for input.
-
- Data Splitting: Split the dataset into 80% for training and 20% for validation.
-
- Tokenization: Used the AutoTokenizer from Hugging Face to prepare text inputs for the BERT model.
-
- Hyperparameter Optimization: Used Optuna to search over learning rate, batch size, and number of training epochs, minimizing validation loss.
-
- Optimal Hyperparameters:
- Learning Rate: 1.23e-5
- Batch Size: 32
- Epochs: 2
-
- ## Evaluation Results
- The fine-tuned model performs well on the validation set:
- Precision: 0.945
- Recall: 0.95
- F1-Score (Macro): 0.948
- Accuracy: 0.95
-
- Confusion Matrix:
- [[369  22]
-  [ 19 375]]
+ ---
+ language: "tr"
+ tags:
+ - "bert"
+ - "turkish"
+ - "text-classification"
+ license: "apache-2.0"
+ datasets:
+ - "custom"
+ metrics:
+ - "precision"
+ - "recall"
+ - "f1"
+ - "accuracy"
+ ---
+
+ # BERT-based Organization Detection Model for Turkish Texts
+
+ ## Model Description
+
+ This model is a fine-tuned version of `dbmdz/bert-base-turkish-uncased` for detecting organization accounts on Turkish Twitter. It was developed as part of the Politus Project's work on analyzing organizational presence in social media data.
+
+ ## Model Architecture
+
+ - **Base Model:** BERT (`dbmdz/bert-base-turkish-uncased`)
+ - **Training Data:** Twitter data from 3,922 accounts with high organization-related activity (m3inference organization scores above 0.7). Each account was labeled as organizational or not by a human annotator, based on its user name, screen name, and description.
+
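+ The earlier revision of this card notes that the user name, screen name, and description were combined into a single text field before training. A minimal sketch of that preprocessing step, assuming hypothetical file and column names (`accounts.csv`, `name`, `screen_name`, `description`):
+
+ ```python
+ import pandas as pd
+
+ # Hypothetical input file and column names -- the card does not specify them.
+ df = pd.read_csv("accounts.csv")
+
+ # Concatenate the three profile fields into the single "text" input.
+ df["text"] = (
+     df["name"].fillna("")
+     + " " + df["screen_name"].fillna("")
+     + " " + df["description"].fillna("")
+ ).str.strip()
+ ```
+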
+ ## Training Setup
+
+ - **Tokenization:** Used Hugging Face's AutoTokenizer, padding sequences to a maximum length of 128 tokens.
+ - **Dataset Split:** 80% training, 20% validation.
+ - **Training Parameters** (see the sketch after this list):
+   - Epochs: 3
+   - Training batch size: 8
+   - Evaluation batch size: 16
+   - Warmup steps: 500
+   - Weight decay: 0.01
+
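+ A minimal sketch of this configuration with the `transformers` Trainer API; the card does not state which training loop was used, and the `text` column name is an assumption carried over from the preprocessing sketch above:
+
+ ```python
+ from transformers import AutoTokenizer, TrainingArguments
+
+ tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
+
+ def encode(batch):
+     # Pad/truncate the combined text field to 128 tokens, as described above.
+     return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
+
+ args = TrainingArguments(
+     output_dir="org-classifier",
+     num_train_epochs=3,
+     per_device_train_batch_size=8,
+     per_device_eval_batch_size=16,
+     warmup_steps=500,
+     weight_decay=0.01,
+ )
+ ```
+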
+ ## Hyperparameter Tuning
+
+ Hyperparameter search was performed with Optuna; the best trial used:
+
+ - **Learning rate:** 1.2323083424093641e-05 (≈1.23e-5)
+ - **Batch size:** 32
+ - **Epochs:** 2
+
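+ A sketch of how such a search can be wired up with Optuna. The search ranges and trial count are illustrative assumptions, not the values used for this model, and `model_init`, `train_ds`, and `val_ds` are assumed to come from the training setup above:
+
+ ```python
+ import optuna
+ from transformers import Trainer, TrainingArguments
+
+ def objective(trial):
+     # Sample the three hyperparameters the card says were tuned.
+     args = TrainingArguments(
+         output_dir=f"trial-{trial.number}",
+         learning_rate=trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
+         per_device_train_batch_size=trial.suggest_categorical("batch_size", [8, 16, 32]),
+         num_train_epochs=trial.suggest_int("epochs", 2, 4),
+     )
+     trainer = Trainer(model=model_init(), args=args,
+                       train_dataset=train_ds, eval_dataset=val_ds)
+     trainer.train()
+     # Optuna minimizes the returned validation loss.
+     return trainer.evaluate()["eval_loss"]
+
+ study = optuna.create_study(direction="minimize")
+ study.optimize(objective, n_trials=20)
+ print(study.best_params)
+ ```
+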
+ ## Evaluation Metrics
+
+ - **Precision on Validation Set:** 0.94 (organization class)
+ - **Recall on Validation Set:** 0.95 (organization class)
+ - **F1-Score (Macro Average):** 0.95
+ - **Accuracy:** 0.95
+ - **Confusion Matrix on Validation Set:**
+
+ ```
+ [[369,  22],
+  [ 19, 375]]
+ ```
+
+ - **Hand-coded Sample of 1,000 Accounts** (the confusion matrix below covers 974 of them):
+   - **Precision:** 0.91
+   - **F1-Score (Macro Average):** 0.947
+   - **Confusion Matrix:**
+
+ ```
+ [[936,   3],
+  [  4,  31]]
+ ```
+
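+ The headline numbers can be reproduced directly from the validation confusion matrix, assuming the conventional row/column order of [non-organization, organization]:
+
+ ```python
+ import numpy as np
+
+ cm = np.array([[369, 22],
+                [19, 375]])
+ tn, fp, fn, tp = cm.ravel()
+ precision = tp / (tp + fp)        # 375/397 ~= 0.944
+ recall = tp / (tp + fn)           # 375/394 ~= 0.952
+ accuracy = (tp + tn) / cm.sum()   # 744/785 ~= 0.948
+ f1_org = 2 * precision * recall / (precision + recall)  # ~= 0.948
+ ```
+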
+ ## How to Use
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ model = AutoModelForSequenceClassification.from_pretrained("atsizelti/turkish_org_classifier_hand_coded")
+ tokenizer = AutoTokenizer.from_pretrained("atsizelti/turkish_org_classifier_hand_coded")
+
+ text = "Örnek metin buraya girilir."  # "Sample text goes here."
+ inputs = tokenizer(text, return_tensors="pt")
+ with torch.no_grad():  # inference only, no gradients needed
+     outputs = model(**inputs)
+ predictions = outputs.logits.argmax(-1)
+ ```
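+
+ The card does not document the label mapping; judging from the confusion-matrix ordering, index 0 is presumably non-organization and index 1 organization. Check `model.config.id2label` to confirm before relying on the predicted index.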