---
language: "tr"
tags:
- "bert"
- "turkish"
- "text-classification"
license: "apache-2.0"
datasets:
- "custom"
metrics:
- "precision"
- "recall"
- "f1"
- "accuracy"
---

# BERT-based Organization Detection Model for Turkish Texts

## Model Description

This model is fine-tuned from `dbmdz/bert-base-turkish-uncased` to detect organization accounts on Turkish Twitter. It was developed as part of the Politus Project's effort to analyze organizational presence in social media data.

## Model Architecture

- **Base Model:** BERT (`dbmdz/bert-base-turkish-uncased`)
- **Training Data:** Twitter data from 3,922 accounts that m3inference scored as likely organizations (organization score above 0.7). A human annotator then labeled each account based on its user name, screen name, and description. A sketch of the pre-filtering step follows.
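
The pre-filtering might look roughly like the sketch below. This is a minimal illustration, not the project's actual pipeline: it assumes m3inference's documented `M3Twitter` helper, and the screen name and threshold placement are illustrative.

```python
# Illustrative sketch only: keep accounts whose m3inference
# organization score exceeds 0.7, as described above.
from m3inference import M3Twitter

m3twitter = M3Twitter(cache_dir="twitter_cache")

def is_likely_org(screen_name: str, threshold: float = 0.7) -> bool:
    """True if m3inference scores the account as an organization."""
    pred = m3twitter.infer_screen_name(screen_name)  # fetches the profile
    return pred["output"]["org"]["is-org"] > threshold

# Hypothetical candidate list; accounts passing this screen were
# subsequently labeled by a human annotator.
candidates = [name for name in ["ornek_hesap"] if is_likely_org(name)]
```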

## Training Setup

- **Tokenization:** Hugging Face's `AutoTokenizer`, with sequences padded and truncated to a maximum length of 128 tokens.
- **Dataset Split:** 80% training, 20% validation.
- **Training Parameters** (see the sketch after this list):
  - Epochs: 3
  - Training batch size: 8
  - Evaluation batch size: 16
  - Warmup steps: 500
  - Weight decay: 0.01
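
A minimal sketch of this setup with the Hugging Face `Trainer` API is shown below. The hyperparameters mirror the list above; the dataset wrapper and placeholder texts are assumptions, since the annotated data is not public.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class AccountDataset(torch.utils.data.Dataset):
    """Tokenizes texts to 128 tokens and pairs them with labels."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, padding="max_length", truncation=True,
                             max_length=128)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-uncased", num_labels=2)

# Placeholder examples standing in for the 80/20 split of annotated accounts.
train_ds = AccountDataset(["örnek kurum hesabı", "örnek kişisel hesap"],
                          [1, 0], tokenizer)
val_ds = AccountDataset(["başka bir örnek"], [0], tokenizer)

args = TrainingArguments(
    output_dir="turkish-org-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```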

## Hyperparameter Tuning

Tuning was performed with Optuna (a sketch of the search follows); the best trial used:
- **Learning rate:** 1.2323083424093641e-05
- **Batch size:** 32
- **Epochs:** 2
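
Continuing the sketch above, `Trainer.hyperparameter_search` can drive an Optuna study directly. The search space below is an assumption (the actual ranges are not documented here); only the best values are reported above.

```python
# Assumed search space; reuses args, train_ds, and val_ds from the
# training sketch above.
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5,
                                             log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
    }

def model_init():
    # A fresh model per trial, as hyperparameter_search requires.
    return AutoModelForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-uncased", num_labels=2)

search_trainer = Trainer(model_init=model_init, args=args,
                         train_dataset=train_ds, eval_dataset=val_ds)
best = search_trainer.hyperparameter_search(
    backend="optuna", hp_space=hp_space, n_trials=20, direction="minimize")
# Best run reported above: lr ≈ 1.23e-05, batch size 32, 2 epochs.
```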

## Evaluation Metrics

- **Precision on Validation Set:** 0.94 (organization class)
- **Recall on Validation Set:** 0.95 (organization class)
- **F1-Score (Macro Average):** 0.95
- **Accuracy:** 0.95
- **Confusion Matrix on Validation Set** (re-derived in the check below):
```
[[369,  22],
 [ 19, 375]]
```

- **Hand-coded Sample of 1,000 Accounts:**
  - **Precision:** 0.91
  - **F1-Score (Macro Average):** 0.947
  - **Confusion Matrix:**
```
[[936,   3],
 [  4,  31]]
```
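
The reported figures can be re-derived from the two matrices. The sketch below assumes rows are true labels, columns are predicted labels, and class 1 is "organization"; under that reading it reproduces the published numbers.

```python
import numpy as np

def summarize(cm):
    """Organization-class precision/recall, macro F1, and accuracy
    from a 2x2 confusion matrix (rows true, columns predicted)."""
    tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
    prec_org, rec_org = tp / (tp + fp), tp / (tp + fn)
    f1_org = 2 * prec_org * rec_org / (prec_org + rec_org)
    prec_neg, rec_neg = tn / (tn + fn), tn / (tn + fp)
    f1_neg = 2 * prec_neg * rec_neg / (prec_neg + rec_neg)
    return prec_org, rec_org, (f1_org + f1_neg) / 2, (tp + tn) / cm.sum()

print(summarize(np.array([[369, 22], [19, 375]])))  # ≈ 0.94, 0.95, 0.95, 0.95
print(summarize(np.array([[936, 3], [4, 31]])))     # precision ≈ 0.91, macro F1 ≈ 0.947
```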

## How to Use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "atsizelti/turkish_org_classifier_hand_coded")
tokenizer = AutoTokenizer.from_pretrained(
    "atsizelti/turkish_org_classifier_hand_coded")

text = "Örnek metin buraya girilir."  # "Sample text goes here."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()  # class index; label names come from model.config.id2label
```
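
Equivalently, the `pipeline` helper wraps tokenization, inference, and label mapping in one call. A brief sketch (the repository ID is taken from above; the returned label names depend on the model's `id2label` config):

```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="atsizelti/turkish_org_classifier_hand_coded")
result = classifier("Örnek metin buraya girilir.")
# Returns a list of {"label": ..., "score": ...} dicts.
print(result)
```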