---
language:
- en
base_model:
- CrabInHoney/urlbert-tiny-base-v3
pipeline_tag: text-classification
tags:
- url
- cybersecurity
- urls
- links
- classification
- phishing-detection
- tiny
- phishing
- malware
- defacement
- transformers
- urlbert
- bert
- malicious
license: apache-2.0
new_version: CrabInHoney/urlbert-tiny-v4-malicious-url-classifier
---

# URLBERT-Tiny-v3 Malicious URL Classifier

This is a lightweight BERT model fine-tuned to classify URLs into four categories: benign, phishing, malware, and defacement.

## Model Details

- **Model size**: 3.69M parameters  
- **Tensor type**: F32  
- **Model weight size**: 14.8 MB  
- **Base model**: [CrabInHoney/urlbert-tiny-base-v3](https://huggingface.co/CrabInHoney/urlbert-tiny-base-v3)  
- **Dataset**: [Malicious URLs Dataset](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)  
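
As a quick sanity check, the quoted parameter count can be verified locally. This is a minimal sketch, assuming only that the checkpoint loads with `BertForSequenceClassification`, as in the usage example below:

```python
from transformers import BertForSequenceClassification

# Load the checkpoint and sum the parameter counts; the total should
# land near the 3.69M figure quoted above.
model = BertForSequenceClassification.from_pretrained(
    "CrabInHoney/urlbert-tiny-v3-malicious-url-classifier"
)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.2f}M parameters")
```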

## Model Evaluation Results

The model was evaluated on a test set with the following classification metrics:

| Class        | Precision  | Recall     | F1-Score   |        
|--------------|------------|------------|------------|
| Benign       | 0.987695   | 0.993717   | 0.990697   |
| Defacement   | 0.988510   | 0.998963   | 0.993709   |
| Malware      | 0.988291   | 0.960332   | 0.974111   |
| Phishing     | 0.958425   | 0.930826   | 0.944423   |
| **Accuracy** | 0.983738   | 0.983738   | 0.983738   |
| **Macro Avg**| 0.980730   | 0.970959   | 0.975735   |
| **Weighted Avg** | 0.983615 | 0.983738   | 0.983627   |
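
The exact test split is not published here, but comparable per-class metrics can be computed on any labeled URL set with `sklearn.metrics.classification_report`. This is only a sketch; `urls` and `true_labels` are hypothetical placeholders for your own evaluation data:

```python
from sklearn.metrics import classification_report
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="CrabInHoney/urlbert-tiny-v3-malicious-url-classifier",
)

# Raw label ids map to class names as in the usage example below.
label_mapping = {"LABEL_0": "benign", "LABEL_1": "defacement",
                 "LABEL_2": "malware", "LABEL_3": "phishing"}

# Hypothetical placeholders: substitute your own labeled URLs here.
urls = ["wikiobits.com/Obits/TonyProudfoot"]
true_labels = ["benign"]

predicted = [label_mapping[classifier(u)[0]["label"]] for u in urls]
print(classification_report(true_labels, predicted, digits=6))
```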

## Usage Example

Below is an example of how to use the model for URL classification using the Hugging Face `transformers` library:

```python
from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
import torch

# Select the device (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

# Load the model and tokenizer
model_name = "CrabInHoney/urlbert-tiny-v3-malicious-url-classifier"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.to(device)

# Build the classification pipeline
# (return_all_scores is deprecated in recent transformers releases;
# top_k=None is the modern equivalent)
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    return_all_scores=True
)

# Sample URLs to test
test_urls = [
    "wikiobits.com/Obits/TonyProudfoot",
    "http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb",
]

# Map raw labels to human-readable class names
label_mapping = {
    "LABEL_0": "benign",
    "LABEL_1": "defacement",
    "LABEL_2": "malware",
    "LABEL_3": "phishing"
}

# Classify each URL and print per-class probabilities
for url in test_urls:
    results = classifier(url)
    print(f"\nURL: {url}")
    for result in results[0]:
        label = result['label']
        score = result['score']
        friendly_label = label_mapping.get(label, label)
        print(f"Class: {friendly_label}, probability: {score:.4f}")
```

### Example Output:
```
URL: wikiobits.com/Obits/TonyProudfoot
Class: benign, probability: 0.9953
Class: defacement, probability: 0.0000
Class: malware, probability: 0.0000
Class: phishing, probability: 0.0046

URL: http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb
Class: benign, probability: 0.0000
Class: defacement, probability: 0.0001
Class: malware, probability: 0.9998
Class: phishing, probability: 0.0001
```
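
If you prefer to skip the `pipeline` helper, the same prediction can be made with a plain forward pass and a softmax over the logits. This is a minimal sketch; the label order follows the `label_mapping` above:

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

model_name = "CrabInHoney/urlbert-tiny-v3-malicious-url-classifier"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.eval()

# Class names in label-id order (LABEL_0 .. LABEL_3)
labels = ["benign", "defacement", "malware", "phishing"]

url = "http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb"
inputs = tokenizer(url, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
print(labels[int(probs.argmax())], f"{float(probs.max()):.4f}")
```

The pipeline in the usage example can also take the whole `test_urls` list in one call, which batches tokenization and inference in a single pass.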