---
language: en
tags:
- cybersecurity
- malicious-url-detection
- bert
- transformers
- phishing-detection
license: apache-2.0
---
# Malicious URL Detection Model
> A fine-tuned **BERT-LoRA** model for detecting malicious URLs, including phishing, malware, and defacement threats.
## Model Description
This model is a **fine-tuned BERT-based classifier** designed to detect **malicious URLs** in real time. It is fine-tuned with **Low-Rank Adaptation (LoRA)**, which trains a small set of added low-rank weights instead of the full network, reducing computational cost while maintaining high accuracy.
The model classifies URLs into **four categories**:
- **Benign**
- **Defacement**
- **Phishing**
- **Malware**
It reaches **98% validation accuracy** and an **F1-score of 0.965**; per-class results are listed under Evaluation Results.
---
## Intended Uses
### Use Cases
- Real-time URL classification for cybersecurity tools
- Phishing and malware detection for online safety
- Integration into browser extensions for instant threat alerts
- Security monitoring for Security Operations Centers (SOCs)
---
## Model Details
- **Model Type:** BERT-based URL Classifier
- **Fine-Tuning Method:** LoRA (Low-Rank Adaptation)
- **Base Model:** `bert-base-uncased`
- **Number of Parameters:** 110M
- **Dataset:** Kaggle Malicious URLs Dataset (~651,191 samples)
- **Max Sequence Length:** `128`
- **Framework:** 🤗 `transformers`, `torch`, `peft`
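
To give a sense of why LoRA lowers fine-tuning cost, the sketch below counts the trainable weights that rank-`r` adapters would add to `bert-base-uncased`. The rank (`r = 8`) and the choice to adapt only the query and value projections are assumptions for illustration; this card does not state the actual LoRA configuration. The hidden size (768) and layer count (12) are the standard BERT-base dimensions.

```python
# Rough count of trainable LoRA parameters for bert-base-uncased.
# ASSUMPTIONS (not stated in this card): rank r = 8, adapters on the
# query and value projections only.

HIDDEN_SIZE = 768           # BERT-base hidden dimension
NUM_LAYERS = 12             # BERT-base encoder layers
BASE_PARAMS = 110_000_000   # approximate model size listed above


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA learns B (d_out x rank) and A (rank x d_in) instead of a full d_out x d_in update."""
    return rank * d_in + d_out * rank


def total_lora_params(rank: int, matrices_per_layer: int = 2) -> int:
    """Adapter parameters summed over all encoder layers."""
    per_matrix = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, rank)
    return NUM_LAYERS * matrices_per_layer * per_matrix


if __name__ == "__main__":
    trainable = total_lora_params(rank=8)
    print(f"LoRA adapter parameters: {trainable:,}")
    print(f"Share of base model: {trainable / BASE_PARAMS:.2%}")
```

The count covers the adapters only; a classification head trained alongside them would add a small number of further parameters.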
---
## How to Use
You can use this model directly with 🤗 **Transformers**:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "your-huggingface-model-name"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example URL
url = "http://example.com/login"

# Tokenize and predict (gradients are not needed for inference)
inputs = tokenizer(url, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1).item()

# Map the predicted class index to its label
label_map = {0: "Benign", 1: "Defacement", 2: "Phishing", 3: "Malware"}
print(f"Prediction: {label_map[prediction]}")
```
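
The usage example reports only the top class. A minimal sketch of how the raw logits could also be turned into a confidence score is shown below, using a plain-Python softmax so it runs without the model. The helper names and the example logits are hypothetical; the label order is copied from `label_map` above.

```python
import math

# Label order copied from label_map in the usage example.
LABELS = ["Benign", "Defacement", "Phishing", "Malware"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


if __name__ == "__main__":
    # Hypothetical logits; in practice use outputs.logits[0].tolist().
    example_logits = [0.2, -1.3, 3.1, 0.4]
    label, confidence = predict_with_confidence(example_logits)
    print(f"{label} ({confidence:.1%})")
```

Softmax probabilities are not calibrated by default, so a threshold on them should be validated on held-out data before it drives alerts.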
---
## Training Details
- **Batch Size:** `16`
- **Epochs:** `5`
- **Learning Rate:** `2e-5`
- **Optimizer:** AdamW with weight decay
- **Loss Function:** Weighted Cross-Entropy
- **Evaluation Strategy:** Epoch-based
- **Fine-Tuning Strategy:** LoRA applied to BERT layers
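
The weighted cross-entropy loss listed above can be illustrated as follows. This sketch assumes inverse-frequency ("balanced") class weights, which is one common choice; the card does not say how the weights were actually derived. The class counts are made up for illustration and are not the dataset's real distribution.

```python
import math


def balanced_class_weights(class_counts):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {label: total / (num_classes * n) for label, n in class_counts.items()}


def weighted_cross_entropy(logits, target_index, weights):
    """Loss for one sample: -w[target] * log(softmax(logits)[target])."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_prob = logits[target_index] - log_sum_exp
    return -weights[target_index] * log_prob


if __name__ == "__main__":
    # HYPOTHETICAL counts for illustration only, not the real class split.
    counts = {"Benign": 400, "Defacement": 100, "Phishing": 100, "Malware": 40}
    weights = balanced_class_weights(counts)
    weight_list = [weights[name] for name in counts]

    # The same uninformative prediction costs more on a rare class.
    uniform_logits = [0.0, 0.0, 0.0, 0.0]
    for index, name in enumerate(counts):
        loss = weighted_cross_entropy(uniform_logits, index, weight_list)
        print(f"{name}: weight={weights[name]:.2f}, loss={loss:.3f}")
```

In PyTorch, per-class weights of this kind can be passed to `torch.nn.CrossEntropyLoss` through its `weight` argument.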
---
## Evaluation Results
| Metric | Value |
| ------------ | --------- |
| Accuracy | **98%** |
| Precision | **0.96** |
| Recall | **0.97** |
| **F1 Score** | **0.965** |
### Category-wise Performance
| Category | Precision | Recall | F1-Score |
| -------------- | --------- | ------ | -------- |
| **Benign** | 0.98 | 0.99 | 0.985 |
| **Defacement** | 0.98 | 0.99 | 0.985 |
| **Phishing** | 0.93 | 0.94 | 0.935 |
| **Malware** | 0.95 | 0.96 | 0.955 |
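
As a consistency check, the per-class rows can be recombined into the overall figures. The sketch below recomputes each F1 score from precision and recall and then takes unweighted (macro) averages. The card does not state which averaging was used, so macro averaging is an assumption; it happens to reproduce the overall table.

```python
# Per-class (precision, recall) copied from the table above.
PER_CLASS = {
    "Benign": (0.98, 0.99),
    "Defacement": (0.98, 0.99),
    "Phishing": (0.93, 0.94),
    "Malware": (0.95, 0.96),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def macro_summary(per_class):
    """Unweighted averages of per-class precision, recall, and F1."""
    precisions = [p for p, _ in per_class.values()]
    recalls = [r for _, r in per_class.values()]
    f1_scores = [f1(p, r) for p, r in per_class.values()]
    return mean(precisions), mean(recalls), mean(f1_scores)


if __name__ == "__main__":
    for name, (p, r) in PER_CLASS.items():
        print(f"{name}: F1 = {f1(p, r):.3f}")
    macro_p, macro_r, macro_f1 = macro_summary(PER_CLASS)
    print(f"Macro precision = {macro_p:.2f}, recall = {macro_r:.2f}, F1 = {macro_f1:.3f}")
```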
---
## Deployment Options
### Streamlit Web App
- Can be deployed on **Streamlit Cloud, AWS, or Google Cloud**.
- Provides **real-time URL analysis** with a user-friendly interface.
### Browser Extension (Planned)
- **Real-time scanning** of visited web pages.
- **Dynamic threat alerts** with confidence scores.
### API Integration
- REST API for bulk URL analysis.
- Supports **Security Operations Centers (SOC)**.
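
A bulk-analysis endpoint could be organized roughly as sketched below. The JSON request and response shapes, the function names, and the stub classifier are all hypothetical; the card does not specify the API. The handler is framework-agnostic, so it can sit behind any web server, and the default batch size of 16 simply mirrors the training batch size.

```python
import json
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def handle_bulk_request(
    body: str,
    classify_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> str:
    """Parse {"urls": [...]}, classify in batches, return {"results": [...]} as JSON."""
    urls = json.loads(body)["urls"]
    results = []
    for batch in chunked(urls, batch_size):
        labels = classify_batch(batch)
        results.extend({"url": u, "label": l} for u, l in zip(batch, labels))
    return json.dumps({"results": results})


def stub_classifier(urls: List[str]) -> List[str]:
    """Placeholder only: a real deployment would run the model here."""
    return ["Benign" for _ in urls]


if __name__ == "__main__":
    request_body = json.dumps({"urls": ["http://example.com/login", "http://example.org"]})
    print(handle_bulk_request(request_body, stub_classifier))
```

Passing the classifier in as a callable keeps the request handling testable without loading the model.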
---
## Limitations & Bias
- **May misclassify complex phishing URLs** that mimic legitimate sites.
- **Needs regular updates** to counter evolving threats.
- **Potential bias** if future threats are not represented in training data.
---
## Training Data & Citation
### Data Source
Dataset sourced from **Kaggle Malicious URLs Dataset**:
[Dataset Link](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)
### BibTeX Citation
```
@article{maliciousurl2025,
  author      = {Gleyzie Tongo and Dr. Farnaz Farid and Dr. Ala Al-Areqi and Dr. Farhad Ahamed},
  title       = {Fine-Tuned BERT for Malicious URL Detection},
  year        = {2025},
  institution = {Western Sydney University}
}
```
---
## Contact
For inquiries, collaborations, or feedback, feel free to reach out via LinkedIn:
[Gleyzie Tongo](https://www.linkedin.com/in/gleyzie-tongo-83b454218/)