r3ddkahili committed on
Commit
84aa01c
·
verified ·
1 Parent(s): d6d0090

Upload README.md

  1. README.md +202 -3
README.md CHANGED
@@ -1,3 +1,202 @@
1
- ---
2
- license: mit
3
- ---
1
+ # Malicious URL Detection Model
2
+
3
+ > A fine-tuned **BERT-LoRA** model for detecting malicious URLs, including phishing, malware, and defacement threats.
4
+
5
+ ## Model Description
6
+
7
+ This model is a **fine-tuned BERT-based classifier** designed to detect **malicious URLs** in real time. It applies **Low-Rank Adaptation (LoRA)** for efficient fine-tuning, which reduces computational cost while maintaining high accuracy.
8
+
9
+ The model classifies URLs into **four categories**:
10
+
11
+ - ✅ **Benign**
12
+ - 🔴 **Defacement**
13
+ - ⚠️ **Phishing**
14
+ - 🛑 **Malware**
15
+
16
+ It achieves **98% validation accuracy** and an **F1-score of 0.965**.
17
+
18
+ ---
19
+
20
+ ## Intended Uses & Limitations
21
+
22
+ ### Use Cases
23
+
24
+ - ✔️ Real-time URL classification for cybersecurity tools
+ - ✔️ Phishing and malware detection for online safety
+ - ✔️ Integration into browser extensions for instant threat alerts
+ - ✔️ Security monitoring for Security Operations Centers (SOCs)
25
+
26
+ ### Limitations
27
+
28
+ - ⚠️ May **misclassify short or obfuscated URLs**
+ - ⚠️ Performance may degrade with **dynamic domain structures**
+ - ⚠️ Requires **frequent retraining** to adapt to evolving threats
29
+
30
+ ---
31
+
32
+ ## Model Details
33
+
34
+ - **Model Type:** BERT-based URL Classifier
35
+ - **Fine-Tuning Method:** LoRA (Low-Rank Adaptation)
36
+ - **Base Model:** `bert-base-uncased`
37
+ - **Number of Parameters:** 110M
38
+ - **Dataset:** Kaggle Malicious URLs Dataset (651,191 samples)
39
+ - **Max Sequence Length:** `128`
40
+ - **Framework:** 🤗 `transformers`, `torch`, `peft`
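
To give a rough sense of why LoRA lowers fine-tuning cost, the sketch below counts the extra trainable parameters LoRA would add to a BERT-base-sized encoder. The rank (`r = 8`) and the choice to adapt only the query and value projections are assumptions for illustration; this card does not state the actual LoRA configuration.

```python
# Rough count of LoRA trainable parameters for a BERT-base-sized encoder.
# ASSUMPTIONS (not stated in this model card): rank r = 8, adapters on the
# query and value projections only.

HIDDEN_SIZE = 768            # hidden size of bert-base-uncased
NUM_LAYERS = 12              # encoder layers in bert-base-uncased
BASE_PARAMS = 110_000_000    # "110M" from the model details above


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


rank = 8                     # assumed rank
adapted_per_layer = 2        # assumed: query and value projections
per_matrix = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, rank)
total = per_matrix * adapted_per_layer * NUM_LAYERS
fraction = total / BASE_PARAMS

print(f"LoRA parameters per adapted matrix: {per_matrix:,}")
print(f"Total LoRA parameters: {total:,} ({fraction:.2%} of the base model)")
```

Under these assumptions the adapters amount to well under 1% of the base model's parameters. The classification head is trained as well and is not counted here.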
41
+
42
+ ---
43
+
44
+ ## How to Use
45
+
46
+ You can use this model directly with 🤗 **Transformers**:
47
+
48
+ ```python
49
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
50
+ import torch
51
+
52
+ # Load the model and tokenizer
53
+ model_name = "your-huggingface-model-name"
54
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
55
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
56
+
57
+ # Example URL
58
+ url = "http://example.com/login"
59
+
60
+ # Tokenize and predict
61
+ inputs = tokenizer(url, return_tensors="pt", truncation=True, padding=True, max_length=128)
62
+ with torch.no_grad():
63
+     outputs = model(**inputs)
64
+ prediction = torch.argmax(outputs.logits, dim=-1).item()
65
+
66
+ # Mapping prediction to labels
67
+ label_map = {0: "Benign", 1: "Defacement", 2: "Phishing", 3: "Malware"}
68
+ print(f"Prediction: {label_map[prediction]}")
69
+ ```
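
The example above returns only the top label. A caller that also wants a confidence score could apply a softmax to the logits, as in the minimal sketch below. The logits shown are made-up values, and the `threshold` of 0.8 and the `"Uncertain"` fallback are hypothetical choices, not part of this model.

```python
import math

# Label order copied from the usage example above.
LABEL_MAP = {0: "Benign", 1: "Defacement", 2: "Phishing", 3: "Malware"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, threshold=0.8):
    """Return (label, confidence); fall back to 'Uncertain' below the threshold.

    The threshold and the 'Uncertain' label are hypothetical.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = LABEL_MAP[best] if probs[best] >= threshold else "Uncertain"
    return label, probs[best]


# Made-up logits standing in for `outputs.logits[0].tolist()`.
example_logits = [0.2, -1.0, 3.1, 0.4]
label, confidence = decide(example_logits)
print(f"Prediction: {label} (confidence {confidence:.2f})")
```

Raising the threshold trades coverage for fewer confident mistakes, which may matter for the short or obfuscated URLs listed under Limitations.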
70
+
71
+ ---
72
+
73
+ ## Training Details
74
+
75
+ - **Batch Size:** `16`
76
+ - **Epochs:** `5`
77
+ - **Learning Rate:** `2e-5`
78
+ - **Optimizer:** AdamW with weight decay
79
+ - **Loss Function:** Weighted Cross-Entropy
80
+ - **Evaluation Strategy:** Epoch-based
81
+ - **Fine-Tuning Strategy:** LoRA applied to BERT layers
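
The card lists a weighted cross-entropy loss but not the weights themselves. The sketch below shows one common way such weights are derived (inverse class frequency) and how they scale the per-sample loss. The class counts are assumed for illustration: they sum to the 651,191 samples mentioned above but should be checked against the dataset, and the weighting formula is an assumption rather than the author's confirmed setup.

```python
import math

# ASSUMED class counts (illustrative; verify against the dataset).
CLASS_COUNTS = {
    "Benign": 428_103,
    "Defacement": 96_457,
    "Phishing": 94_111,
    "Malware": 32_520,
}


def balanced_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {name: n_samples / (n_classes * c) for name, c in counts.items()}


def weighted_cross_entropy(logits, target_index, weight):
    """Loss for one sample: -weight * log(softmax(logits)[target_index])."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return weight * (log_sum_exp - logits[target_index])


weights = balanced_weights(CLASS_COUNTS)
for name, w in weights.items():
    print(f"{name:<11} weight = {w:.3f}")

# The same wrong-ish prediction costs more on a rare class than a common one.
logits = [2.0, 0.0, 0.0, 0.5]
loss_benign = weighted_cross_entropy(logits, 0, weights["Benign"])
loss_malware = weighted_cross_entropy(logits, 3, weights["Malware"])
print(f"Benign sample loss: {loss_benign:.3f}, Malware sample loss: {loss_malware:.3f}")
```

With weights like these, errors on the rarer classes contribute more to the loss than errors on the dominant benign class.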
82
+
83
+ ---
84
+
85
+ ## Evaluation Results
86
+
87
+ | Metric | Value |
88
+ | ------------ | --------- |
89
+ | Accuracy | **98%** |
90
+ | Precision | **0.96** |
91
+ | Recall | **0.97** |
92
+ | **F1 Score** | **0.965** |
93
+
94
+ ### Category-wise Performance
95
+
96
+ | Category | Precision | Recall | F1-Score |
97
+ | -------------- | --------- | ------ | -------- |
98
+ | **Benign** | 0.98 | 0.99 | 0.985 |
99
+ | **Defacement** | 0.98 | 0.99 | 0.985 |
100
+ | **Phishing** | 0.93 | 0.94 | 0.935 |
101
+ | **Malware** | 0.95 | 0.96 | 0.955 |
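
As a consistency check, the F1 values can be recomputed from the precision and recall columns with F1 = 2PR / (P + R). The card does not say how the overall metrics were averaged; the sketch below assumes an unweighted (macro) average over the four categories, which happens to reproduce the reported figures.

```python
# Per-category (precision, recall) copied from the table above.
PER_CLASS = {
    "Benign": (0.98, 0.99),
    "Defacement": (0.98, 0.99),
    "Phishing": (0.93, 0.94),
    "Malware": (0.95, 0.96),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {name: f1_score(p, r) for name, (p, r) in PER_CLASS.items()}

# ASSUMPTION: overall metrics are unweighted (macro) averages.
n_classes = len(PER_CLASS)
macro_precision = sum(p for p, _ in PER_CLASS.values()) / n_classes
macro_recall = sum(r for _, r in PER_CLASS.values()) / n_classes
macro_f1 = sum(per_class_f1.values()) / n_classes

for name, value in per_class_f1.items():
    print(f"{name:<11} F1 = {value:.3f}")
print(f"Macro precision = {macro_precision:.2f}")
print(f"Macro recall    = {macro_recall:.2f}")
print(f"Macro F1        = {macro_f1:.3f}")
```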
102
+
103
+ ---
104
+
105
+ ## Deployment Options
106
+
107
+ ### 1️⃣ Streamlit Web App
108
+
109
+ - Deployed on **Streamlit Cloud, AWS, or Google Cloud**.
110
+ - Provides **real-time URL analysis** with a user-friendly interface.
111
+
112
+ ### 2️⃣ Browser Extension (Planned)
113
+
114
+ - **Real-time scanning** of visited web pages.
115
+ - **Dynamic threat alerts** with confidence scores.
116
+
117
+ ### 3️⃣ API Integration
118
+
119
+ - REST API for bulk URL analysis.
120
+ - Supports **Security Operations Centers (SOC)**.
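
For bulk analysis, a service would typically split incoming URLs into fixed-size batches before calling the model. The sketch below illustrates that step only. `classify_bulk`, `batched`, and the stub classifier are hypothetical names introduced here, and the batch size of 16 simply mirrors the training batch size; none of this describes an existing API for this model.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def batched(urls: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls` with at most `batch_size` items."""
    for start in range(0, len(urls), batch_size):
        yield list(urls[start:start + batch_size])


def classify_bulk(
    urls: Sequence[str],
    classify_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> Dict[str, str]:
    """Classify many URLs by calling `classify_batch` once per batch."""
    results: Dict[str, str] = {}
    for batch in batched(urls, batch_size):
        labels = classify_batch(batch)
        results.update(zip(batch, labels))
    return results


def stub_classifier(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for the model: flags URLs containing 'login'."""
    return ["Phishing" if "login" in url else "Benign" for url in batch]


urls = [f"http://example.com/page{i}" for i in range(40)]
urls.append("http://example.com/login")
report = classify_bulk(urls, stub_classifier)
print(f"Classified {len(report)} URLs in batches of 16")
```

In a real service, `stub_classifier` would be replaced by a function that tokenizes the batch and runs the model as in the usage example.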
121
+
122
+ ---
123
+
124
+ ## Limitations & Bias
125
+
126
+ - **May misclassify complex phishing URLs** that mimic legitimate sites.
127
+ - **Needs regular updates** to counter evolving threats.
128
+ - **Potential bias** if future threats are not represented in training data.
129
+
130
+ ---
131
+
132
+ ## Training Data & Citation
133
+
134
+ ### Data Source
135
+
136
+ Dataset sourced from the **Kaggle Malicious URLs Dataset**: 📌 [Dataset Link](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)
137
+
138
+ ### BibTeX Citation
139
+
140
+ ```
141
+ @article{maliciousurl2025,
142
+ author = {Your Name},
143
+ title = {Fine-Tuned BERT for Malicious URL Detection},
144
+ year = {2025},
145
+ journal = {Cybersecurity AI Research},
146
+ url = {https://your-research-paper-link.com}
147
+ }
148
+ ```
149
+
150
+ ---
151
+
152
+ ## Future Work
153
+
154
+ 🚀 **Improvements Planned:**
155
+
156
+ - **Better phishing URL detection** via adversarial training.
157
+ - **Deploying as a real-time browser extension.**
158
+ - **Integration with VirusTotal for enhanced threat intelligence.**
159
+ - **Expanding detection to identify zero-day threats.**
160
+
161
+ ---
162
+
163
+ ## Uploading to Hugging Face
164
+
165
+ To upload this model to **Hugging Face**, follow these steps:
166
+
167
+ ```bash
168
+ pip install transformers huggingface_hub
169
+ ```
170
+
171
+ ```python
172
+ from huggingface_hub import create_repo
173
+ create_repo("your-huggingface-model-name")
174
+ ```
175
+
176
+ ```python
177
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
178
+ from huggingface_hub import HfApi
179
+
180
+ model_name = "your-huggingface-model-name"
181
+ model = AutoModelForSequenceClassification.from_pretrained("your-local-model-directory")
182
+ tokenizer = AutoTokenizer.from_pretrained("your-local-model-directory")
183
+
184
+ # Save & Push Model
185
+ model.save_pretrained(f"{model_name}")
186
+ tokenizer.save_pretrained(f"{model_name}")
187
+
188
+ api = HfApi()
189
+ api.upload_folder(
190
+     folder_path=f"{model_name}",
191
+     repo_id=f"your-huggingface-username/{model_name}",
192
+     repo_type="model",
193
+ )
194
+ ```
195
+
196
+ ---
197
+
198
+ ## Conclusion
199
+
200
+ The **Malicious URL Detection Model** detects **phishing, malware, and defacement threats** with **98% validation accuracy** and an **F1-score of 0.965**. It is optimized for **real-time cybersecurity applications** and **deployed using Streamlit**.
201
+
202
+ - ✅ **Final F1-score: 0.965**
+ - ✅ **Optimized for real-time detection**
+ - ✅ **Suited to API deployment, with a browser extension planned**