---
license: mit
language:
- en
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
library_name: transformers
tags:
- code
- cyber
---
# URLGuardian
URLGuardian is a Transformers model fine-tuned for malicious URL detection. Given a fully qualified domain name (FQDN) or URL, it outputs the probability that the URL is malicious by identifying common suspicious patterns.
## Model Details
### Model Description
- **Developed by:** Anvilogic
- **Model Type:** Transformer
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Finetuned from model:** [distilbert](https://huggingface.co/distilbert/distilbert-base-cased)
- **Language(s) (NLP):** Multilingual
- **License:** MIT
### Full Model Architecture
```
DistilBERT:
  name: "distilbert-base-cased"
  params:
    layers: 6
    hidden_size: 768
    attention_heads: 12
    ff_dim: 3072
    max_seq_len: 512
    vocab_size: 28996
    total_params: 66M
    activation: "gelu"
```
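The values above can be cross-checked against the configuration shipped with the checkpoint; a minimal sketch using the standard `AutoConfig` API (assuming the checkpoint exposes a `DistilBertConfig`, whose field names are used below):
```python
from transformers import AutoConfig

# Configuration published with the URLGuardian checkpoint
config = AutoConfig.from_pretrained("Anvilogic/URLGuardian")

print(config.n_layers)                 # transformer layers
print(config.dim)                      # hidden size
print(config.n_heads)                  # attention heads
print(config.hidden_dim)               # feed-forward dimension
print(config.max_position_embeddings)  # maximum sequence length
print(config.vocab_size)               # vocabulary size
print(config.activation)               # activation function
```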
## Usage
### Direct Usage
First install the Transformers library:
```bash
pip install -U transformers
```
Then you can load this model and run inference.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and tokenizer
model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example URLs
urls = ["paypal.com.secure-login.xyz", "bit.ly/fake-login"]

# Tokenize inputs
inputs = tokenizer(urls, padding=True, truncation=True, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits                     # Raw predictions
predictions = torch.argmax(logits, dim=-1)  # Convert to class labels

# Print results
print(predictions.tolist())  # Example output: [1, 0] (assuming 1 = malicious, 0 = benign)
```
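Since the model card describes a probability of maliciousness, the raw logits from the snippet above can be converted to class probabilities with a softmax; a minimal continuation (treating index 1 as the malicious class is an assumption):
```python
# Continues from the snippet above: `logits` and `urls` are already defined
probs = torch.softmax(logits, dim=-1)

# Probability assigned to the assumed "malicious" class (index 1)
for url, p in zip(urls, probs[:, 1].tolist()):
    print(f"{url}: {p:.4f}")
```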
### Downstream Usage
This model enables real-time malicious URL detection with a lightweight encoder, supporting large-scale inference for phishing prevention and cybersecurity monitoring.
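For batch scoring in monitoring pipelines, the high-level `pipeline` API is a convenient wrapper; a minimal sketch (the label strings returned depend on the checkpoint's `id2label` mapping, which is not documented here, and `https://www.wikipedia.org` is just an illustrative benign example):
```python
from transformers import pipeline

# Text-classification pipeline backed by URLGuardian
classifier = pipeline("text-classification", model="Anvilogic/URLGuardian")

urls = [
    "paypal.com.secure-login.xyz",
    "https://www.wikipedia.org",
]

# Score a list of URLs in one call; batch_size trades latency for throughput
for result in classifier(urls, batch_size=32, truncation=True):
    print(result)  # e.g. {'label': ..., 'score': ...}
```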
## Training Details
### Framework Versions
- Python: 3.10.14
- Transformers: 4.49.0
- PyTorch: 2.2.2
- Tokenizers: 0.20.3
### Training Data
The model was fine-tuned on [Anvilogic/URL-Guardian-Dataset](https://huggingface.co/datasets/Anvilogic/URL-Guardian-Dataset), which contains URLs together with their labels.
The dataset was filtered and converted to the parquet format for efficient processing.
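The dataset can be pulled directly from the Hub with the `datasets` library; a minimal sketch (the presence of a `train` split is an assumption, see the dataset card for the actual schema):
```python
from datasets import load_dataset

# Load the URL-Guardian dataset from the Hugging Face Hub
dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")

# Inspect splits, columns, and a sample record before training
print(dataset)
print(dataset["train"][0])  # assumes a "train" split
```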
### Training Procedure
The model was optimized using [BCELoss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html); a fine-tuning sketch wiring the hyperparameters below into a standard `Trainer` setup follows the list.
#### Training Hyperparameters
- **Model Architecture**: encoder fine-tuned from [distilbert](https://huggingface.co/distilbert/distilbert-base-cased)
- **Batch Size**: 32
- **Epochs**: 3
- **Learning Rate**: 2e-5
- **Warmup Steps**: 100
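A minimal fine-tuning sketch with these hyperparameters, using the standard `Trainer` API, is shown below. It is illustrative only: the `url`/`label` column names and the `train` split are assumptions, and `Trainer`'s default loss for a two-label head is cross-entropy rather than the BCELoss named above, so the original training code presumably used a custom loss.
```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Base checkpoint as linked in the list above
base_model = "distilbert/distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")

def tokenize(batch):
    # Assumes a "url" text column; check the dataset card for the actual schema
    return tokenizer(batch["url"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Hyperparameters as documented above
args = TrainingArguments(
    output_dir="urlguardian-finetune",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],  # assumes a "train" split
    tokenizer=tokenizer,               # enables dynamic padding via the default collator
)
trainer.train()
```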
## Evaluation
In the final evaluation after training, the model achieved the following metrics on the test set:
**Binary Classification Evaluator**
```
Accuracy : 0.9744
F1 Score : 0.9742
Precision : 0.9771
Recall : 0.9712
Average Precision : 0.9962
```
These results indicate the model's high performance in identifying malicious URLs, with strong precision and recall that make it well suited for cybersecurity applications.
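For reference, metrics of this form can be recomputed on held-out data with `scikit-learn`; a minimal sketch (a hypothetical helper, not the original evaluation code; `y_score` is the predicted probability of the malicious class):
```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
)

def report_metrics(y_true, y_pred, y_score):
    """Print the binary-classification metrics reported above."""
    print(f"Accuracy          : {accuracy_score(y_true, y_pred):.4f}")
    print(f"F1 Score          : {f1_score(y_true, y_pred):.4f}")
    print(f"Precision         : {precision_score(y_true, y_pred):.4f}")
    print(f"Recall            : {recall_score(y_true, y_pred):.4f}")
    print(f"Average Precision : {average_precision_score(y_true, y_score):.4f}")
```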