|
--- |
|
license: mit |
|
language: |
|
- en |
|
base_model: |
|
- distilbert/distilbert-base-multilingual-cased |
|
pipeline_tag: text-classification |
|
library_name: transformers |
|
tags: |
|
- code |
|
- cyber |
|
--- |
|
|
|
|
|
# URLGuardian
|
|
|
This is a Transformers model fine-tuned for malicious URL detection. Given a URL or fully qualified domain name (FQDN), it outputs the probability that the input is malicious by identifying common suspicious patterns.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
- **Developed by:** Anvilogic |
|
- **Model Type:** Transformer |
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Hidden Size:** 768
|
- **Fine-tuned from model:** [distilbert-base-cased](https://huggingface.co/distilbert/distilbert-base-cased)
|
- **Language(s) (NLP):** Multilingual |
|
- **License:** MIT |
|
|
|
### Full Model Architecture |
|
|
|
```
DistilBERT:
  name: "distilbert-base-cased"
  params:
    layers: 6
    hidden_size: 768
    attention_heads: 12
    ff_dim: 3072
    max_seq_len: 512
    vocab_size: 28996
    total_params: 66M
    activation: "gelu"
```
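These values can be cross-checked against the configuration shipped with the checkpoint. A quick sketch; the attribute names are the standard `DistilBertConfig` fields:

```python
from transformers import AutoConfig

# Pull the configuration stored with the fine-tuned checkpoint
config = AutoConfig.from_pretrained("Anvilogic/URLGuardian")

print(config.n_layers)                 # transformer layers
print(config.dim)                      # hidden size
print(config.n_heads)                  # attention heads
print(config.hidden_dim)               # feed-forward dimension
print(config.max_position_embeddings)  # maximum sequence length
print(config.vocab_size)               # vocabulary size
```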
|
|
|
## Usage |
|
|
|
### Direct Usage |
|
|
|
First install the Transformers library: |
|
|
|
```bash |
|
pip install -U transformers |
|
``` |
|
Then you can load this model and run inference. |
|
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer
model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example URLs
urls = ["paypal.com.secure-login.xyz", "bit.ly/fake-login"]

# Tokenize inputs
inputs = tokenizer(urls, padding=True, truncation=True, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits                     # Raw prediction scores
predictions = torch.argmax(logits, dim=-1)  # Convert to class indices

# Print results
print(predictions.tolist())  # Example output: [1, 0] (assuming 1 = malicious, 0 = benign; see model.config.id2label)
```
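Since the model is described as returning the probability that a URL is malicious, the logits can be converted to class probabilities with a softmax. The snippet below is a minimal sketch; it assumes the label names stored in `model.config.id2label` describe the two classes, as the exact names are not documented in this card.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

urls = ["paypal.com.secure-login.xyz", "bit.ly/fake-login"]
inputs = tokenizer(urls, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax turns raw logits into per-class probabilities
probs = torch.softmax(logits, dim=-1)

for url, row in zip(urls, probs.tolist()):
    # model.config.id2label maps class indices to the label names stored with the checkpoint
    scores = {model.config.id2label[i]: round(p, 4) for i, p in enumerate(row)}
    print(url, scores)
```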
|
### Downstream Usage |
|
Being a lightweight model, it supports real-time malicious URL detection and large-scale inference for phishing prevention and cybersecurity monitoring.
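For batch scoring at scale, the high-level `pipeline` API wraps the same tokenizer and model in a single call. This is a sketch; the `batch_size` value is an illustrative choice, not a setting prescribed by the model.

```python
from transformers import pipeline

# Text-classification pipeline around the fine-tuned checkpoint
url_classifier = pipeline(
    "text-classification",
    model="Anvilogic/URLGuardian",
    batch_size=32,  # illustrative; tune for your hardware
)

urls = ["paypal.com.secure-login.xyz", "bit.ly/fake-login", "example.com"]
for result in url_classifier(urls, truncation=True):
    print(result)  # e.g. {'label': ..., 'score': ...}
```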
|
## Training Details |
|
|
|
### Framework Versions |
|
- Python: 3.10.14 |
|
- Transformers: 4.49.0 |
|
- PyTorch: 2.2.2 |
|
- Tokenizers: 0.20.3 |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned on [Anvilogic/URL-Guardian-Dataset](https://huggingface.co/datasets/Anvilogic/URL-Guardian-Dataset), which contains URLs along with their labels.
|
The dataset was filtered and converted to the parquet format for efficient processing. |
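The dataset can be pulled straight from the Hub with the `datasets` library. A short sketch; the split name below is an assumption, so inspect the printed `DatasetDict` to see what is actually available.

```python
from datasets import load_dataset

# Load the URL-Guardian dataset (served as parquet) from the Hugging Face Hub
ds = load_dataset("Anvilogic/URL-Guardian-Dataset")

print(ds)              # shows the available splits and columns
print(ds["train"][0])  # assumes a "train" split exists
```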
|
|
|
### Training Procedure |
|
The model was optimized using [BCELoss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html).
|
|
|
#### Training Hyperparameters |
|
- **Model Architecture**: encoder fine-tuned from [distilbert](https://huggingface.co/distilbert/distilbert-base-cased) |
|
- **Batch Size**: 32 |
|
- **Epochs**: 3 |
|
- **Learning Rate**: 2e-5 |
|
- **Warmup Steps**: 100 |
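
Below is a minimal fine-tuning sketch with the Hugging Face `Trainer` using the hyperparameters above. The dataset column names (`url`, `label`) and the train split are assumptions, and `Trainer` applies its default cross-entropy loss for sequence classification rather than the `BCELoss` mentioned earlier, so treat this as an approximation of the procedure rather than the exact training script.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base_model = "distilbert/distilbert-base-cased"  # base checkpoint referenced in this card
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Assumed column names: "url" for the text, "label" for the binary target
ds = load_dataset("Anvilogic/URL-Guardian-Dataset")

def tokenize(batch):
    return tokenizer(batch["url"], truncation=True, max_length=512)

ds = ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="url-guardian-finetune",
    per_device_train_batch_size=32,  # Batch Size: 32
    num_train_epochs=3,              # Epochs: 3
    learning_rate=2e-5,              # Learning Rate: 2e-5
    warmup_steps=100,                # Warmup Steps: 100
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],       # assumes a "train" split
    processing_class=tokenizer,      # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```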
|
|
|
|
|
## Evaluation |
|
|
|
In the final evaluation after training, the model achieved the following metrics on the test set: |
|
|
|
**Binary Classification Evaluator** |
|
```json
{
  "Accuracy": 0.9744,
  "F1 Score": 0.9742,
  "Precision": 0.9771,
  "Recall": 0.9712,
  "Average Precision": 0.9962
}
```
|
These results indicate the model's strong performance in identifying malicious URLs, with precision and recall scores that make it well suited for cybersecurity applications.
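
For reference, the figures above correspond to the standard scikit-learn metric definitions. The snippet below is an illustrative sketch of how they can be recomputed from held-out predictions; the toy arrays are placeholders, not the actual test set.

```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
)

# Placeholder arrays: y_true are ground-truth labels (1 = malicious, 0 = benign, assumed convention),
# y_score is the predicted probability of the positive class, y_pred the thresholded label.
y_true = [1, 0, 1, 1, 0]
y_score = [0.92, 0.08, 0.77, 0.65, 0.31]
y_pred = [int(s >= 0.5) for s in y_score]

print("Accuracy          :", accuracy_score(y_true, y_pred))
print("F1 Score          :", f1_score(y_true, y_pred))
print("Precision         :", precision_score(y_true, y_pred))
print("Recall            :", recall_score(y_true, y_pred))
print("Average Precision :", average_precision_score(y_true, y_score))
```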