---
license: mit
language:
- en
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
library_name: transformers
tags:
- code
- cyber
---
# URLGuardian
URLGuardian is a Transformers model fine-tuned for malicious URL detection. Given a fully qualified domain name (FQDN) or URL, it outputs the probability that the URL is malicious by identifying common suspicious patterns.
## Model Details
### Model Description
- **Developed by:** Anvilogic
- **Model Type:** Transformer
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Finetuned from model:** [distilbert](https://huggingface.co/distilbert/distilbert-base-cased)
- **Language(s) (NLP):** Multilingual
- **License:** MIT
### Full Model Architecture
```
DistilBERT:
  name: "distilbert-base-cased"
  params:
    layers: 6
    hidden_size: 768
    attention_heads: 12
    ff_dim: 3072
    max_seq_len: 512
    vocab_size: 28996
    total_params: 66M
    activation: "gelu"
```
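The values above can be cross-checked against the configuration shipped with the checkpoint; a minimal sketch using the standard `AutoConfig` API (assuming the checkpoint exposes a `DistilBertConfig`, whose field names are used below):
```python
from transformers import AutoConfig

# Configuration published with the URLGuardian checkpoint
config = AutoConfig.from_pretrained("Anvilogic/URLGuardian")

print(config.n_layers)                 # transformer layers
print(config.dim)                      # hidden size
print(config.n_heads)                  # attention heads
print(config.hidden_dim)               # feed-forward dimension
print(config.max_position_embeddings)  # maximum sequence length
print(config.vocab_size)               # vocabulary size
print(config.activation)               # activation function
```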
## Usage
### Direct Usage
First install the Transformers library:
```bash
pip install -U transformers
```
Then you can load this model and run inference.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and tokenizer
model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example URLs
urls = ["paypal.com.secure-login.xyz", "bit.ly/fake-login"]

# Tokenize inputs
inputs = tokenizer(urls, padding=True, truncation=True, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits                     # Raw predictions
predictions = torch.argmax(logits, dim=-1)  # Convert to class labels

# Print results
print(predictions.tolist())  # Example output: [1, 0] (assuming 1 = malicious, 0 = benign)
```
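Since the model card describes a probability of maliciousness, the raw logits from the snippet above can be converted to class probabilities with a softmax; a minimal continuation (treating index 1 as the malicious class is an assumption):
```python
# Continues from the snippet above: `logits` and `urls` are already defined
probs = torch.softmax(logits, dim=-1)

# Probability assigned to the assumed "malicious" class (index 1)
for url, p in zip(urls, probs[:, 1].tolist()):
    print(f"{url}: {p:.4f}")
```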
### Downstream Usage
This model enables real-time malicious URL detection with a lightweight encoder, supporting large-scale inference for phishing prevention and cybersecurity monitoring.
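For batch scoring in monitoring pipelines, the high-level `pipeline` API is a convenient wrapper; a minimal sketch (the label strings returned depend on the checkpoint's `id2label` mapping, which is not documented here, and `https://www.wikipedia.org` is just an illustrative benign example):
```python
from transformers import pipeline

# Text-classification pipeline backed by URLGuardian
classifier = pipeline("text-classification", model="Anvilogic/URLGuardian")

urls = [
    "paypal.com.secure-login.xyz",
    "https://www.wikipedia.org",
]

# Score a list of URLs in one call; batch_size trades latency for throughput
for result in classifier(urls, batch_size=32, truncation=True):
    print(result)  # e.g. {'label': ..., 'score': ...}
```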
## Training Details
### Framework Versions
- Python: 3.10.14
- Transformers: 4.49.0
- PyTorch: 2.2.2
- Tokenizers: 0.20.3
### Training Data
The model was fine-tuned on [Anvilogic/URL-Guardian-Dataset](https://huggingface.co/datasets/Anvilogic/URL-Guardian-Dataset), which contains URLs together with their labels.
The dataset was filtered and converted to the parquet format for efficient processing.
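The dataset can be pulled directly from the Hub with the `datasets` library; a minimal sketch (the presence of a `train` split is an assumption, see the dataset card for the actual schema):
```python
from datasets import load_dataset

# Load the URL-Guardian dataset from the Hugging Face Hub
dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")

# Inspect splits, columns, and a sample record before training
print(dataset)
print(dataset["train"][0])  # assumes a "train" split
```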
### Training Procedure
The model was optimized using [BCELoss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html); a fine-tuning sketch wiring the hyperparameters below into a standard `Trainer` setup follows the list.
#### Training Hyperparameters
- **Model Architecture**: encoder fine-tuned from [distilbert](https://huggingface.co/distilbert/distilbert-base-cased)
- **Batch Size**: 32
- **Epochs**: 3
- **Learning Rate**: 2e-5
- **Warmup Steps**: 100
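A minimal fine-tuning sketch with these hyperparameters, using the standard `Trainer` API, is shown below. It is illustrative only: the `url`/`label` column names and the `train` split are assumptions, and `Trainer`'s default loss for a two-label head is cross-entropy rather than the BCELoss named above, so the original training code presumably used a custom loss.
```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Base checkpoint as linked in the list above
base_model = "distilbert/distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")

def tokenize(batch):
    # Assumes a "url" text column; check the dataset card for the actual schema
    return tokenizer(batch["url"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Hyperparameters as documented above
args = TrainingArguments(
    output_dir="urlguardian-finetune",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],  # assumes a "train" split
    tokenizer=tokenizer,               # enables dynamic padding via the default collator
)
trainer.train()
```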
## Evaluation
In the final evaluation after training, the model achieved the following metrics on the test set:
**Binary Classification Evaluator**
```
Accuracy : 0.9744
F1 Score : 0.9742
Precision : 0.9771
Recall : 0.9712
Average Precision : 0.9962
```
These results indicate the model's high performance in identifying malicious URLs, with strong precision and recall that make it well suited for cybersecurity applications.
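For reference, metrics of this form can be recomputed on held-out data with `scikit-learn`; a minimal sketch (a hypothetical helper, not the original evaluation code; `y_score` is the predicted probability of the malicious class):
```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
)

def report_metrics(y_true, y_pred, y_score):
    """Print the binary-classification metrics reported above."""
    print(f"Accuracy          : {accuracy_score(y_true, y_pred):.4f}")
    print(f"F1 Score          : {f1_score(y_true, y_pred):.4f}")
    print(f"Precision         : {precision_score(y_true, y_pred):.4f}")
    print(f"Recall            : {recall_score(y_true, y_pred):.4f}")
    print(f"Average Precision : {average_precision_score(y_true, y_score):.4f}")
```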