---
language:
- en
base_model:
- CrabInHoney/urlbert-tiny-base-v3
pipeline_tag: text-classification
tags:
- url
- cybersecurity
- urls
- links
- classification
- phishing-detection
- tiny
- phishing
- malware
- defacement
- transformers
- urlbert
- bert
- malicious
license: apache-2.0
new_version: CrabInHoney/urlbert-tiny-v4-malicious-url-classifier
---

# URLBERT-Tiny-v3 Malicious URL Classifier

This is a lightweight BERT model fine-tuned to classify URLs into four categories: benign, phishing, malware, and defacement.

## Model Details

- **Model size**: 3.69M parameters
- **Tensor type**: F32
- **Model weight size**: 14.8 MB
- **Base model**: [CrabInHoney/urlbert-tiny-base-v3](https://huggingface.co/CrabInHoney/urlbert-tiny-base-v3)
- **Dataset**: [Malicious URLs Dataset](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)

## Model Evaluation Results

The model was evaluated on a test set with the following classification metrics:

| Class            | Precision | Recall   | F1-Score |
|------------------|-----------|----------|----------|
| Benign           | 0.987695  | 0.993717 | 0.990697 |
| Defacement       | 0.988510  | 0.998963 | 0.993709 |
| Malware          | 0.988291  | 0.960332 | 0.974111 |
| Phishing         | 0.958425  | 0.930826 | 0.944423 |
| **Accuracy**     | 0.983738  | 0.983738 | 0.983738 |
| **Macro Avg**    | 0.980730  | 0.970959 | 0.975735 |
| **Weighted Avg** | 0.983615  | 0.983738 | 0.983627 |

## Usage Example

Below is an example of how to classify URLs with the model using the Hugging Face `transformers` library:

```python
from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
import torch

# Select the device (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load the model and tokenizer
model_name = "CrabInHoney/urlbert-tiny-v3-malicious-url-classifier"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.to(device)

# Create the classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    # Return scores for all four classes
    # (newer transformers releases prefer top_k=None for this)
    return_all_scores=True
)

# Example URLs to test
test_urls = [
    "wikiobits.com/Obits/TonyProudfoot",
    "http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb",
]

# Map raw labels to readable class names
label_mapping = {
    "LABEL_0": "benign",
    "LABEL_1": "defacement",
    "LABEL_2": "malware",
    "LABEL_3": "phishing"
}

# Classify each URL and print the score for every class
for url in test_urls:
    results = classifier(url)
    print(f"\nURL: {url}")
    for result in results[0]:
        label = result['label']
        score = result['score']
        friendly_label = label_mapping.get(label, label)
        print(f"Class: {friendly_label}, probability: {score:.4f}")
```

### Example Output:

```
URL: wikiobits.com/Obits/TonyProudfoot
Class: benign, probability: 0.9953
Class: defacement, probability: 0.0000
Class: malware, probability: 0.0000
Class: phishing, probability: 0.0046

URL: http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb
Class: benign, probability: 0.0000
Class: defacement, probability: 0.0001
Class: malware, probability: 0.9998
Class: phishing, probability: 0.0001
```
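
### Reducing the scores to a single verdict

The usage example prints a score for every class. When only one verdict per URL is needed, the per-class scores can be reduced to the highest-scoring class. The following is a minimal sketch under stated assumptions: the input is a list of `{"label", "score"}` dicts in the shape the pipeline returns in the usage example; the helper name `top_prediction`, the `threshold` parameter, its 0.5 default, and the `"uncertain"` fallback are hypothetical illustrations and are not part of this model or of `transformers`.

```python
# Hypothetical helper for illustration; not part of transformers or this model.
LABEL_MAPPING = {
    "LABEL_0": "benign",
    "LABEL_1": "defacement",
    "LABEL_2": "malware",
    "LABEL_3": "phishing",
}


def top_prediction(scores, threshold=0.5):
    """Return (class_name, score) for the highest-scoring class.

    `scores` is a list of {"label": ..., "score": ...} dicts for one URL.
    If the best score falls below `threshold` (an assumed, untuned default),
    the class is reported as "uncertain" instead.
    """
    best = max(scores, key=lambda item: item["score"])
    if best["score"] < threshold:
        return "uncertain", best["score"]
    # Fall back to the raw label if it is not in the mapping.
    return LABEL_MAPPING.get(best["label"], best["label"]), best["score"]


# Scores copied from the example output for the second URL.
example_scores = [
    {"label": "LABEL_0", "score": 0.0000},
    {"label": "LABEL_1", "score": 0.0001},
    {"label": "LABEL_2", "score": 0.9998},
    {"label": "LABEL_3", "score": 0.0001},
]
print(top_prediction(example_scores))  # ('malware', 0.9998)
```

Inside the classification loop this could be called as `top_prediction(results[0])`. A suitable threshold depends on the application and would need to be chosen on validation data; the table above suggests phishing is the class where a cautious threshold matters most, since it has the lowest recall.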