---
library_name: transformers
tags:
- cybersecurity
- mpnet
- embeddings
- classification
license: apache-2.0
language:
- en
base_model:
- microsoft/mpnet-base
---

# MPNet (Cyber) - MPNet Fine-Tuned for Cybersecurity Group Classification

This MPNet model was fine-tuned specifically for classifying cybersecurity threat groups based on textual descriptions from cybersecurity reports.

## Model Details

### Model Description

This model is based on `microsoft/mpnet-base` and fine-tuned using Masked Language Modeling (MLM) and supervised classification on cybersecurity threat intelligence descriptions, primarily focused on known threat actor groups.

### Model Information

- **Base Model:** microsoft/mpnet-base
- **Tasks:** Text classification, embedding generation
- **Language:** English

## Intended Use

### Primary Use

This model generates specialized embeddings that are useful for:

- Identifying cybersecurity threat actor groups from textual descriptions
- Cybersecurity threat intelligence analysis
- Embedding-based retrieval tasks in cybersecurity contexts
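
As a minimal sketch of the embedding-based retrieval use case, the snippet below ranks stored vectors by cosine similarity to a query vector. The three-dimensional vectors and the `report_*` labels are made up for illustration; in practice the vectors would be embeddings produced by this model, and the helper names are hypothetical rather than part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, corpus):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real ones would come from the model.
corpus = {
    "report_a": [0.9, 0.1, 0.0],
    "report_b": [0.0, 1.0, 0.0],
    "report_c": [0.7, 0.7, 0.1],
}
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, corpus)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Cosine similarity is a common choice for comparing sentence embeddings because it ignores vector length; whether it is the best metric for this model's embeddings is not stated in this card.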

### Out-of-Scope Use

This model is not intended for general language tasks outside cybersecurity contexts.

## Performance Evaluation

The model was benchmarked against the original MPNet and three cybersecurity-specific language models (SecBERT, ATTACK-BERT, and SecureBERT). MPNet (Cyber) improves classification accuracy over the original MPNet by about 17 percentage points (72.74% vs. 55.73%) but trails SecBERT and ATTACK-BERT:

| Model          | Classification Accuracy | Embedding Variability |
|----------------|-------------------------|-----------------------|
| Original MPNet | 55.73%                  | 0.0798                |
| SecBERT        | 91.67%                  | 0.5911                |
| ATTACK-BERT    | 83.51%                  | 0.0960                |
| MPNet (Cyber)  | 72.74%                  | 0.1239                |
| SecureBERT     | 49.31%                  | 0.0071                |

### Downstream Tasks

- Attribution of cybersecurity incidents
- Automated analysis of threat intelligence reports
- Embeddings for cybersecurity threat detection

### Limitations

- Best suited for English-language cybersecurity contexts
- May require further fine-tuning for highly specific tasks

## Usage

To generate embeddings with `transformers`:

```python
from transformers import AutoTokenizer, MPNetModel
import torch

model_id = "selfconstruct3d/mpnet-classification-finetuned-cyber-groups"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = MPNetModel.from_pretrained(model_id)
model.eval()

inputs = tokenizer("APT38 uses ransomware for financial gains.", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Mean-pool the token embeddings into a single sentence vector
embeddings = outputs.last_hidden_state.mean(dim=1)
```

or

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("selfconstruct3d/mpnet-classification-finetuned-cyber-groups")

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)  # one embedding vector per sentence
print(embeddings)
```
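
The `transformers` example above averages over every token position, which is fine for a single sentence but would also average in padding vectors once several texts of different lengths are batched with padding. A common alternative, not confirmed to be what the author used, is to weight the average by the tokenizer's attention mask. The sketch below shows that logic on plain Python lists with made-up numbers so it runs without downloading the model; `masked_mean_pool` is a hypothetical helper name.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, skipping positions where the mask is 0.

    token_embeddings: [batch][tokens][dim] nested lists of floats
    attention_mask:   [batch][tokens] nested lists of 0/1
    Assumes every sequence has at least one unmasked token.
    """
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        kept = [vec for vec, m in zip(tokens, mask) if m == 1]
        dim = len(tokens[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled

# Toy batch: two sequences, three token slots, two dimensions.
# The last slot of the first sequence is padding and must not affect the result.
token_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
]
attention_mask = [
    [1, 1, 0],
    [1, 1, 1],
]

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)
```

With PyTorch tensors the same computation is usually written by multiplying `last_hidden_state` by the attention mask expanded to the hidden dimension, summing over the token axis, and dividing by the per-sequence mask sum.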

## Training Details

### Training Data

The model was fine-tuned on descriptions of threat actor activities sourced from cybersecurity reports, including MITRE ATT&CK techniques.

### Hyperparameters

- **Epochs:** 10 (MLM), 20 (classification)
- **Batch size:** 16
- **Learning rate:** 5e-6 (MLM), 2e-6 (classification)
- **Hardware:** GPU (CUDA-enabled)
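
For readers who want to reproduce a similar setup, the hyperparameters above can be written down as a small configuration object. This is only a sketch: it assumes the MLM and classification objectives were run as two separate stages and that the batch size of 16 applied to both, neither of which this card states explicitly. `StageConfig`, `total_steps`, and the dataset size of 1000 are hypothetical and serve only to illustrate the step arithmetic.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    name: str
    epochs: int
    batch_size: int
    learning_rate: float

    def total_steps(self, num_examples: int) -> int:
        """Optimizer steps for the stage, assuming no gradient accumulation."""
        return self.epochs * math.ceil(num_examples / self.batch_size)

# Values taken from the hyperparameter list; the stage split is an assumption.
STAGES = [
    StageConfig("mlm", epochs=10, batch_size=16, learning_rate=5e-6),
    StageConfig("classification", epochs=20, batch_size=16, learning_rate=2e-6),
]

# Hypothetical dataset size, used only to illustrate the arithmetic.
NUM_EXAMPLES = 1000
for stage in STAGES:
    print(stage.name, stage.total_steps(NUM_EXAMPLES))
```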

## Citation

If using this model, please cite as:

```bibtex
@misc{mpnet_cyber_finetune,
  author    = {Hamzic, D.},
  title     = {MPNet Fine-Tuned for Cybersecurity Group Classification},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/selfconstruct3d/mpnet-classification-finetuned-cyber-groups}
}
```

## Contact

- **Author:** Dženan Hamzić
- **Contact Information:** https://www.linkedin.com/in/dzenan-hamzic/