---
library_name: transformers
tags:
- cybersecurity
- mpnet
- embeddings
- classification
license: apache-2.0
language:
- en
base_model:
- microsoft/mpnet-base
---
# MPNet (Cyber) - MPNet Fine-Tuned for Cybersecurity Group Classification
This MPNet model was fine-tuned specifically for classifying cybersecurity threat groups based on textual descriptions from cybersecurity reports.
## Model Details
### Model Description
This model is based on `microsoft/mpnet-base` and fine-tuned using Masked Language Modeling (MLM) and supervised classification on cybersecurity threat intelligence descriptions, primarily focused on known threat actor groups.
### Model Information
- **Base Model:** microsoft/mpnet-base
- **Tasks:** Text classification, embedding generation
- **Language:** English
## Intended Use
### Primary Use
This model generates specialized embeddings that are useful for:
- Identifying cybersecurity threat actor groups from textual descriptions
- Cybersecurity threat intelligence analysis
- Embedding-based retrieval tasks in cybersecurity contexts
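The embedding-based retrieval use case can be sketched as a cosine-similarity search. This is a minimal illustration rather than the author's pipeline: `cosine_top_k` is a hypothetical helper, and the toy 3-dimensional vectors stand in for real model embeddings, which are much higher-dimensional.

```python
import numpy as np

def cosine_top_k(query, corpus, k=3):
    """Return (indices, scores) of the k corpus rows most similar to query."""
    query = np.asarray(query, dtype=np.float64)
    corpus = np.asarray(corpus, dtype=np.float64)
    # L2-normalize so that the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    # Sort by descending similarity and keep the top k
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy vectors standing in for embeddings of stored report snippets (assumption)
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])
indices, scores = cosine_top_k(query, corpus, k=2)
```

In practice, `corpus` would hold embeddings of indexed report passages and `query` the embedding of a new description.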
### Out-of-Scope Use
This model is not intended for general language tasks outside cybersecurity contexts.
## Performance Evaluation
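The model was benchmarked against the base MPNet model and several cybersecurity NLP models. It improves on the base model in both metrics but trails SecBERT and ATTACK-BERT in classification accuracy: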
| Model | Classification Accuracy | Embedding Variability |
|------------------|-------------------------|-----------------------|
| Original MPNet | 55.73% | 0.0798 |
| SecBERT | 91.67% | 0.5911 |
| ATTACK-BERT | 83.51% | 0.0960 |
| MPNet (Cyber) | 72.74% | 0.1239 |
| SecureBERT | 49.31% | 0.0071 |
### Downstream Tasks
- Attribution of cybersecurity incidents
- Automated analysis of threat intelligence reports
- Embeddings for cybersecurity threat detection
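One simple way to use embeddings for attribution is a nearest-centroid classifier: average the embeddings of known reports per group, then assign a new report to the group whose centroid is most similar. The sketch below is illustrative only and is not the classification head or evaluation protocol behind the reported accuracy. The helper names, the placeholder labels `group_a` and `group_b`, and the 2-dimensional vectors are all assumptions.

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Map each label to the mean of its L2-normalized embeddings."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids = {}
    for label in sorted(set(labels)):
        rows = [i for i, l in enumerate(labels) if l == label]
        centroids[label] = normed[rows].mean(axis=0)
    return centroids

def predict_group(embedding, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    v = np.asarray(embedding, dtype=np.float64)
    v = v / np.linalg.norm(v)

    def cosine(c):
        return float(v @ c / np.linalg.norm(c))

    return max(centroids, key=lambda label: cosine(centroids[label]))

# Toy training embeddings with placeholder group labels (made-up values)
train = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
labels = ["group_a", "group_a", "group_b", "group_b"]
centroids = fit_centroids(train, labels)
predicted = predict_group([0.8, 0.2], centroids)
```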
### Limitations
- Best suited for English-language cybersecurity contexts
- May require further fine-tuning for highly specific tasks
## Usage
To generate embeddings with `transformers`:
```python
from transformers import AutoTokenizer, MPNetModel
import torch

model_id = "selfconstruct3d/mpnet-classification-finetuned-cyber-groups"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = MPNetModel.from_pretrained(model_id)
model.eval()  # disable dropout for inference

inputs = tokenizer("APT38 uses ransomware for financial gains.", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
# Mean-pool the token embeddings into a single sentence vector
embeddings = outputs.last_hidden_state.mean(dim=1)
```
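When several texts are batched with padding, a plain `mean(dim=1)` also averages over the padding positions. A common alternative is mean pooling weighted by the attention mask. The sketch below uses a hypothetical helper, `masked_mean_pool`, and a hand-written tensor in place of real model outputs; this card does not state which pooling was used during fine-tuning, so treat it as one reasonable option.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    # Expand the mask to (batch, tokens, 1) so it broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # avoid division by zero
    return summed / counts

# Toy batch: one sequence of three tokens, the last of which is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
```

With the snippet above, this would be called as `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.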
Alternatively, with `sentence-transformers`:
```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('selfconstruct3d/mpnet-classification-finetuned-cyber-groups')
embeddings = model.encode(sentences)
print(embeddings)
```
## Training Details
### Training Data
Fine-tuned on descriptions of threat actor activities sourced from cybersecurity reports, including MITRE ATT&CK techniques.
### Hyperparameters
- **Epochs:** 10 (MLM), 20 (classification)
- **Batch size:** 16
- **Learning rate:** 5e-6 (MLM), 2e-6 (classification)
- **Hardware:** GPU (CUDA-enabled)
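For reproduction attempts, the two-stage schedule listed above can be captured in a small config object. The class and field names here are hypothetical; only the numeric values come from this card. The sketch assumes the single listed batch size applies to both stages, and it leaves out settings the card does not list (optimizer, warmup, sequence length).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    name: str
    epochs: int
    batch_size: int
    learning_rate: float

# Values taken from the Hyperparameters list; the stage order (MLM first,
# then classification) is assumed from the order in which they are listed.
STAGES = (
    StageConfig("mlm", epochs=10, batch_size=16, learning_rate=5e-6),
    StageConfig("classification", epochs=20, batch_size=16, learning_rate=2e-6),
)

total_epochs = sum(stage.epochs for stage in STAGES)
```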
## Citation
If you use this model, please cite it as:
```bibtex
@misc{mpnet_cyber_finetune,
author = {Hamzic, D.},
title = {MPNet Fine-Tuned for Cybersecurity Group Classification},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/selfconstruct3d/mpnet-classification-finetuned-cyber-groups}
}
```
## Contact
- **Author:** Dženan Hamzić
- **Contact Information:** https://www.linkedin.com/in/dzenan-hamzic/