|
--- |
|
library_name: transformers |
|
tags: |
|
- cybersecurity |
|
- mpnet |
|
- classification |
|
- fine-tuned |
|
--- |
|
|
|
# AttackGroup-MPNET - Model Card for MPNet Cybersecurity Classifier |
|
|
|
This is a fine-tuned MPNet model specialized for classifying cybersecurity threat groups based on textual descriptions of their tactics and techniques. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is a fine-tuned MPNet classifier specialized in categorizing cybersecurity threat groups based on textual descriptions of their tactics, techniques, and procedures (TTPs). |
|
|
|
- **Developed by:** Dženan Hamzić |
|
- **Model type:** Transformer-based classification model (MPNet) |
|
- **Language(s) (NLP):** English |
|
- **License:** Apache-2.0 |
|
- **Finetuned from model:** microsoft/mpnet-base (with intermediate MLM fine-tuning) |
|
|
|
### Model Sources |
|
|
|
- **Base Model:** [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model classifies textual cybersecurity descriptions into known cybersecurity threat groups. |
|
|
|
### Downstream Use |
|
|
|
The model can be integrated into Cyber Threat Intelligence (CTI) platforms, SOC incident-analysis tools, and automated threat-detection systems.
|
|
|
### Out-of-Scope Use |
|
|
|
- General-purpose language tasks or any text outside the cybersecurity domain
|
|
|
## Bias, Risks, and Limitations |
|
|
|
The model is specialized for cybersecurity text; predictions on inputs from unrelated domains may be inaccurate. It can only assign one of the threat groups seen during training and cannot recognize previously unseen groups.
|
|
|
### Recommendations |
|
|
|
Always have cybersecurity analysts verify predictions before they are used in critical decision-making scenarios.
|
|
|
## How to Get Started with the Model (Classification) |
|
|
|
```python |
|
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
|
|
label_to_groupid_file = hf_hub_download( |
|
repo_id="selfconstruct3d/AttackGroup-MPNET", |
|
filename="label_to_groupid.json" |
|
) |
|
|
|
with open(label_to_groupid_file, "r") as f: |
|
label_to_groupid = json.load(f) |
|
|
|
# Load the fine-tuned MPNet classification model
|
classifier_model = AutoModelForSequenceClassification.from_pretrained("selfconstruct3d/AttackGroup-MPNET", num_labels=len(label_to_groupid)).to(device) |
|
|
|
# Load the matching tokenizer
|
tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET") |
|
|
|
def predict_group(sentence): |
|
classifier_model.eval() |
|
encoding = tokenizer( |
|
sentence, |
|
truncation=True, |
|
padding="max_length", |
|
max_length=128, |
|
return_tensors="pt" |
|
) |
|
input_ids = encoding["input_ids"].to(device) |
|
attention_mask = encoding["attention_mask"].to(device) |
|
|
|
with torch.no_grad(): |
|
outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask) |
|
logits = outputs.logits |
|
predicted_label = torch.argmax(logits, dim=1).cpu().item() |
|
|
|
predicted_groupid = label_to_groupid[str(predicted_label)] |
|
return predicted_groupid |
|
|
|
# Example usage:
|
sentence = "APT38 has used phishing emails with malicious links to distribute malware." |
|
predicted_class = predict_group(sentence) |
|
print(f"Predicted GroupID: {predicted_class}") |
|
``` |
|
Expected output: `Predicted GroupID: G0001`
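
For analyst triage it can help to look beyond the single best label. The sketch below is an illustrative extension, not part of the released code; it reuses `classifier_model`, `tokenizer`, `device`, and `label_to_groupid` from the example above, and the helper name `predict_top_k` is an assumption.

```python
import torch.nn.functional as F

def predict_top_k(sentence, k=3):
    """Return the k most likely group IDs together with their softmax probabilities."""
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = classifier_model(
            input_ids=encoding["input_ids"].to(device),
            attention_mask=encoding["attention_mask"].to(device),
        )
        probs = F.softmax(outputs.logits, dim=1).squeeze(0)

    top_probs, top_labels = torch.topk(probs, k)
    return [
        (label_to_groupid[str(label.item())], prob.item())
        for label, prob in zip(top_labels, top_probs)
    ]

example = "APT38 has used phishing emails with malicious links to distribute malware."
for group_id, prob in predict_top_k(example, k=3):
    print(f"{group_id}: {prob:.3f}")
```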
|
|
|
## How to Get Started with the Model (Embeddings) |
|
|
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
from huggingface_hub import hf_hub_download |
|
import json |
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
|
|
label_to_groupid_file = hf_hub_download( |
|
repo_id="selfconstruct3d/AttackGroup-MPNET", |
|
filename="label_to_groupid.json" |
|
) |
|
|
|
with open(label_to_groupid_file, "r") as f: |
|
label_to_groupid = json.load(f) |
|
|
|
|
|
# Load the fine-tuned classification model and its tokenizer
|
model_name = "selfconstruct3d/AttackGroup-MPNET" |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
classifier_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(label_to_groupid)).to(device) |
|
|
|
def get_embedding(sentence): |
|
classifier_model.eval() |
|
|
|
encoding = tokenizer( |
|
sentence, |
|
truncation=True, |
|
padding="max_length", |
|
max_length=128, |
|
return_tensors="pt" |
|
) |
|
input_ids = encoding["input_ids"].to(device) |
|
attention_mask = encoding["attention_mask"].to(device) |
|
|
|
with torch.no_grad(): |
|
        # Run only the base MPNet encoder; the first-token representation is used as the sentence embedding
        outputs = classifier_model.mpnet(input_ids=input_ids, attention_mask=attention_mask)
|
cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy().flatten() |
|
|
|
return cls_embedding |
|
|
|
# Example:
|
sentence = "APT38 has used phishing emails with malicious links to distribute malware." |
|
embedding = get_embedding(sentence) |
|
print("Embedding shape:", embedding.shape) |
|
print("Embedding values:", embedding) |
|
``` |
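
The embeddings can be compared with standard similarity measures, for example cosine similarity, to relate descriptions of similar tradecraft. This is an illustrative snippet that reuses `get_embedding` from the block above; the example texts are arbitrary.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

text_a = "APT38 has used phishing emails with malicious links to distribute malware."
text_b = "The group delivered malware through spearphishing links in targeted emails."

similarity = cosine_similarity(get_embedding(text_a), get_embedding(text_b))
print(f"Cosine similarity: {similarity:.4f}")
```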
|
|
|
|
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
To be announced.
|
|
|
### Training Procedure |
|
|
|
- Fine-tuned from: MLM fine-tuned MPNet ("mpnet_mlm_cyber_finetuned-v2") |
|
- Epochs: 32 |
|
- Learning rate: 5e-6 |
|
- Batch size: 16 |
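
As an illustration only (not the released training code), these hyperparameters could be assembled with the Hugging Face `Trainer` roughly as below. `train_dataset`, `eval_dataset`, and `num_labels` are placeholders, since the training data has not yet been released.

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Placeholders: train_dataset / eval_dataset are assumed to be tokenized datasets
# with a "labels" column, and num_labels is the number of threat groups.
model = AutoModelForSequenceClassification.from_pretrained(
    "mpnet_mlm_cyber_finetuned-v2",  # the MLM fine-tuned MPNet checkpoint
    num_labels=num_labels,
)

training_args = TrainingArguments(
    output_dir="attackgroup-mpnet",
    num_train_epochs=32,
    learning_rate=5e-6,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
metrics = trainer.evaluate()
```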
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
- **Testing Data:** Stratified sample from the original dataset.
|
- **Metrics:** Accuracy, Weighted F1 Score |
|
|
|
### Results |
|
|
|
| Metric                         | Value  |
|--------------------------------|--------|
| Classification accuracy (test) | 0.9564 |
| Weighted F1 score (test)       | 0.9577 |
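
Accuracy and weighted F1 can be computed with scikit-learn; a minimal sketch, assuming `y_true` and `y_pred` hold the integer test-set labels and predictions (placeholders, since the evaluation code is not published):

```python
from sklearn.metrics import accuracy_score, f1_score

# y_true / y_pred: integer group labels for the held-out test split (placeholders)
accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"Accuracy: {accuracy:.4f}, weighted F1: {weighted_f1:.4f}")
```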
|
|
|
|
|
### Comparison with Other Models
|
|
|
| Model                 | Embedding Variability | Accuracy |
|-----------------------|-----------------------|----------|
| Original MPNet        | 0.085554              | 0.9964   |
| MLM Fine-tuned MPNet  | 0.034983              | 0.6536   |
| **AttackGroup-MPNET** | 0.193065              | 0.9508   |
| SecBERT               | 0.591303              | 0.9886   |
| ATTACK-BERT           | 0.096108              | 0.9678   |
| SecureBERT            | 0.007100              | 0.4931   |
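
The card does not define how "Embedding Variability" was measured. One plausible reading, used purely for illustration below, is the mean per-dimension standard deviation of sentence embeddings over a sample of texts; this is an assumption, not the authors' definition. The snippet reuses `get_embedding` from the embeddings example above.

```python
import numpy as np

# Assumed definition, for illustration only: mean per-dimension standard deviation
# of sentence embeddings across a small sample of cybersecurity texts.
sample_texts = [
    "APT38 has used phishing emails with malicious links to distribute malware.",
    "The group exfiltrated data over an encrypted command-and-control channel.",
    "Attackers moved laterally using stolen administrator credentials.",
]

embeddings = np.stack([get_embedding(t) for t in sample_texts])
print("Mean embedding variability over the sample:", embeddings.std(axis=0).mean())
```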
|
|
|
|
|
|
|
|
## Environmental Impact |
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute). |
|
|
|
- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture |
|
|
|
- MPNet encoder with a classification head (768 → 512 → num_labels)
- Only the last 10 transformer layers were fine-tuned; earlier layers were kept frozen (see the sketch below)
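
A rough PyTorch sketch of this setup, freezing all but the last 10 encoder layers and attaching a 768 → 512 → num_labels head. The head details (activation, dropout) are assumptions based on the description above, not the released code, and `label_to_groupid` is the mapping loaded in the earlier examples.

```python
import torch.nn as nn
from transformers import AutoModel

base = AutoModel.from_pretrained("microsoft/mpnet-base")

# Freeze all encoder parameters, then unfreeze the last 10 of the 12 layers.
for param in base.parameters():
    param.requires_grad = False
for layer in base.encoder.layer[-10:]:
    for param in layer.parameters():
        param.requires_grad = True

# Classification head as described: 768 -> 512 -> num_labels (details assumed).
num_labels = len(label_to_groupid)  # group-ID mapping from label_to_groupid.json
head = nn.Sequential(
    nn.Linear(base.config.hidden_size, 512),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(512, num_labels),
)
```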
|
|
|
|
|
|
## Model Card Authors |
|
|
|
- Dženan Hamzić |
|
|
|
## Model Card Contact |
|
|
|
- [More Information Needed] |