---
license: apache-2.0
language:
- en
- de
---

# 🛡️ MLP Cybersecurity Classifier

This repository hosts a lightweight `scikit-learn` MLP classifier that distinguishes cybersecurity-related content from other text. It operates on sentence-transformer embeddings and supports English and German input.

## 📊 Training Data

The model was trained on a multilingual dataset of cybersecurity and non-cybersecurity news articles. The dataset is publicly available on Zenodo:  
🔗 [https://zenodo.org/records/16417939](https://zenodo.org/records/16417939)

## 📦 Model Details

- **Architecture**: `MLPClassifier` with hidden layers `(128, 64)`
- **Embedding model**: [`intfloat/multilingual-e5-large`](https://huggingface.co/intfloat/multilingual-e5-large)
- **Input**: Cleaned article or report text (stopwords removed)
- **Output**: Binary label (e.g., `Cybersecurity`, `Not Cybersecurity`)
- **Languages**: English, German
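
To make the architecture above concrete, here is a minimal sketch of an `MLPClassifier` with hidden layers `(128, 64)` fitted on synthetic vectors. It is an illustration under stated assumptions, not the training script used for this model: the random data stands in for real embeddings, the 1024-dimensional input size is assumed from `intfloat/multilingual-e5-large`, the string labels are taken from the example output above, and `max_iter` and `random_state` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Assumption: multilingual-e5-large produces 1024-dimensional embeddings.
EMBEDDING_DIM = 1024

# Synthetic stand-in for real embeddings: one Gaussian blob per class.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=0.5, scale=1.0, size=(100, EMBEDDING_DIM))
X_neg = rng.normal(loc=-0.5, scale=1.0, size=(100, EMBEDDING_DIM))
X = np.vstack([X_pos, X_neg])
y = np.array(["Cybersecurity"] * 100 + ["Not Cybersecurity"] * 100)

# The hidden layer sizes come from the model card. Every other
# hyperparameter is a scikit-learn default or an illustrative choice.
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=200, random_state=42)
clf.fit(X, y)

# Inspect the learned classes and the shape of each weight matrix.
print(clf.classes_)
print([w.shape for w in clf.coefs_])
```

The weight shapes show the data flow: the embedding feeds a 128-unit layer, then a 64-unit layer, then a single output unit for the binary decision.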

## 🔧 Usage

```python
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import joblib
from huggingface_hub import hf_hub_download

# Load your cleaned dataset
df = pd.read_csv("your_dataset.csv")  # Requires 'clean_text' and 'label' columns

# Load the sentence transformer
embedder = SentenceTransformer("intfloat/multilingual-e5-large")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"],
    df["label"],
    test_size=0.05,
    stratify=df["label"],
    random_state=42
)

# Encode labels
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)

# Generate sentence embeddings
X_train_emb = embedder.encode(X_train.tolist(), convert_to_numpy=True, show_progress_bar=True)
X_test_emb = embedder.encode(X_test.tolist(), convert_to_numpy=True, show_progress_bar=True)

# Load the trained classifier
model_path = hf_hub_download(repo_id="selfconstruct3d/cybersec_classifier", filename="cybersec_classifier.pkl")
model = joblib.load(model_path)

# Predict
y_pred = model.predict(X_test_emb)
```