license: apache-2.0
language:
- en
- de
🛡️ MLP Cybersecurity Classifier
This repository hosts a lightweight scikit-learn-based MLP classifier trained to distinguish cybersecurity-related content from other text, using sentence-transformer embeddings. It supports English and German input texts.
📊 Training Data
The model was trained on a multilingual dataset of cybersecurity and non-cybersecurity news articles. The dataset is publicly available on Zenodo:
🔗 https://zenodo.org/records/16417939
📦 Model Details
- Architecture: MLPClassifier with hidden layers (128, 64)
- Embedding model: intfloat/multilingual-e5-large
- Input: Cleaned article (removed stopwords) or report text
- Output: Binary label (e.g., Cybersecurity, Not Cybersecurity)
- Languages: English, German
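As a rough illustration of the architecture listed above, the following sketch configures a scikit-learn MLPClassifier with hidden layers of 128 and 64 units and fits it on random 1024-dimensional vectors, which is the embedding size of intfloat/multilingual-e5-large. The synthetic data, the label strings, random_state, and max_iter are assumptions made so the example runs on its own; the actual training hyperparameters are not documented here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

EMBEDDING_DIM = 1024  # output size of intfloat/multilingual-e5-large

# Synthetic stand-in for real sentence embeddings (assumption, not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, EMBEDDING_DIM))
y = np.where(np.arange(40) % 2 == 0, "Cybersecurity", "Not Cybersecurity")
X[y == "Cybersecurity"] += 0.5  # shift one class so the toy task is learnable

# Two hidden layers with 128 and 64 units, as listed in the model details.
# random_state and max_iter are illustrative choices, not the original settings.
clf = MLPClassifier(hidden_layer_sizes=(128, 64), random_state=0, max_iter=200)
clf.fit(X, y)

predictions = clf.predict(X[:4])
print(predictions)
print([w.shape for w in clf.coefs_])  # one weight matrix per layer transition
```

The printed shapes show how the embedding dimension feeds into the first hidden layer and how the network narrows toward a single output unit for the binary decision.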
🔧 Usage
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import joblib
from huggingface_hub import hf_hub_download
# Load your cleaned dataset
df = pd.read_csv("your_dataset.csv") # Requires 'clean_text' and 'label' columns
# Load the sentence transformer
embedder = SentenceTransformer("intfloat/multilingual-e5-large")
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"],
    df["label"],
    test_size=0.05,
    stratify=df["label"],
    random_state=42
)
# Encode labels
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)
# Generate sentence embeddings
X_train_emb = embedder.encode(X_train.tolist(), convert_to_numpy=True, show_progress_bar=True)
X_test_emb = embedder.encode(X_test.tolist(), convert_to_numpy=True, show_progress_bar=True)
# Load the trained classifier
model_path = hf_hub_download(repo_id="selfconstruct3d/cybersec_classifier", filename="cybersec_classifier.pkl")
model = joblib.load(model_path)
# Predict
y_pred = model.predict(X_test_emb)
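To classify new texts rather than a held-out split, the embedding and prediction steps above can be wrapped in a small helper. This is a minimal sketch: classify_texts is a hypothetical name, and the optional label_encoder argument is an assumption, since it is not stated here whether the saved model returns encoded integers or label strings. Any object with an encode method (such as the SentenceTransformer above) and any object with a predict method (such as the loaded classifier) can be passed in.

```python
def classify_texts(texts, embedder, model, label_encoder=None):
    """Embed texts and return one predicted label per text.

    classify_texts is a hypothetical helper, not part of the repository.
    """
    # Same encode call as in the usage snippet, minus the progress bar.
    embeddings = embedder.encode(list(texts), convert_to_numpy=True)
    predictions = model.predict(embeddings)
    # Assumption: decoding is only needed if the model outputs encoded integers.
    if label_encoder is not None:
        predictions = label_encoder.inverse_transform(predictions)
    return list(predictions)


# Example call, reusing the objects created in the usage snippet:
# labels = classify_texts(
#     ["New ransomware campaign targets hospitals"],
#     embedder,
#     model,
#     label_encoder=label_encoder,
# )
```

Texts passed to the helper should be cleaned the same way as the training input (stopwords removed), since the classifier was trained on cleaned text.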