---
license: apache-2.0
language:
- en
- de
---

# 🛡️ MLP Cybersecurity Classifier

This repository hosts a lightweight `scikit-learn` MLP classifier that distinguishes cybersecurity-related text from other content. Input texts are first converted to sentence-transformer embeddings, and the classifier operates on those embeddings. English and German input texts are supported.
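
The two-stage flow described above (embed, then classify) can be sketched as a small helper. This is only an illustration: `classify_texts` is a hypothetical name, not part of this repository, and it assumes the embedder exposes `encode(...)` and the classifier exposes `predict(...)`, as the objects in the Usage section do. Texts are assumed to be cleaned already.

```python
from typing import Any, List, Sequence


def classify_texts(texts: Sequence[str], embedder: Any, classifier: Any) -> List[Any]:
    """Embed already-cleaned texts, then classify the resulting embeddings.

    Hypothetical helper: `embedder` is assumed to offer an `encode` method
    and `classifier` a `predict` method; neither is loaded here.
    """
    # Stage 1: turn each text into a fixed-size embedding vector.
    embeddings = embedder.encode(list(texts), convert_to_numpy=True)
    # Stage 2: let the classifier label each embedding.
    return list(classifier.predict(embeddings))
```

With the objects from the Usage section, a call might look like `classify_texts(["some cleaned text"], embedder, model)`.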

## 📊 Training Data

The model was trained on a multilingual dataset of cybersecurity and non-cybersecurity news articles. The dataset is publicly available on Zenodo:
🔗 [https://zenodo.org/records/16417939](https://zenodo.org/records/16417939)

## 📦 Model Details

- **Architecture**: `MLPClassifier` with hidden layers `(128, 64)`
- **Embedding model**: [`intfloat/multilingual-e5-large`](https://huggingface.co/intfloat/multilingual-e5-large)
- **Input**: Cleaned article or report text (stopwords removed)
- **Output**: Binary label (e.g., `Cybersecurity`, `Not Cybersecurity`)
- **Languages**: English, German
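
To make the architecture entry concrete, here is a minimal sketch of how an `MLPClassifier` with hidden layers `(128, 64)` could be set up on 1024-dimensional vectors, the embedding size of `multilingual-e5-large`. The data is synthetic and stands in for real embeddings; `max_iter` and `random_state` are assumptions, since no training hyperparameters beyond the layer sizes are stated here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

EMBEDDING_DIM = 1024  # output size of intfloat/multilingual-e5-large

# Synthetic stand-in for sentence embeddings: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X_cyber = rng.normal(loc=0.5, scale=1.0, size=(100, EMBEDDING_DIM))
X_other = rng.normal(loc=-0.5, scale=1.0, size=(100, EMBEDDING_DIM))
X = np.vstack([X_cyber, X_other])
y = np.array(["Cybersecurity"] * 100 + ["Not Cybersecurity"] * 100)

# Two hidden layers with 128 and 64 units, as listed under Architecture.
# max_iter and random_state are illustrative choices, not documented values.
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=200, random_state=42)
clf.fit(X, y)

print(clf.classes_)            # label names learned from y
print(clf.predict(X[:1]))      # prediction for the first synthetic vector
```

For a binary task, scikit-learn gives the network a single output unit, so the weight matrices have shapes `(1024, 128)`, `(128, 64)` and `(64, 1)`.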

## 🔧 Usage

```python
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import joblib
from huggingface_hub import hf_hub_download

# Load your cleaned dataset
df = pd.read_csv("your_dataset.csv")  # Requires 'clean_text' and 'label' columns

# Load the sentence transformer
embedder = SentenceTransformer("intfloat/multilingual-e5-large")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"],
    df["label"],
    test_size=0.05,
    stratify=df["label"],
    random_state=42
)

# Encode labels
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)

# Generate sentence embeddings
X_train_emb = embedder.encode(X_train.tolist(), convert_to_numpy=True, show_progress_bar=True)
X_test_emb = embedder.encode(X_test.tolist(), convert_to_numpy=True, show_progress_bar=True)

# Load the trained classifier
model_path = hf_hub_download(repo_id="selfconstruct3d/cybersec_classifier", filename="cybersec_classifier.pkl")
model = joblib.load(model_path)

# Predict
y_pred = model.predict(X_test_emb)