	---
	license: apache-2.0
	language:
	- en
	- de
	---

	# 🛡️ MLP Cybersecurity Classifier

	This repository hosts a lightweight `scikit-learn`-based MLP classifier trained to distinguish cybersecurity-related content from other text, using sentence-transformer embeddings. It supports English and German input texts.

	## 📊 Training Data

	The model was trained on a multilingual dataset of cybersecurity and non-cybersecurity news articles. The dataset is publicly available on Zenodo:
	🔗 [https://zenodo.org/records/16417939](https://zenodo.org/records/16417939)

	## 📦 Model Details

	- Architecture: `MLPClassifier` with hidden layers `(128, 64)`
	- Embedding model: [`intfloat/multilingual-e5-large`](https://huggingface.co/intfloat/multilingual-e5-large)
	- Input: Cleaned article or report text (stopwords removed), embedded with the model above
	- Output: Binary label (e.g., `Cybersecurity`, `Not Cybersecurity`)
	- Languages: English, German
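
As a rough illustration of the architecture listed above, the following sketch builds an `MLPClassifier` with hidden layers `(128, 64)` on vectors of size 1024, the embedding dimension of `intfloat/multilingual-e5-large`. The data is random and stands in for real embeddings, and the 0/1 label encoding, `max_iter`, and `random_state` values are assumptions made for the example, not the settings used to train the published model.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

EMBEDDING_DIM = 1024  # output size of intfloat/multilingual-e5-large

# Synthetic stand-ins for real sentence embeddings and labels (illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, EMBEDDING_DIM))
y = rng.integers(0, 2, size=200)  # assumed encoding: 0 = Not Cybersecurity, 1 = Cybersecurity

# Same hidden-layer layout as the model card; other hyperparameters are assumptions.
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=200, random_state=42)
clf.fit(X, y)

pred = clf.predict(X[:5])         # one label per input row
proba = clf.predict_proba(X[:5])  # one probability per class per row
print(pred.shape, proba.shape)
```

With real data, `X` would be the output of `embedder.encode(...)` rather than random noise, so the classifier sees one 1024-dimensional vector per article.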

	## 🔧 Usage

	```python
	from sentence_transformers import SentenceTransformer
	from sklearn.model_selection import train_test_split
	from sklearn.preprocessing import LabelEncoder
	import pandas as pd
	import joblib
	from huggingface_hub import hf_hub_download

	# Load your cleaned dataset
	df = pd.read_csv("your_dataset.csv") # Requires 'clean_text' and 'label' columns

	# Load the sentence transformer
	embedder = SentenceTransformer("intfloat/multilingual-e5-large")

	# Train-test split
	X_train, X_test, y_train, y_test = train_test_split(
	    df["clean_text"],
	    df["label"],
	    test_size=0.05,
	    stratify=df["label"],
	    random_state=42,
	)

	# Encode labels
	label_encoder = LabelEncoder()
	y_train_enc = label_encoder.fit_transform(y_train)
	y_test_enc = label_encoder.transform(y_test)

	# Generate sentence embeddings
	X_train_emb = embedder.encode(X_train.tolist(), convert_to_numpy=True, show_progress_bar=True)
	X_test_emb = embedder.encode(X_test.tolist(), convert_to_numpy=True, show_progress_bar=True)

	# Load the trained classifier
	model_path = hf_hub_download(repo_id="selfconstruct3d/cybersec_classifier", filename="cybersec_classifier.pkl")
	model = joblib.load(model_path)

	# Predict
	y_pred = model.predict(X_test_emb)