---
license: apache-2.0
language:
- en
- de
---

# 🛡️ MLP Cybersecurity Classifier

This repository hosts a lightweight `scikit-learn`-based MLP classifier trained to distinguish cybersecurity-related content from other text, using sentence-transformer embeddings. It supports English and German input text.

## 📊 Training Data

The model was trained on a multilingual dataset of cybersecurity and non-cybersecurity news articles. The dataset is publicly available on Zenodo:

🔗 [https://zenodo.org/records/16417939](https://zenodo.org/records/16417939)

## 📦 Model Details

- **Architecture**: `MLPClassifier` with hidden layers `(128, 64)`
- **Embedding model**: [`intfloat/multilingual-e5-large`](https://huggingface.co/intfloat/multilingual-e5-large)
- **Input**: Cleaned article or report text (stopwords removed)
- **Output**: Binary label (e.g., `Cybersecurity`, `Not Cybersecurity`)
- **Languages**: English, German

## 🔧 Usage

```python
from sentence_transformers import SentenceTransformer
from huggingface_hub import hf_hub_download
import joblib

# 1. Load the embedding model
embedder = SentenceTransformer("intfloat/multilingual-e5-large")

# 2. Load the pretrained MLP classifier from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="selfconstruct3d/cybersec_classifier",
    filename="cybersec_classifier.pkl"
)
model = joblib.load(model_path)

# 3. Example input texts (can be in English or German)
texts = [
    "A new ransomware attack has affected critical infrastructure in Germany.",
    "The local sports club hosted its annual summer festival this weekend."
]

# 4. Generate embeddings
embeddings = embedder.encode(texts, convert_to_numpy=True, show_progress_bar=False)

# 5. Predict cybersecurity relevance
predictions = model.predict(embeddings)

# 6. Output results
for text, label in zip(texts, predictions):
    print(f"Text: {text}\nPrediction: {label}\n")
```
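
## 🧹 Optional Preprocessing

The Model Details section notes that the training input was cleaned (stopwords removed). If you want to mirror that cleaning at inference time, the sketch below removes English and German stopwords with NLTK before embedding. This is only an assumption about the cleaning step; the exact pipeline used to prepare the Zenodo dataset is not documented here, and the NLTK stopword lists are a stand-in.

```python
# Minimal stopword-removal sketch (assumption: approximates the training-time cleaning).
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

# Combine English and German stopword lists, since the model accepts both languages.
STOPWORDS = set(stopwords.words("english")) | set(stopwords.words("german"))

def remove_stopwords(text: str) -> str:
    """Drop stopword tokens while keeping the original order and casing of the rest."""
    return " ".join(word for word in text.split() if word.lower() not in STOPWORDS)

# Apply the cleaning before embedding, then predict as in the Usage example above.
cleaned = [remove_stopwords(t) for t in texts]
embeddings = embedder.encode(cleaned, convert_to_numpy=True, show_progress_bar=False)
predictions = model.predict(embeddings)
```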