selfconstruct3d committed
Commit 8188807 · verified · 1 Parent(s): c0565d2

Create README.md

Files changed (1)
  1. README.md +59 -0
README.md ADDED
@@ -0,0 +1,59 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - de
+ ---
+
+ # 🛡️ MLP Cybersecurity Classifier
+
+ This repository hosts a lightweight `scikit-learn`-based MLP classifier trained to distinguish cybersecurity-related content from other text, using sentence-transformer embeddings. It supports English and German input texts.
+
+ ## 📦 Model Details
+
+ - **Architecture**: `MLPClassifier` with hidden layers `(128, 64)`
+ - **Embedding model**: [`intfloat/multilingual-e5-large`](https://huggingface.co/intfloat/multilingual-e5-large)
+ - **Input**: Cleaned article (removed stopwords) or report text
+ - **Output**: Binary label (e.g., `Cybersecurity`, `Not Cybersecurity`)
+ - **Languages**: English, German
+
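Review note: the commit does not show how the hosted classifier was trained. As a rough sketch of the architecture listed under Model Details, the block below fits an `MLPClassifier` with hidden layers `(128, 64)` on synthetic 1024-dimensional vectors, the output size of `intfloat/multilingual-e5-large`. The random data, `max_iter`, and `random_state` are assumptions made for illustration only; they are not taken from the repository.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for sentence embeddings (assumption: real inputs would
# come from SentenceTransformer.encode, which yields 1024-dim vectors here).
rng = np.random.default_rng(0)
n_per_class, dim = 40, 1024
X = np.vstack([
    rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, dim)),
    rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim)),
])
# 0 and 1 stand for the two label-encoded classes.
y = np.array([0] * n_per_class + [1] * n_per_class)

# Two hidden layers of 128 and 64 units, as listed under Model Details.
# max_iter and random_state are illustrative choices, not the author's settings.
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300, random_state=42)
clf.fit(X, y)

preds = clf.predict(X)
```

After fitting, `clf.coefs_` holds one weight matrix per layer transition, which is a quick way to confirm the layer sizes.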
+ ## 🔧 Usage
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sklearn.model_selection import train_test_split
+ from sklearn.preprocessing import LabelEncoder
+ import pandas as pd
+ import joblib
+ from huggingface_hub import hf_hub_download
+
+ # Load your cleaned dataset
+ df = pd.read_csv("your_dataset.csv") # Requires 'clean_text' and 'label' columns
+
+ # Load the sentence transformer
+ embedder = SentenceTransformer("intfloat/multilingual-e5-large")
+
+ # Train-test split
+ X_train, X_test, y_train, y_test = train_test_split(
+     df["clean_text"],
+     df["label"],
+     test_size=0.05,
+     stratify=df["label"],
+     random_state=42
+ )
+
+ # Encode labels
+ label_encoder = LabelEncoder()
+ y_train_enc = label_encoder.fit_transform(y_train)
+ y_test_enc = label_encoder.transform(y_test)
+
+ # Generate sentence embeddings
+ X_train_emb = embedder.encode(X_train.tolist(), convert_to_numpy=True, show_progress_bar=True)
+ X_test_emb = embedder.encode(X_test.tolist(), convert_to_numpy=True, show_progress_bar=True)
+
+ # Load the trained classifier
+ model_path = hf_hub_download(repo_id="your-selfconstruct3d/cybersec-classifier", filename="cybersec_classifier.pkl")
+ model = joblib.load(model_path)
+
+ # Predict
+ y_pred = model.predict(X_test_emb)
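
Review note: the added README stops at the prediction step and never uses `y_test_enc`. A minimal sketch of how the predictions could be scored and mapped back to label names follows. The four-item `y_test` and the `y_pred` array are made-up values, and the sketch assumes the pickled model returns the same integer encoding that a locally fitted `LabelEncoder` produces. That may not hold: if the model was trained on string labels, `predict` would return strings and the `inverse_transform` step would be unnecessary.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

# Hypothetical ground-truth labels standing in for the real test split.
y_test = ["Cybersecurity", "Not Cybersecurity", "Cybersecurity", "Not Cybersecurity"]

# LabelEncoder sorts class names, so "Cybersecurity" -> 0, "Not Cybersecurity" -> 1.
label_encoder = LabelEncoder().fit(y_test)
y_test_enc = label_encoder.transform(y_test)

# Hypothetical model output; in the README this would be model.predict(X_test_emb).
y_pred = np.array([0, 1, 1, 1])

# Fraction of predictions that match the encoded ground truth.
accuracy = accuracy_score(y_test_enc, y_pred)

# Map integer predictions back to human-readable class names.
y_pred_labels = label_encoder.inverse_transform(y_pred)

# Per-class precision, recall, and F1 as a formatted string.
report = classification_report(
    y_test_enc, y_pred, target_names=label_encoder.classes_
)
print(f"accuracy = {accuracy:.2f}")
print(report)
```

With a stratified 5% test split, as in the README, per-class metrics from `classification_report` are usually more informative than accuracy alone when the classes are imbalanced.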