# 🧱 Dockerfile Quality Classifier – Multilabel Model

This model predicts **which rules are violated** in a given Dockerfile. It is a multilabel classifier trained to detect violations of the 30 most frequent Hadolint rules.

---
## 🧠 Model Overview

- **Architecture:** Fine-tuned `microsoft/codebert-base`
- **Task:** Multi-label classification (30 labels)
- **Input:** Full Dockerfile content as plain text
- **Output:** A violation probability for each of the 30 rules
- **Max input length:** 512 tokens (longer files are truncated)
- **Threshold:** 0.5 (configurable)

---
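Because the head is multilabel, the 30 logits go through an element-wise sigmoid rather than a softmax, and every rule whose probability clears the threshold is flagged independently. A minimal sketch of that decision step (the rule IDs and logits below are made up for illustration):

```python
import math

THRESHOLD = 0.5

# Hypothetical rule IDs and raw logits, for illustration only —
# the real label order comes from top_rules.json shipped with the model.
labels = ["DL3007", "DL3008", "DL3020"]
logits = [2.1, -0.4, 0.9]

# Sigmoid turns each logit into an independent probability.
probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Every rule above the threshold is reported — zero, one, or many.
triggered = [(lab, p) for lab, p in zip(labels, probs) if p > THRESHOLD]
print(triggered)
```

Note that, unlike a softmax classifier, the probabilities do not sum to one, so any subset of rules can fire at once.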
## 📚 Training Details

- **Total training files:** ~15,000 Dockerfiles with at least one rule violation
- **Per-rule cap:** Max 2,000 files per rule to limit class imbalance
- **Clean files:** ~1,500 examples with no Hadolint violations
- **Label source:** Hadolint output (top 30 rules only)
- **Targets:** 30-dimensional multi-hot vectors, e.g. `[1, 0, 0, 1, ...]`

---
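The target vector for each file marks which of the 30 tracked rules Hadolint flagged, so it is multi-hot rather than strictly one-hot. A sketch with a shortened, hypothetical rule list:

```python
# Hypothetical, shortened rule list — the real one has 30 entries
# and its order must match the model's output indices.
top_rules = ["DL3007", "DL3008", "DL3013", "DL3020"]

# Rule IDs Hadolint reported for one training file (example values).
violations = {"DL3008", "DL3020"}

# Multi-hot vector: 1 where the file violates the rule, else 0.
label = [1 if rule in violations else 0 for rule in top_rules]
print(label)  # → [0, 1, 0, 1]
```

A clean file simply gets an all-zero vector.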
## 🧪 Evaluation Snapshot

Evaluation on 6,873 labeled examples:

| Metric          | Value |
|-----------------|-------|
| Micro avg F1    | 0.97  |
| Macro avg F1    | 0.95  |
| Weighted avg F1 | 0.97  |
| Samples avg F1  | 0.97  |

More metrics are available in `classification_report.csv`.

---
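Micro-averaged F1 pools true/false positives across all rules, while macro-averaged F1 averages per-rule scores, so rare rules count as much as common ones; the gap between 0.97 and 0.95 above suggests slightly weaker performance on infrequent rules. A toy two-rule illustration (the counts are invented, not the model's):

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall; 0 when there are no true positives.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Per-rule (tp, fp, fn) counts for two hypothetical rules:
# a common rule with strong scores and a rare rule with weak ones.
counts = {"rule_a": (90, 5, 5), "rule_b": (2, 1, 7)}

# Macro: average the per-rule F1 scores.
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, then compute one F1.
micro = f1(*(sum(col) for col in zip(*counts.values())))
print(f"macro={macro:.3f} micro={micro:.3f}")
```

The rare rule drags the macro score down while barely moving the micro score, which is the pattern the table above hints at.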
## 🚀 Quick Start

### 🧪 Step 1 — Create the test script

Save the following as `test_multilabel_predict.py`:
```python
from pathlib import Path
import json
import sys

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
TOP_RULES_PATH = "top_rules.json"
THRESHOLD = 0.5


def main():
    if len(sys.argv) < 2:
        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
        return

    debug = "--debug" in sys.argv
    file_path = Path(sys.argv[1])
    if not file_path.exists():
        print(f"File {file_path} not found.")
        return

    # Rule names, in the same order as the model's output labels.
    with open(TOP_RULES_PATH, encoding="utf-8") as f:
        labels = json.load(f)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    text = file_path.read_text(encoding="utf-8")
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       padding="max_length", max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits).squeeze().cpu().numpy()

    triggered = [(labels[i], probs[i]) for i in range(len(labels)) if probs[i] > THRESHOLD]
    top5 = np.argsort(probs)[-5:][::-1]

    print(f"\n🧪 Prediction for file: {file_path.name}")
    print(f"📄 Lines in file: {len(text.splitlines())}")

    if triggered:
        print(f"\n🚨 Detected violations (p > {THRESHOLD}):")
        for rule, p in triggered:
            print(f" - {rule}: {p:.3f}")
    else:
        print("✅ No violations detected.")

    if debug:
        print("\n🛠 DEBUG INFO:")
        print(f"📝 Text snippet:\n{text[:300]}")
        print(f"🔢 Token count: {len(inputs['input_ids'][0])}")
        print(f"📈 Logits: {logits.squeeze().tolist()}")
        print("\n🔥 Top 5 predictions:")
        for idx in top5:
            print(f" - {labels[idx]}: {probs[idx]:.3f}")


if __name__ == "__main__":
    main()
```
Make sure `top_rules.json` is available next to the script.

---
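The file is assumed to be a JSON array of rule IDs in exactly the order used during training, so index `i` of the model output maps to `labels[i]`. A small format sketch (the three IDs below are placeholders; the real file has 30 entries):

```python
import json

# Placeholder content mirroring the expected top_rules.json format —
# the real file ships alongside the model and lists 30 Hadolint rule IDs.
raw = '["DL3007", "DL3008", "DL3020"]'
labels = json.loads(raw)

# Sanity checks: rule IDs must be unique, and the length must match
# the model's number of output labels (30 for this model).
assert len(labels) == len(set(labels)), "rule IDs must be unique"
print(labels[0])
```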
### 🧪 Step 2 — Create a good and a bad Dockerfile

Good:

```docker
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
```
Bad (unpinned base image, `ADD` instead of `COPY`, unpinned packages, shell-form `CMD`):

```docker
FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
```
### ▶️ Step 3 — Run the script

```bash
python test_multilabel_predict.py Dockerfile --debug
```

---
## 🗂 Extras

The full training and evaluation pipeline — including data preparation, training, validation, prediction, and threshold calibration — is available in the **`scripts/`** folder.

> 💬 **Note:** The scripts use **Polish comments and variable names**, written for clarity during local development; the logic is fully portable.

---
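Threshold calibration, as mentioned above, typically means sweeping candidate cut-offs on validation probabilities for each rule and keeping the F1-maximizing one; the shipped scripts may differ in detail. A self-contained sketch with toy data:

```python
def f1_at(threshold, probs, truth):
    # F1 of thresholded predictions against gold labels for one rule.
    preds = [p > threshold for p in probs]
    tp = sum(p and t for p, t in zip(preds, truth))
    fp = sum(p and not t for p, t in zip(preds, truth))
    fn = sum(t and not p for p, t in zip(preds, truth))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy validation probabilities and gold labels for a single rule.
probs = [0.9, 0.8, 0.65, 0.4, 0.3, 0.2]
truth = [1, 1, 1, 1, 0, 0]

# Sweep a coarse grid and keep the F1-maximizing threshold.
grid = [i / 20 for i in range(1, 20)]
best = max(grid, key=lambda t: f1_at(t, probs, truth))
print(best, round(f1_at(best, probs, truth), 3))
```

Calibrating one threshold per rule instead of a global 0.5 can noticeably help rules whose probability distributions are skewed.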
## 📘 License

MIT

---
## 🙌 Credits

- Labels generated with [Hadolint](https://github.com/hadolint/hadolint)
- Powered by [Hugging Face Transformers](https://huggingface.co)