	---
	language: "code"
	license: "mit"
	tags:
	- dockerfile
	- hadolint
	- multilabel-classification
	- codebert
	model-index:
	- name: Multilabel Dockerfile Classifier
	  results: []
	---


	# 🧱 Dockerfile Quality Classifier – Multilabel Model

	This model predicts which Hadolint rules a given Dockerfile violates. It is a multilabel classifier trained on the 30 most frequently violated Hadolint rules.

	---

	## 🧠 Model Overview

	- Architecture: Fine-tuned `microsoft/codebert-base`
	- Task: Multi-label classification (30 labels)
	- Input: Full Dockerfile content as plain text
	- Output: For each rule → probability of violation
	- Max input length: 512 tokens
	- Threshold: 0.5 (configurable)
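
The output step described above (one probability per rule, then a cutoff) can be sketched in plain Python. This is a minimal illustration, not the model's own code: the function names `sigmoid` and `decode`, the three rule codes, and the logit values are made up for the example, and a real model emits 30 logits.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decode(logits, rule_names, threshold=0.5):
    """Return (rule, probability) pairs whose probability exceeds the threshold.

    Each label gets its own sigmoid, so several rules can fire at once
    (unlike softmax, where probabilities compete and sum to 1).
    """
    probs = [sigmoid(z) for z in logits]
    return [(rule, p) for rule, p in zip(rule_names, probs) if p > threshold]

# Hypothetical logits for three Hadolint rule codes.
rules = ["DL3007", "DL3008", "DL3020"]
logits = [2.0, -1.5, 0.3]
for rule, p in decode(logits, rules):
    print(f"{rule}: {p:.3f}")
```

Lowering `threshold` trades precision for recall: more rules are reported, including less certain ones.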

	---

	## 📚 Training Details

	- Total training files: ~15,000 Dockerfiles with at least one rule violation
	- Per-rule cap: Max 2,000 files per rule to avoid imbalance
	- Perfect (clean) files: ~1,500 examples with no Hadolint violations
	- Label source: Hadolint output (top 30 rules only)
	- Multi-hot labels: `[1, 0, 0, 1, ...]`, one entry per rule (30 in total)
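
As a rough sketch of how such label vectors could be derived from Hadolint output, the snippet below picks the most frequent rule codes and encodes each file against them. The helper names `top_rules` and `encode` and the sample data are hypothetical; the actual pipeline lives in `scripts/` and may differ.

```python
from collections import Counter

def top_rules(violations_per_file, k=30):
    """Pick the k rule codes that are violated in the most files."""
    counts = Counter(
        code
        for codes in violations_per_file
        for code in dict.fromkeys(codes)  # count each rule once per file
    )
    return [code for code, _ in counts.most_common(k)]

def encode(codes, rule_order):
    """Build a 0/1 vector: 1 where the file violates the rule at that index."""
    present = set(codes)
    return [1 if rule in present else 0 for rule in rule_order]

# Hypothetical Hadolint findings for three files.
files = [
    ["DL3008", "DL3020"],
    ["DL3008"],
    ["DL3008", "DL3020", "DL3007"],
]
order = top_rules(files, k=2)
print(order)
print([encode(f, order) for f in files])
print(encode([], order))  # a clean file maps to all zeros
```

A clean file is simply the all-zero vector, which is why examples with no violations can share the same label space.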

	---

	## 🧪 Evaluation Snapshot

	Evaluation on 6,873 labeled examples:

	| Metric          | Value |
	|-----------------|-------|
	| Micro avg F1    | 0.97  |
	| Macro avg F1    | 0.95  |
	| Weighted avg F1 | 0.97  |
	| Samples avg F1  | 0.97  |

	More metrics are available in `classification_report.csv`.
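
To make the difference between the averages concrete, here is a small sketch of micro and macro F1 on multi-hot rows. It is an illustration with made-up data, not the evaluation script; the function names are hypothetical, and the weighted and samples averages are left out for brevity.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def label_counts(y_true, y_pred):
    """Per-label (tp, fp, fn) counts from lists of multi-hot rows."""
    counts = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    return counts

def micro_f1(y_true, y_pred):
    """Pool counts over all labels first, then compute one F1."""
    counts = label_counts(y_true, y_pred)
    return f1(*(sum(c[i] for c in counts) for i in range(3)))

def macro_f1(y_true, y_pred):
    """Compute F1 per label, then take the unweighted mean."""
    counts = label_counts(y_true, y_pred)
    return sum(f1(*c) for c in counts) / len(counts)

# Made-up data: label 0 is common and always right, label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
print(f"micro={micro_f1(y_true, y_pred):.3f} macro={macro_f1(y_true, y_pred):.3f}")
```

Because macro F1 weights every rule equally, a gap between micro and macro scores usually points at weaker performance on the rarer rules.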

	---

	## 🚀 Quick Start

	### 🧪 Step 1 — Create test script

	Save as `test_multilabel_predict.py`:

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch
	from pathlib import Path
	import numpy as np
	import json
	import sys

	MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
	TOP_RULES_PATH = "top_rules.json"
	THRESHOLD = 0.5

	def main():
	    if len(sys.argv) < 2:
	        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
	        return

	    debug = "--debug" in sys.argv
	    file_path = Path(sys.argv[1])
	    if not file_path.exists():
	        print(f"File {file_path} not found.")
	        return

	    # Rule names, indexed in the same order as the model's output logits.
	    with open(TOP_RULES_PATH, encoding="utf-8") as f:
	        labels = json.load(f)
	    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
	    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
	    model.eval()

	    # Tokenize the whole Dockerfile; anything past 512 tokens is truncated.
	    text = file_path.read_text(encoding="utf-8")
	    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

	    # Sigmoid yields an independent probability per rule (multilabel).
	    with torch.no_grad():
	        logits = model(**inputs).logits
	    probs = torch.sigmoid(logits).squeeze(0).cpu().numpy()

	    triggered = [(labels[i], probs[i]) for i in range(len(labels)) if probs[i] > THRESHOLD]
	    top5 = np.argsort(probs)[-5:][::-1]

	    print(f"\n🧪 Prediction for file: {file_path.name}")
	    print(f"📄 Lines in file: {len(text.splitlines())}")

	    if triggered:
	        print(f"\n🚨 Detected violations (p > {THRESHOLD}):")
	        for rule, p in triggered:
	            print(f" - {rule}: {p:.3f}")
	    else:
	        print("✅ No violations detected.")

	    if debug:
	        print("\n🛠 DEBUG INFO:")
	        print(f"📝 Text snippet:\n{text[:300]}")
	        # Count real tokens only; input_ids is always padded to 512.
	        print(f"🔢 Token count: {int(inputs['attention_mask'].sum())}")
	        print(f"📈 Logits: {logits.squeeze(0).tolist()}")
	        print("\n🔥 Top 5 predictions:")
	        for idx in top5:
	            print(f" - {labels[idx]}: {probs[idx]:.3f}")

	if __name__ == "__main__":
	    main()
	```

	Make sure `top_rules.json` is in the directory you run the script from; the script opens it by relative path.
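
The script indexes `top_rules.json` by position, so the file has to be a JSON list whose order matches the model's 30 outputs. The check below is an optional sketch under the assumption that the entries are rule-code strings; `validate_rule_names` and `load_rule_names` are hypothetical helpers, not part of the repository.

```python
import json
from pathlib import Path

def validate_rule_names(names, expected=30):
    """Check that the parsed JSON is a list of `expected` strings and return it."""
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("top_rules.json should contain a JSON list of strings")
    if len(names) != expected:
        raise ValueError(f"expected {expected} rule names, found {len(names)}")
    return names

def load_rule_names(path="top_rules.json", expected=30):
    """Read the label file and validate its shape before using it."""
    return validate_rule_names(
        json.loads(Path(path).read_text(encoding="utf-8")), expected
    )
```

Failing early here is cheaper than discovering a mismatched label file through an `IndexError` or silently mislabeled predictions.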

	---

	### 📄 Step 2 — Create good and bad Dockerfile

	Good:

	```docker
	FROM node:18
	WORKDIR /app
	COPY . .
	RUN npm install
	CMD ["node", "index.js"]
	```

	Bad:

	```docker
	FROM ubuntu:latest
	RUN apt-get install python3
	ADD . /app
	WORKDIR /app
	RUN pip install flask
	CMD python3 app.py
	```

	### ▶️ Step 3 — Run the script

	```bash
	python test_multilabel_predict.py Dockerfile --debug
	```
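
To see how the predictions line up with the linter itself, you can run Hadolint on the same file with `hadolint --format json Dockerfile` and compare rule codes. The sketch below assumes that output is a JSON list of objects with a `code` field; the helper names and the sample finding are illustrative. Hadolint may also report rules outside the model's 30 labels, which will show up as "missed".

```python
import json

def codes_from_hadolint(json_text):
    """Extract the set of rule codes from Hadolint's JSON output."""
    return {item["code"] for item in json.loads(json_text)}

def compare(predicted, actual):
    """Split rule codes into agreed, missed by the model, and extra."""
    predicted, actual = set(predicted), set(actual)
    return {
        "agreed": sorted(predicted & actual),
        "missed": sorted(actual - predicted),
        "extra": sorted(predicted - actual),
    }

# Illustrative Hadolint output for one finding, plus hypothetical predictions.
sample = '[{"line": 1, "code": "DL3007", "message": "...", "column": 1, "file": "Dockerfile", "level": "warning"}]'
print(compare(["DL3007", "DL3020"], codes_from_hadolint(sample)))
```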

	---

	## 🗂 Extras

	The full training and evaluation pipeline — including data preparation, training, validation, prediction, and threshold calibration — is available in the `scripts/` folder.

	> 💬 Note: Scripts are written with Polish comments and variable names for clarity during local development. Logic is fully portable.
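
For readers who want the idea behind threshold calibration without opening the scripts, one common approach is to pick, for each rule, the cutoff that maximizes F1 on a validation set. The sketch below illustrates that approach only; it is not taken from `scripts/`, and the function names, grid, and data are assumptions.

```python
def f1_at(probs, truths, threshold):
    """F1 for one label when predicting positive wherever prob > threshold."""
    tp = sum(1 for p, t in zip(probs, truths) if p > threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, truths) if p > threshold and t == 0)
    fn = sum(1 for p, t in zip(probs, truths) if p <= threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, truths, grid=None):
    """Grid-search the threshold with the highest F1 (lowest one wins ties)."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(grid, key=lambda th: f1_at(probs, truths, th))

# Made-up validation scores for a single rule.
probs = [0.9, 0.8, 0.42, 0.22, 0.12]
truths = [1, 1, 1, 0, 0]
print(f1_at(probs, truths, 0.5), best_threshold(probs, truths))
```

Running this per rule yields 30 thresholds that can replace the single `THRESHOLD = 0.5` default when some rules are systematically under- or over-predicted.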

	---

	## 📘 License

	MIT

	---

	## 🙌 Credits

	- Based on [Hadolint](https://github.com/hadolint/hadolint)
	- Powered by [Hugging Face Transformers](https://huggingface.co)