---
language: "code"
license: "mit"
tags:
- dockerfile
- hadolint
- multilabel-classification
- codebert
model-index:
- name: Multilabel Dockerfile Classifier
  results: []
---
# 🧱 Dockerfile Quality Classifier – Multilabel Model
This model predicts **which rules are violated** in a given Dockerfile. It is a multilabel classifier trained to detect violations of the 30 most frequent Hadolint rules.
---
## 🧠 Model Overview
- **Architecture:** Fine-tuned `microsoft/codebert-base`
- **Task:** Multi-label classification (30 labels)
- **Input:** Full Dockerfile content as plain text
- **Output:** For each rule → probability of violation
- **Max input length:** 512 tokens
- **Threshold:** 0.5 (configurable)
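
To make the output format concrete, here is a minimal sketch of how per-rule logits become violation flags: a sigmoid maps each logit to an independent probability, and a rule is reported when its probability exceeds the threshold. The rule codes and logit values below are illustrative assumptions (the real model emits 30 logits in the order given by its label list). Raising the threshold generally trades recall for precision.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def violated_rules(logits, rule_names, threshold=0.5):
    """Return (rule, probability) pairs whose probability exceeds the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [(rule, p) for rule, p in zip(rule_names, probs) if p > threshold]

# Hypothetical example: three Hadolint rule codes with made-up logits.
rules = ["DL3008", "DL3007", "DL3020"]
logits = [2.0, -1.0, 0.3]

default_hits = violated_rules(logits, rules)       # default threshold of 0.5
strict_hits = violated_rules(logits, rules, 0.8)   # stricter threshold

print("threshold 0.5:", [rule for rule, _ in default_hits])
print("threshold 0.8:", [rule for rule, _ in strict_hits])
```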
---
## 📚 Training Details
- **Total training files:** ~15,000 Dockerfiles with at least one rule violation
- **Per-rule cap:** Max 2,000 files per rule to avoid imbalance
- **Perfect (clean) files:** ~1,500 examples with no Hadolint violations
- **Label source:** Hadolint output (top 30 rules only)
- **Label encoding:** multi-hot vectors such as `[1, 0, 0, 1, ...]`, one position per rule (30 in total)
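
As an illustration of the labeling step, the sketch below turns Hadolint findings into a 0/1 label vector with one position per rule. It assumes Hadolint's JSON output (`hadolint --format json`), where each finding carries a `code` field. `TOP_RULES` is a hypothetical four-rule stand-in for the real 30-rule list, and the sample findings are made up. The actual preprocessing lives in `scripts/` and may differ.

```python
import json

# Hypothetical label order; the real order comes from top_rules.json.
TOP_RULES = ["DL3008", "DL3007", "DL3020", "DL3015"]

def multi_hot(hadolint_json: str, top_rules=TOP_RULES):
    """Turn Hadolint JSON output into a 0/1 vector aligned with top_rules."""
    findings = json.loads(hadolint_json)
    codes = {finding["code"] for finding in findings}
    return [1 if rule in codes else 0 for rule in top_rules]

# Made-up sample shaped like Hadolint's JSON output (fields trimmed).
sample = json.dumps([
    {"line": 1, "code": "DL3007", "level": "warning"},
    {"line": 2, "code": "DL3008", "level": "warning"},
    {"line": 2, "code": "DL3015", "level": "info"},
    {"line": 9, "code": "DL4000", "level": "error"},  # not in TOP_RULES, so ignored
])

labels = multi_hot(sample)  # several rules can be set at once
clean = multi_hot("[]")     # a file with no findings gets an all-zero vector

print(labels)
print(clean)
```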
---
## 🧪 Evaluation Snapshot
Evaluation on 6,873 labeled examples:
| Metric | Value |
|----------------|--------|
| Micro avg F1 | 0.97 |
| Macro avg F1 | 0.95 |
| Weighted avg F1| 0.97 |
| Samples avg F1 | 0.97 |
More metrics are available in `classification_report.csv`.
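
For readers unfamiliar with the averaging modes in the table, here is a small pure-Python sketch of micro versus macro F1 on multi-hot labels. Micro F1 pools true and false positives across all labels, so frequent rules dominate. Macro F1 averages the per-rule F1 scores, so rare rules count equally. The toy labels are invented for illustration and are unrelated to the reported scores. Returning 0.0 when a label has no positives at all is a convention assumed here.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Compute (micro, macro) F1 for lists of equal-length 0/1 vectors."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    return micro, macro

# Toy data: four files, three rules. The middle rule is rare and mispredicted.
y_true = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [1, 1, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```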
---
## 🚀 Quick Start
### 🧪 Step 1: Create the test script
Save as `test_multilabel_predict.py`:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
import numpy as np
import json
import sys

MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
TOP_RULES_PATH = "top_rules.json"
THRESHOLD = 0.5


def main():
    if len(sys.argv) < 2:
        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
        return

    debug = "--debug" in sys.argv
    file_path = Path(sys.argv[1])
    if not file_path.exists():
        print(f"File {file_path} not found.")
        return

    # Rule names, in the same order as the model's output labels.
    with open(TOP_RULES_PATH, encoding="utf-8") as f:
        labels = json.load(f)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    text = file_path.read_text(encoding="utf-8")
    # Inputs longer than 512 tokens are truncated; shorter ones are padded.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits
    # Sigmoid (not softmax): each rule gets an independent probability.
    probs = torch.sigmoid(logits).squeeze(0).cpu().numpy()

    triggered = [(labels[i], probs[i]) for i in range(len(labels)) if probs[i] > THRESHOLD]
    top5 = np.argsort(probs)[-5:][::-1]

    print(f"\n🧪 Prediction for file: {file_path.name}")
    print(f"📄 Lines in file: {len(text.splitlines())}")

    if triggered:
        print(f"\n🚨 Detected violations (p > {THRESHOLD}):")
        for rule, p in triggered:
            print(f" - {rule}: {p:.3f}")
    else:
        print("✅ No violations detected.")

    if debug:
        print("\n🛠 DEBUG INFO:")
        print(f"📝 Text snippet:\n{text[:300]}")
        # Count real tokens via the attention mask; input_ids is always padded to 512.
        print(f"🔢 Token count: {int(inputs['attention_mask'].sum())}")
        print(f"📈 Logits: {logits.squeeze(0).tolist()}")
        print("\n🔥 Top 5 predictions:")
        for idx in top5:
            print(f" - {labels[idx]}: {probs[idx]:.3f}")


if __name__ == "__main__":
    main()
```
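Make sure `top_rules.json` is in the directory you run the script from: `TOP_RULES_PATH` is a relative path, so it is resolved against the current working directory.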
---
### 📄 Step 2: Create a good and a bad Dockerfile
Good:
```docker
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
```
Bad:
```docker
FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
```
### ▶️ Step 3: Run the script
```bash
python test_multilabel_predict.py Dockerfile --debug
```
---
## 🗂 Extras
The full training and evaluation pipeline (data preparation, training, validation, prediction, and threshold calibration) is available in the **`scripts/`** folder.
> 💬 **Note:** The scripts are written with **Polish comments and variable names** for clarity during local development. The logic is fully portable.
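
Threshold calibration can be pictured as follows. This is a generic sketch of one common approach (a per-rule grid search that maximizes F1 on validation data), not a description of what the scripts in `scripts/` actually do. The function names, the 0.05 grid step, and the validation data are all assumptions made for illustration. Ties resolve to the lowest threshold in the grid.

```python
def f1_at(probs, truths, threshold):
    """F1 of the rule 'predict 1 when prob > threshold' on one label."""
    tp = sum(1 for p, t in zip(probs, truths) if p > threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, truths) if p > threshold and t == 0)
    fn = sum(1 for p, t in zip(probs, truths) if p <= threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, truths, grid=None):
    """Pick the grid threshold with the highest F1 (lowest one wins ties)."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(grid, key=lambda th: f1_at(probs, truths, th))

def calibrate(prob_rows, truth_rows):
    """Calibrate one threshold per label from row-wise validation data."""
    n_labels = len(prob_rows[0])
    return [
        best_threshold([row[j] for row in prob_rows],
                       [row[j] for row in truth_rows])
        for j in range(n_labels)
    ]

# Made-up validation set: six files, two rules.
val_probs = [[0.10, 0.90], [0.33, 0.20], [0.41, 0.10],
             [0.62, 0.28], [0.80, 0.15], [0.95, 0.05]]
val_truth = [[0, 1], [0, 0], [1, 0], [1, 0], [1, 0], [1, 0]]

thresholds = calibrate(val_probs, val_truth)
print("per-rule thresholds:", thresholds)
```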
---
## 📘 License
MIT
---
## 🙌 Credits
- Based on [Hadolint](https://github.com/hadolint/hadolint)
- Powered by [Hugging Face Transformers](https://huggingface.co)