🧱 Dockerfile Quality Classifier – Multilabel Model

This model predicts which rules a given Dockerfile violates. It is a multilabel classifier trained to detect violations of the 30 most frequent Hadolint rules.


🧠 Model Overview

  • Architecture: Fine-tuned microsoft/codebert-base
  • Task: Multi-label classification (30 labels)
  • Input: Full Dockerfile content as plain text
  • Output: For each rule → probability of violation
  • Max input length: 512 tokens
  • Threshold: 0.5 (configurable)
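
To make the output concrete: the classification head emits one logit per rule, and a sigmoid (not a softmax) is applied to each, since rule violations are independent of one another. A minimal sketch in plain PyTorch, with made-up logits for 3 of the 30 rules:

import torch

logits = torch.tensor([[2.3, -1.7, 0.4]])  # example logits for 3 of the 30 rules
probs = torch.sigmoid(logits)              # per-rule violation probabilities
violated = probs > 0.5                     # apply the default threshold
print(probs)     # tensor([[0.9089, 0.1545, 0.5987]])
print(violated)  # tensor([[ True, False,  True]])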

📚 Training Details

  • Total training files: ~15,000 Dockerfiles with at least one rule violation
  • Per-rule cap: max 2,000 files per rule to limit class imbalance
  • Perfect (clean) files: ~1,500 examples with no Hadolint violations
  • Label source: Hadolint output (top 30 rules only)
  • Multi-hot labels: a 30-dimensional binary vector, e.g. [1, 0, 0, 1, ...]
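
For illustration, a multi-hot label vector can be derived from Hadolint's JSON output roughly as follows. This is a sketch, not the actual training pipeline; it assumes hadolint is on PATH and that top_rules.json holds the 30 rule IDs in label order:

import json
import subprocess

with open("top_rules.json", encoding="utf-8") as f:
    TOP_RULES = json.load(f)  # e.g. ["DL3008", "DL3015", ...]

def label_vector(dockerfile_path: str) -> list[int]:
    # Collect the rule codes Hadolint reports for this file.
    result = subprocess.run(
        ["hadolint", "--format", "json", dockerfile_path],
        capture_output=True, text=True,
    )
    violated = {finding["code"] for finding in json.loads(result.stdout or "[]")}
    # Multi-hot encoding: 1 where the rule is violated, 0 otherwise.
    return [1 if rule in violated else 0 for rule in TOP_RULES]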

🧪 Evaluation Snapshot

Evaluation on 6,873 labeled examples:

Metric            Value
Micro avg F1      0.97
Macro avg F1      0.95
Weighted avg F1   0.97
Samples avg F1    0.97

More metrics are available in classification_report.csv.
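
The averages above match the format of scikit-learn's classification_report, which presumably produced classification_report.csv; they can be recomputed from saved predictions roughly like this (placeholder arrays stand in for the real label matrices):

import numpy as np
from sklearn.metrics import classification_report

# Placeholder (n_samples, 30) binary matrices; substitute the real data.
y_true = np.random.randint(0, 2, size=(100, 30))
y_pred = y_true.copy()
# Prints per-rule precision/recall/F1 plus micro, macro, weighted and samples averages.
print(classification_report(y_true, y_pred, zero_division=0))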


🚀 Quick Start

🧪 Step 1 – Create test script

Save as test_multilabel_predict.py:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
import numpy as np
import json
import sys

MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
TOP_RULES_PATH = "top_rules.json"
THRESHOLD = 0.5

def main():
    if len(sys.argv) < 2:
        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
        return

    debug = "--debug" in sys.argv
    file_path = Path(sys.argv[1])
    if not file_path.exists():
        print(f"File {file_path} not found.")
        return

    # Load the ordered list of rule IDs, one per output label.
    with open(TOP_RULES_PATH, encoding="utf-8") as f:
        labels = json.load(f)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    text = file_path.read_text(encoding="utf-8")
    # Tokenize the Dockerfile; anything past 512 tokens is truncated. Padding to
    # max_length is unnecessary for a single input and would inflate the debug token count.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits
        # Sigmoid per label: probabilities are independent across rules.
        probs = torch.sigmoid(logits).squeeze().cpu().numpy()

    # Rules whose probability exceeds the threshold, plus the top-5 overall.
    triggered = [(labels[i], probs[i]) for i in range(len(labels)) if probs[i] > THRESHOLD]
    top5 = np.argsort(probs)[-5:][::-1]

    print(f"\nπŸ§ͺ Prediction for file: {file_path.name}")
    print(f"πŸ“„ Lines in file: {len(text.splitlines())}")

    if triggered:
        print(f"\n🚨 Detected violations (p > {THRESHOLD}):")
        for rule, p in triggered:
            print(f" - {rule}: {p:.3f}")
    else:
        print("✅ No violations detected.")

    if debug:
        print("\n🛠 DEBUG INFO:")
        print(f"📝 Text snippet:\n{text[:300]}")
        print(f"🔒 Token count: {len(inputs['input_ids'][0])}")
        print(f"📈 Logits: {logits.squeeze().tolist()}")
        print("\n🔥 Top 5 predictions:")
        for idx in top5:
            print(f" - {labels[idx]}: {probs[idx]:.3f}")

if __name__ == "__main__":
    main()

Make sure top_rules.json is available next to the script.
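
For reference, top_rules.json is expected to be a flat JSON list of rule IDs in the model's label order, along these lines (entries are illustrative, not the exact contents):

["DL3006", "DL3008", "DL3013", "DL3015", "DL3020", "..."]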


📄 Step 2 – Create good and bad Dockerfiles

Good:

FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]

Bad:

FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
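
The bad file should trip several of the modeled rules, for example DL3007 (image tagged latest), DL3020 (ADD used where COPY suffices), DL3008 and DL3013 (unpinned apt-get and pip packages), and DL3025 (shell-form CMD), assuming those rules are among the model's top 30.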

▶️ Step 3 – Run the script

python test_multilabel_predict.py Dockerfile --debug
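
Without --debug, output for the bad file above would look roughly like this (the probabilities are illustrative placeholders, not actual model output):

🧪 Prediction for file: Dockerfile
📄 Lines in file: 6

🚨 Detected violations (p > 0.5):
 - DL3007: 0.981
 - DL3020: 0.874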

🗂 Extras

The full training and evaluation pipeline (data preparation, training, validation, prediction, and threshold calibration) is available in the scripts/ folder.

💬 Note: The scripts use Polish comments and variable names (written for clarity during local development); the logic is fully portable.


📘 License

MIT


🙌 Credits
