🧱 Dockerfile Quality Classifier – Multilabel Model

This model predicts which rules a given Dockerfile violates. It is a multilabel classifier trained to detect violations of the 30 most frequent Hadolint rules.


🧠 Model Overview

  • Architecture: Fine-tuned microsoft/codebert-base
  • Task: Multi-label classification (30 labels)
  • Input: Full Dockerfile content as plain text
  • Output: For each rule → probability of violation
  • Max input length: 512 tokens
  • Threshold: 0.5 (configurable)
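
To make the output concrete: the classification head emits one logit per rule, and a sigmoid (not a softmax) is applied to each, since rule violations are independent of one another. A minimal sketch in plain PyTorch, with made-up logits for 3 of the 30 rules:

import torch

logits = torch.tensor([[2.3, -1.7, 0.4]])  # example logits for 3 of the 30 rules
probs = torch.sigmoid(logits)              # per-rule violation probabilities
violated = probs > 0.5                     # apply the default threshold
print(probs)     # tensor([[0.9089, 0.1545, 0.5987]])
print(violated)  # tensor([[ True, False,  True]])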

📚 Training Details

  • Total training files: ~15,000 Dockerfiles with at least one rule violation
  • Per-rule cap: max 2,000 files per rule to limit class imbalance
  • Perfect (clean) files: ~1,500 examples with no Hadolint violations
  • Label source: Hadolint output (top 30 rules only)
  • Multi-hot labels: a 30-dimensional binary vector, e.g. [1, 0, 0, 1, ...]
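
For illustration, a multi-hot label vector can be derived from Hadolint's JSON output roughly as follows. This is a sketch, not the actual training pipeline; it assumes hadolint is on PATH and that top_rules.json holds the 30 rule IDs in label order:

import json
import subprocess

with open("top_rules.json", encoding="utf-8") as f:
    TOP_RULES = json.load(f)  # e.g. ["DL3008", "DL3015", ...]

def label_vector(dockerfile_path: str) -> list[int]:
    # Collect the rule codes Hadolint reports for this file.
    result = subprocess.run(
        ["hadolint", "--format", "json", dockerfile_path],
        capture_output=True, text=True,
    )
    violated = {finding["code"] for finding in json.loads(result.stdout or "[]")}
    # Multi-hot encoding: 1 where the rule is violated, 0 otherwise.
    return [1 if rule in violated else 0 for rule in TOP_RULES]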

🧪 Evaluation Snapshot

Evaluation on 6,873 labeled examples:

Metric            Value
Micro avg F1      0.97
Macro avg F1      0.95
Weighted avg F1   0.97
Samples avg F1    0.97

More metrics are available in classification_report.csv.
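
The averages above match the format of scikit-learn's classification_report, which presumably produced classification_report.csv; they can be recomputed from saved predictions roughly like this (placeholder arrays stand in for the real label matrices):

import numpy as np
from sklearn.metrics import classification_report

# Placeholder (n_samples, 30) binary matrices; substitute the real data.
y_true = np.random.randint(0, 2, size=(100, 30))
y_pred = y_true.copy()
# Prints per-rule precision/recall/F1 plus micro, macro, weighted and samples averages.
print(classification_report(y_true, y_pred, zero_division=0))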


🚀 Quick Start

🧪 Step 1 – Create test script

Save as test_multilabel_predict.py:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
import numpy as np
import json
import sys

MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
TOP_RULES_PATH = "top_rules.json"
THRESHOLD = 0.5

def main():
    if len(sys.argv) < 2:
        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
        return

    debug = "--debug" in sys.argv
    file_path = Path(sys.argv[1])
    if not file_path.exists():
        print(f"File {file_path} not found.")
        return

    # Load the ordered list of rule IDs, one per output label.
    with open(TOP_RULES_PATH, encoding="utf-8") as f:
        labels = json.load(f)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    text = file_path.read_text(encoding="utf-8")
    # Tokenize the Dockerfile; anything past 512 tokens is truncated. Padding to
    # max_length is unnecessary for a single input and would inflate the debug token count.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits
        # Sigmoid per label: probabilities are independent across rules.
        probs = torch.sigmoid(logits).squeeze().cpu().numpy()

    # Rules whose probability exceeds the threshold, plus the top-5 overall.
    triggered = [(labels[i], probs[i]) for i in range(len(labels)) if probs[i] > THRESHOLD]
    top5 = np.argsort(probs)[-5:][::-1]

    print(f"\nπŸ§ͺ Prediction for file: {file_path.name}")
    print(f"πŸ“„ Lines in file: {len(text.splitlines())}")

    if triggered:
        print(f"\n🚨 Detected violations (p > {THRESHOLD}):")
        for rule, p in triggered:
            print(f" - {rule}: {p:.3f}")
    else:
        print("✅ No violations detected.")

    if debug:
        print("\n🛠 DEBUG INFO:")
        print(f"📝 Text snippet:\n{text[:300]}")
        print(f"🔒 Token count: {len(inputs['input_ids'][0])}")
        print(f"📈 Logits: {logits.squeeze().tolist()}")
        print("\n🔥 Top 5 predictions:")
        for idx in top5:
            print(f" - {labels[idx]}: {probs[idx]:.3f}")

if __name__ == "__main__":
    main()

Make sure top_rules.json is available next to the script.
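
For reference, top_rules.json is expected to be a flat JSON list of rule IDs in the model's label order, along these lines (entries are illustrative, not the exact contents):

["DL3006", "DL3008", "DL3013", "DL3015", "DL3020", "..."]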


📄 Step 2 – Create good and bad Dockerfiles

Good:

FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]

Bad:

FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
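
The bad file should trip several of the modeled rules, for example DL3007 (image tagged latest), DL3020 (ADD used where COPY suffices), DL3008 and DL3013 (unpinned apt-get and pip packages), and DL3025 (shell-form CMD), assuming those rules are among the model's top 30.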

▶️ Step 3 – Run the script

python test_multilabel_predict.py Dockerfile --debug
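
Without --debug, output for the bad file above would look roughly like this (the probabilities are illustrative placeholders, not actual model output):

🧪 Prediction for file: Dockerfile
📄 Lines in file: 6

🚨 Detected violations (p > 0.5):
 - DL3007: 0.981
 - DL3020: 0.874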

🗂 Extras

The full training and evaluation pipeline (data preparation, training, validation, prediction, and threshold calibration) is available in the scripts/ folder.

💬 Note: The scripts use Polish comments and variable names (written for clarity during local development); the logic is fully portable.


📘 License

MIT


🙌 Credits
