---
language: "code"
license: "mit"
tags:
  - dockerfile
  - hadolint
  - multilabel-classification
  - codebert
model-index:
  - name: Multilabel Dockerfile Classifier
    results: []
---


# 🧱 Dockerfile Quality Classifier – Multilabel Model

This model predicts **which rules are violated** in a given Dockerfile. It is a multilabel classifier trained to detect violations of the 30 most frequent Hadolint rules.

---

## 🧠 Model Overview

- **Architecture:** Fine-tuned `microsoft/codebert-base`
- **Task:** Multi-label classification (30 labels)
- **Input:** Full Dockerfile content as plain text
- **Output:** For each rule → probability of violation
- **Max input length:** 512 tokens (longer Dockerfiles are truncated)
- **Threshold:** 0.5 (configurable)
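
To make the output and threshold items concrete, here is a minimal sketch of how per-rule logits could be turned into a list of flagged rules: each logit goes through a sigmoid independently, and a rule is reported when its probability exceeds the threshold. The helper name `decode_predictions`, the rule IDs, and the logit values are illustrative assumptions, not taken from the model or its label file.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_predictions(logits, rule_names, threshold=0.5):
    """Return (rule, probability) pairs whose probability exceeds the threshold.

    Each label is scored independently, so any number of rules
    (including none) can be flagged for one Dockerfile.
    """
    if len(logits) != len(rule_names):
        raise ValueError("expected one logit per rule name")
    probs = [sigmoid(z) for z in logits]
    return [(rule, p) for rule, p in zip(rule_names, probs) if p > threshold]


# Hypothetical logits for three illustrative Hadolint rule IDs.
example_rules = ["DL3008", "DL3020", "DL3025"]
example_logits = [2.0, -1.5, 0.3]

flagged = decode_predictions(example_logits, example_rules)
for rule, p in flagged:
    print(f"{rule}: {p:.3f}")
```

Raising the threshold trades recall for precision; because labels are independent, a different threshold per rule is also possible.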

---

## 📚 Training Details

- **Total training files:** ~15,000 Dockerfiles with at least one rule violation
- **Per-rule cap:** Max 2,000 files per rule to avoid imbalance
- **Perfect (clean) files:** ~1,500 examples with no Hadolint violations
- **Label source:** Hadolint output (top 30 rules only)
- **Multi-hot labels:** `[1, 0, 0, 1, ...]` for 30 rules (several entries can be 1 at once)
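
As a rough illustration of the label format, the sketch below builds a 0/1 vector from the rule IDs reported for one Dockerfile, ignoring any rule outside the tracked list. The function name `encode_labels` and the four-rule list are assumptions for the example; the actual training scripts may do this differently.

```python
def encode_labels(violated_rules, top_rules):
    """Build a 0/1 vector with one position per tracked rule.

    Rules that are not in `top_rules` are dropped, mirroring the
    "top 30 rules only" restriction described above.
    """
    violated = set(violated_rules)
    return [1 if rule in violated else 0 for rule in top_rules]


# Hypothetical tracked rules (the real list has 30 entries).
top_rules = ["DL3008", "DL3020", "DL3025", "DL4006"]

# A file violating two tracked rules plus one untracked rule.
vec = encode_labels(["DL3025", "DL3008", "DL9999"], top_rules)
print(vec)

# A clean file maps to the all-zero vector.
clean = encode_labels([], top_rules)
print(clean)
```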

---

## 🧪 Evaluation Snapshot

Evaluation on 6,873 labeled examples:

| Metric         | Value |
|----------------|--------|
| Micro avg F1   | 0.97   |
| Macro avg F1   | 0.95   |
| Weighted avg F1| 0.97   |
| Samples avg F1 | 0.97   |

More metrics are available in `classification_report.csv`.
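
For readers unfamiliar with the averaging modes in the table, here is a small pure-Python sketch of how micro and macro F1 differ on multilabel data. The toy labels are made up, and treating an undefined F1 as 0.0 is an assumption of this sketch; it is not the evaluation code used for the numbers above.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Compute (micro, macro) F1 for lists of 0/1 label vectors."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Macro: average the per-label F1 scores, so rare labels weigh as much as common ones.
    macro = sum(f1(*c) for c in counts) / n_labels

    # Micro: pool the counts over all labels first, so frequent labels dominate.
    micro = f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return micro, macro


# Toy data: label 0 is frequent and always right, label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

A macro score below the micro score, as in the table, typically means that some less frequent rules are predicted less accurately than the common ones.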

---

## 🚀 Quick Start

### 🧪 Step 1 — Create test script

Save as `test_multilabel_predict.py`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
import numpy as np
import json
import sys

MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
TOP_RULES_PATH = "top_rules.json"
THRESHOLD = 0.5

def main():
    if len(sys.argv) < 2:
        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
        return

    debug = "--debug" in sys.argv
    file_path = Path(sys.argv[1])
    if not file_path.exists():
        print(f"File {file_path} not found.")
        return

    # Rule names, in the same order as the model's output labels
    with open(TOP_RULES_PATH, encoding="utf-8") as f:
        labels = json.load(f)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    text = file_path.read_text(encoding="utf-8")
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits).squeeze().cpu().numpy()

    triggered = [(labels[i], probs[i]) for i in range(len(labels)) if probs[i] > THRESHOLD]
    top5 = np.argsort(probs)[-5:][::-1]

    print(f"\n🧪 Prediction for file: {file_path.name}")
    print(f"📄 Lines in file: {len(text.splitlines())}")

    if triggered:
        print(f"\n🚨 Detected violations (p > {THRESHOLD}):")
        for rule, p in triggered:
            print(f" - {rule}: {p:.3f}")
    else:
        print("✅ No violations detected.")

    if debug:
        print("\n🛠 DEBUG INFO:")
        print(f"📝 Text snippet:\n{text[:300]}")
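        # With padding="max_length" the tensor is always 512 long, so count real tokens via the mask
        print(f"🔢 Token count (excluding padding): {int(inputs['attention_mask'][0].sum())}")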
        print(f"📈 Logits: {logits.squeeze().tolist()}")
        print("\n🔥 Top 5 predictions:")
        for idx in top5:
            print(f" - {labels[idx]}: {probs[idx]:.3f}")

if __name__ == "__main__":
    main()
```

Make sure `top_rules.json` is in the directory you run the script from; the path is resolved relative to the current working directory.
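
Since the script indexes the loaded labels by position, the order in `top_rules.json` has to match the model's output order. The helper below is a hedged sketch of a sanity check before inference; it assumes the file holds a JSON array of rule-name strings, and `load_rule_names` is a hypothetical name, not part of the repository.

```python
import json


def load_rule_names(path, expected_count=None):
    """Load rule names from a JSON array and run basic sanity checks.

    Assumes the file is a flat JSON array of strings, one per model label.
    """
    with open(path, encoding="utf-8") as fh:
        rules = json.load(fh)

    if not isinstance(rules, list) or not all(isinstance(r, str) for r in rules):
        raise ValueError(f"{path} should contain a JSON array of strings")
    if len(set(rules)) != len(rules):
        raise ValueError(f"{path} contains duplicate rule names")
    if expected_count is not None and len(rules) != expected_count:
        raise ValueError(
            f"{path} has {len(rules)} rules, expected {expected_count}"
        )
    return rules
```

In the script above, this could be called as `load_rule_names(TOP_RULES_PATH, expected_count=model.config.num_labels)` once the model is loaded, so a mismatched label file fails early instead of producing misattributed predictions.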

---

### 📄 Step 2 — Create a good and a bad Dockerfile

Good:

```docker
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
```

Bad:

```docker
FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
```

### ▶️ Step 3 — Run the script

```bash
python test_multilabel_predict.py Dockerfile --debug
```

---

## 🗂 Extras

The full training and evaluation pipeline — including data preparation, training, validation, prediction, and threshold calibration — is available in the **`scripts/`** folder.

> 💬 **Note:** The scripts use **Polish comments and variable names**, chosen for clarity during local development. The logic itself is fully portable.

---

## 📘 License

MIT

---

## 🙌 Credits

- Based on [Hadolint](https://github.com/hadolint/hadolint)
- Powered by [Hugging Face Transformers](https://huggingface.co)