---
language: "code"
license: "mit"
tags:
  - dockerfile
  - hadolint
  - multilabel-classification
  - codebert
model-index:
  - name: Multilabel Dockerfile Classifier
    results: []
---


# 🧱 Dockerfile Quality Classifier – Multilabel Model

This model predicts **which rules are violated** in a given Dockerfile. It is a multilabel classifier trained to detect violations of the top 30 most frequent rules from Hadolint.

---

## 🧠 Model Overview

- **Architecture:** Fine-tuned `microsoft/codebert-base`
- **Task:** Multi-label classification (30 labels)
- **Input:** Full Dockerfile content as plain text
- **Output:** For each rule → probability of violation
- **Max input length:** 512 tokens
- **Threshold:** 0.5 (configurable)
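
As a rough illustration of how the per-rule probabilities and the threshold interact, here is a minimal sketch in plain Python. The rule codes and logit values are made up for the example (the real model emits 30 logits), and `decide` is a hypothetical helper, not part of this repository.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decide(logits, rule_names, threshold=0.5):
    """Return (rule, probability) pairs whose probability exceeds the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [(rule, p) for rule, p in zip(rule_names, probs) if p > threshold]

# Illustrative values only: three rule codes and three invented logits.
example_rules = ["DL3007", "DL3008", "DL3020"]
example_logits = [2.0, -1.5, 0.3]

flagged = decide(example_logits, example_rules)
for rule, p in flagged:
    print(f"{rule}: {p:.3f}")
```

Each label is scored independently, so any number of rules (including none) can be flagged for a single Dockerfile. Raising the threshold trades recall for precision.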

---

## 📚 Training Details

- **Total training files:** ~15,000 Dockerfiles with at least one rule violation
- **Per-rule cap:** Max 2,000 files per rule to avoid imbalance
- **Perfect (clean) files:** ~1,500 examples with no Hadolint violations
- **Label source:** Hadolint output (top 30 rules only)
- **Label encoding:** Multi-hot vector such as `[1, 0, 0, 1, ...]`, one entry per rule (30 in total)
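
The label encoding described above can be sketched as follows. This is an assumption-laden illustration rather than the actual preprocessing code: `encode_labels` is a hypothetical helper, and the four-rule list stands in for the real top-30 list.

```python
def encode_labels(violated_rules, top_rules):
    """Build a 0/1 vector with a 1 at every position whose rule was reported."""
    violated = set(violated_rules)
    return [1 if rule in violated else 0 for rule in top_rules]

# Hypothetical, shortened rule list; the model itself uses 30 rules.
top_rules = ["DL3007", "DL3008", "DL3013", "DL3020"]

# Rules outside the list (DL4000 here) are simply ignored.
vector = encode_labels(["DL3020", "DL3007", "DL4000"], top_rules)
print(vector)  # [1, 0, 0, 1]

# A clean file with no reported violations maps to an all-zero vector.
clean = encode_labels([], top_rules)
print(clean)   # [0, 0, 0, 0]
```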

---

## 🧪 Evaluation Snapshot

Evaluation on 6,873 labeled examples:

| Metric         | Value |
|----------------|--------|
| Micro avg F1   | 0.97   |
| Macro avg F1   | 0.95   |
| Weighted avg F1| 0.97   |
| Samples avg F1 | 0.97   |

More metrics are available in `classification_report.csv`.
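
For readers unfamiliar with the averaging modes in the table, the sketch below shows how micro and macro F1 differ on a tiny invented dataset. It is plain Python written for illustration, not the evaluation code used for the reported numbers, and the toy labels are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Compute (micro F1, macro F1) for multi-hot label matrices."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Macro: average the per-label F1 scores, so every rule weighs the same.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels first, so frequent rules dominate.
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1(tp, fp, fn), macro

# Toy data: label 0 is frequent and always right, label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

A macro score below the micro score, as in the table above, usually means that some less frequent rules are predicted less reliably than the common ones.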

---

## 🚀 Quick Start

### 🧪 Step 1: Create the test script

Save as `test_multilabel_predict.py`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
import numpy as np
import json
import sys

MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
TOP_RULES_PATH = "top_rules.json"
THRESHOLD = 0.5

def main():
    if len(sys.argv) < 2:
        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
        return

    debug = "--debug" in sys.argv
    file_path = Path(sys.argv[1])
    if not file_path.exists():
        print(f"File {file_path} not found.")
        return

    # Load the rule names; index i here corresponds to model output i.
    with open(TOP_RULES_PATH, encoding="utf-8") as f:
        labels = json.load(f)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    text = file_path.read_text(encoding="utf-8")
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits).squeeze().cpu().numpy()

    triggered = [(labels[i], probs[i]) for i in range(len(labels)) if probs[i] > THRESHOLD]
    top5 = np.argsort(probs)[-5:][::-1]

    print(f"\n🧪 Prediction for file: {file_path.name}")
    print(f"📄 Lines in file: {len(text.splitlines())}")

    if triggered:
        print(f"\n🚨 Detected violations (p > {THRESHOLD}):")
        for rule, p in triggered:
            print(f" - {rule}: {p:.3f}")
    else:
        print("✅ No violations detected.")

    if debug:
        print("\n🛠 DEBUG INFO:")
        print(f"📝 Text snippet:\n{text[:300]}")
        # Inputs are padded to 512, so count real tokens via the attention mask.
        print(f"🔢 Token count: {int(inputs['attention_mask'].sum())}")
        print(f"📈 Logits: {logits.squeeze().tolist()}")
        print("\n🔥 Top 5 predictions:")
        for idx in top5:
            print(f" - {labels[idx]}: {probs[idx]:.3f}")

if __name__ == "__main__":
    main()
```

Make sure `top_rules.json` is available next to the script.
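
The script indexes `top_rules.json` by position, which suggests the file is a JSON array of rule names in the same order as the model outputs. Its actual contents are not shown here, so the sketch below rests on that assumption: `load_rules` is a hypothetical helper and the demo file is invented.

```python
import json
import tempfile
from pathlib import Path

def load_rules(path):
    """Load the label list and check it has the shape the script relies on."""
    with open(path, encoding="utf-8") as fh:
        rules = json.load(fh)
    if not isinstance(rules, list) or not all(isinstance(r, str) for r in rules):
        raise ValueError(f"{path} should contain a JSON array of rule names")
    return rules

# Demo with invented contents; the real file should list all 30 rules.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "top_rules.json"
    demo_path.write_text(json.dumps(["DL3007", "DL3008", "DL3020"]), encoding="utf-8")
    demo_rules = load_rules(demo_path)

print(demo_rules)
```

A check like this fails early with a readable message instead of an `IndexError` or a mislabeled prediction later on.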

---

### 📄 Step 2: Create a good and a bad Dockerfile

Good:

```docker
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
```

Bad:

```docker
FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
```

### ▶️ Step 3: Run the script

```bash
python test_multilabel_predict.py Dockerfile --debug
```

---

## 🗂 Extras

The full training and evaluation pipeline, including data preparation, training, validation, prediction, and threshold calibration, is available in the **`scripts/`** folder.

> 💬 **Note:** Scripts are written with **Polish comments and variable names** for clarity during local development. The logic is fully portable.
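
To give an idea of what threshold calibration can look like, here is a minimal sketch that sweeps candidate thresholds for a single rule and keeps the one with the best F1 on validation data. It is not the code from `scripts/`: the helper names, the candidate grid, and the toy probabilities are all assumptions made for illustration.

```python
def f1_at(threshold, probs, truths):
    """F1 for one rule when predictions are probs > threshold."""
    tp = sum(1 for p, t in zip(probs, truths) if p > threshold and t)
    fp = sum(1 for p, t in zip(probs, truths) if p > threshold and not t)
    fn = sum(1 for p, t in zip(probs, truths) if p <= threshold and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, truths, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda th: f1_at(th, probs, truths))

# Invented validation scores for one rule: one true violation scores below 0.5.
val_probs = [0.90, 0.80, 0.37, 0.22, 0.10]
val_truths = [1, 1, 1, 0, 0]

default_f1 = f1_at(0.5, val_probs, val_truths)
tuned = best_threshold(val_probs, val_truths)
print(f"F1 at 0.5: {default_f1:.2f}, F1 at {tuned:.2f}: {f1_at(tuned, val_probs, val_truths):.2f}")
```

Running the same sweep once per rule would give a per-rule threshold table that could replace the single `THRESHOLD` constant in the test script.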

---

## 📘 License

MIT

---

## 🙌 Credits

- Based on [Hadolint](https://github.com/hadolint/hadolint)
- Powered by [Hugging Face Transformers](https://huggingface.co)