---
language: "code"
license: "mit"
tags:
  - dockerfile
  - hadolint
  - binary-classification
  - codebert
model-index:
  - name: Binary Dockerfile Classifier
    results: []
---


# 🧱 Dockerfile Quality Classifier – Binary Model

This model predicts whether a given Dockerfile is:

- ✅ **GOOD** – clean and adheres to best practices (no violations of the top rules)
- ❌ **BAD** – violates at least one important Hadolint rule

It is the first step in a full ML-based Dockerfile linter.

---

## 🧠 Model Overview

- **Architecture:** Fine-tuned `microsoft/codebert-base`
- **Task:** Binary classification (`good` vs `bad`)
- **Input:** Full Dockerfile content as plain text
- **Output:** `[prob_good, prob_bad]` (softmax scores)
- **Max input length:** 512 tokens
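
As a small illustration of the output format, the sketch below turns a pair of raw logits into `[prob_good, prob_bad]` and picks the label. The logit values and the helper names `softmax` and `decode` are made up for the example; only the index order (0 = good, 1 = bad) follows the Quick Start script in this card.

```python
import math

LABELS = ("GOOD", "BAD")  # index 0 = good, index 1 = bad


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, [prob_good, prob_bad]) for a pair of logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Hypothetical logits, not real model output.
label, probs = decode([2.0, -1.0])
print(label, [round(p, 3) for p in probs])
```

The two probabilities always sum to 1, so `prob_bad` alone is enough if a custom decision threshold is preferred over the plain argmax.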

---

## 📚 Training Details

- **Data source:** Real-world and synthetic Dockerfiles
- **Labels:** Based on [Hadolint](https://github.com/hadolint/hadolint) top 30 rules
- **Bad examples:** At least one rule violated
- **Good examples:** Fully clean files
- **Class distribution:** 15,000 BAD / 1,500 GOOD (clean), a 10:1 imbalance
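
The labelling rule above can be sketched as follows. The snippet assumes Hadolint was run with JSON output (`hadolint --format json`), which yields a list of findings that each carry a `code` field. This card does not list which 30 rules were used, so `TOP_RULES` is a hypothetical three-rule stand-in and `label_from_findings` is an illustrative helper, not part of the training scripts.

```python
import json

# Hypothetical stand-in for the top 30 rules; the real list is not given in this card.
TOP_RULES = {"DL3007", "DL3008", "DL3020"}


def label_from_findings(hadolint_json: str) -> str:
    """Return "bad" if any finding matches a tracked rule, else "good"."""
    findings = json.loads(hadolint_json)
    violated = {finding["code"] for finding in findings} & TOP_RULES
    return "bad" if violated else "good"


# Findings shaped like Hadolint's JSON output (values are illustrative).
sample = json.dumps([
    {"code": "DL3007", "line": 1, "level": "warning", "message": "Using latest is prone to errors"},
    {"code": "DL4000", "line": 2, "level": "error", "message": "MAINTAINER is deprecated"},
])
print(label_from_findings(sample))  # one tracked rule is hit
print(label_from_findings("[]"))    # no findings at all
```

In this sketch a file that only trips rules outside the tracked set still counts as good, which is one possible reading of "fully clean"; the actual pipeline may be stricter.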

---

## 🧪 Evaluation Results

Evaluation on a held-out test set of 1,650 Dockerfiles:

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| good  | 0.96      | 0.91   | 0.93     | 150     |
| bad   | 0.99      | 1.00   | 0.99     | 1500    |
| **Accuracy** |       |        | **0.99** | 1650    |
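
The table is internally consistent, which can be checked with a few lines of arithmetic. The sketch below recomputes F1 from the reported precision and recall, and estimates overall accuracy from per-class recall and support. It uses only the rounded figures in the table, so the results agree to two decimals rather than exactly.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the table above.
report = {
    "good": {"precision": 0.96, "recall": 0.91, "support": 150},
    "bad": {"precision": 0.99, "recall": 1.00, "support": 1500},
}

for name, row in report.items():
    print(name, round(f1(row["precision"], row["recall"]), 2))

# Accuracy = correctly classified files / all files, with the correct
# count per class approximated as recall * support.
total = sum(row["support"] for row in report.values())
correct = sum(row["recall"] * row["support"] for row in report.values())
accuracy = correct / total
print("accuracy", round(accuracy, 2))
```

Because about 91% of the test set is BAD (1,500 of 1,650), a classifier that always answers BAD would already reach roughly 0.91 accuracy, so the GOOD-class recall is the more informative figure here.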

---

## 🚀 Quick Start

### 🧪 Step 1: Create the test script

Save this as `test_binary_predict.py`:

```python
import sys
from pathlib import Path

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "LeeSek/binary-dockerfile-model"

# Read the Dockerfile whose path is given as the first command-line argument.
path = Path(sys.argv[1])
text = path.read_text(encoding="utf-8")

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Pad or truncate to the 512-token limit; longer files are cut off.
inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=512)

# Run inference without gradients and convert logits to probabilities.
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.nn.functional.softmax(logits, dim=1).squeeze()

# Index 0 is GOOD, index 1 is BAD.
label = "GOOD" if torch.argmax(probs).item() == 0 else "BAD"
print(f"Prediction: {label} - Probabilities: good={probs[0]:.3f}, bad={probs[1]:.3f}")
```

---

### 📄 Step 2: Create a good and a bad Dockerfile

Good:

```docker
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
```

Bad:

```docker
FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
```
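
To make the difference between the two files concrete, here is a rough regex-based check for a few well-known Hadolint rules that the bad example trips (DL3007 for `:latest`, DL3020 for `ADD`, DL3025 for shell-form `CMD`). This is neither Hadolint nor the model: the `CHECKS` table and `find_issues` are illustrative, and whether these rules are among the 30 used for training is not stated in this card.

```python
import re

# A few Hadolint rule codes with deliberately simplified patterns.
CHECKS = {
    "DL3007": re.compile(r"^FROM\s+\S+:latest\b", re.IGNORECASE),
    "DL3020": re.compile(r"^ADD\s", re.IGNORECASE),
    "DL3025": re.compile(r"^(CMD|ENTRYPOINT)\s+(?![\s\[])", re.IGNORECASE),
}


def find_issues(dockerfile: str) -> list[str]:
    """Return the sorted rule codes whose pattern matches any line."""
    hits = set()
    for line in dockerfile.splitlines():
        stripped = line.strip()
        for code, pattern in CHECKS.items():
            if pattern.search(stripped):
                hits.add(code)
    return sorted(hits)


GOOD = """FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
"""

BAD = """FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
"""

print(find_issues(GOOD))
print(find_issues(BAD))
```

The bad example has further problems that a real linter reports, such as unpinned `apt-get` and `pip` packages, which this sketch does not cover.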

---

### ▶️ Step 3: Run the prediction

```bash
python test_binary_predict.py Dockerfile
```

Expected output for the good Dockerfile (exact probabilities may vary):

```
Prediction: GOOD - Probabilities: good=0.998, bad=0.002
```

---

## 🗂 Extras

The full training and evaluation pipeline (data preparation, training, validation, and prediction) is available in the **`scripts/`** folder.

> 💬 **Note:** The scripts use **Polish comments and variable names**, chosen for clarity during local development. The logic is fully portable.

---

## 📘 License

MIT

---

## 🙌 Credits

- Model powered by [Hugging Face Transformers](https://huggingface.co/transformers)
- Tokenizer: CodeBERT
- Rule definitions: [Hadolint](https://github.com/hadolint/hadolint)