LeeSek committed on
Commit
7532aef
verified
1 Parent(s): 686fe96

Upload README.md

  1. README.md +113 -0
README.md ADDED
@@ -0,0 +1,113 @@
# 🧱 Dockerfile Quality Classifier – Binary Model

This model predicts whether a given Dockerfile is:

- ✅ **GOOD** – clean and adheres to best practices (violates none of the top Hadolint rules used for labeling)
- ❌ **BAD** – violates at least one of those rules

It is the first step in a full ML-based Dockerfile linter.

---
## 🧠 Model Overview

- **Architecture:** Fine-tuned `microsoft/codebert-base`
- **Task:** Binary classification (`good` vs `bad`)
- **Input:** Full Dockerfile content as plain text
- **Output:** `[prob_good, prob_bad]` (softmax probabilities)
- **Max input length:** 512 tokens (longer inputs are truncated)

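To illustrate how the two-element output maps to a label, here is a minimal pure-Python sketch of the softmax step. The logit values and the helper names `softmax` and `to_label` are made up for illustration; index 0 is assumed to be `good` and index 1 `bad`, matching the output order listed above.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_label(probs):
    """Index 0 is assumed to be `good`, index 1 `bad`."""
    return "GOOD" if probs[0] >= probs[1] else "BAD"


# Hypothetical logits, not real model output.
example_logits = [2.0, -1.0]
probs = softmax(example_logits)
print(to_label(probs), [round(p, 3) for p in probs])
```

With these made-up logits the sketch prints `GOOD [0.953, 0.047]`. The real model produces its logits from the tokenized Dockerfile, so actual probabilities will differ.
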

---

## 📚 Training Details

- **Data source:** Real-world and synthetic Dockerfiles
- **Labels:** Based on [Hadolint](https://github.com/hadolint/hadolint) top 30 rules
- **Bad examples:** At least one rule violated
- **Good examples:** Fully clean files
- **Dataset balance:** 50/50

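The labeling rule described in this section can be sketched as follows. This is an illustration under assumptions, not the author's pipeline: the `TRACKED_RULES` set is a hypothetical stand-in for the top 30 rules (which are not listed here), `label_dockerfile` is an invented helper, and the findings are hand-written to mimic the `code` field of Hadolint's JSON output.

```python
# Hypothetical subset of rule codes; the actual top 30 rules are not listed here.
TRACKED_RULES = {"DL3006", "DL3008", "DL3020"}


def label_dockerfile(findings, tracked_rules=TRACKED_RULES):
    """Return "bad" if any finding matches a tracked rule, else "good"."""
    violated = {f["code"] for f in findings if f["code"] in tracked_rules}
    return "bad" if violated else "good"


# Hand-written findings shaped like linter output (only `code` is used).
clean = []
untracked_only = [{"code": "SC2086", "line": 4}]
with_violation = [{"code": "DL3008", "line": 3}, {"code": "SC2086", "line": 4}]

print(label_dockerfile(clean))           # good
print(label_dockerfile(untracked_only))  # good (assumes untracked rules are ignored)
print(label_dockerfile(with_violation))  # bad
```

Treating files that only trigger untracked rules as `good` is an assumption of this sketch; a stricter reading of "fully clean" would reject any finding at all.
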

---

## 🧪 Evaluation Results

Evaluation on a held-out test set of 1,650 Dockerfiles:

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| good | 0.96 | 0.91 | 0.93 | 150 |
| bad | 0.99 | 1.00 | 0.99 | 1500 |
| **Accuracy** | | | **0.99** | 1650 |

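As a consistency check on the table, the F1 scores and the overall accuracy can be recomputed from the reported precision, recall, and support. The sketch below uses only the rounded figures from the table, so the results are approximate; the variable and function names are illustrative.

```python
# Rounded figures copied from the table: (precision, recall, support).
report = {
    "good": (0.96, 0.91, 150),
    "bad": (0.99, 1.00, 1500),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def accuracy(rows):
    """Accuracy equals the support-weighted average of per-class recall."""
    total = sum(support for _, _, support in rows.values())
    correct = sum(recall * support for _, recall, support in rows.values())
    return correct / total


for name, (p, r, _) in report.items():
    print(f"{name}: F1 = {f1(p, r):.2f}")
print(f"accuracy = {accuracy(report):.2f}")
```

The recomputed values round to 0.93, 0.99, and 0.99, in line with the table. Because the test set contains ten times as many `bad` files as `good` ones, the accuracy figure is dominated by the `bad` class.
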

---

## 🚀 Quick Start

### 🧪 Step 1: Create a test script

Save this as `test_binary_predict.py`:

```python
import sys
from pathlib import Path

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Read the Dockerfile given as the first command-line argument.
path = Path(sys.argv[1])
text = path.read_text(encoding="utf-8")

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("LeeSek/binary-dockerfile-model")
model = AutoModelForSequenceClassification.from_pretrained("LeeSek/binary-dockerfile-model")
model.eval()

# Truncate (and pad) the input to the model's 512-token limit.
inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=512)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.nn.functional.softmax(logits, dim=1).squeeze()

# Index 0 is "good", index 1 is "bad".
label = "GOOD" if torch.argmax(probs).item() == 0 else "BAD"
print(f"Prediction: {label} - Probabilities: good={probs[0]:.3f}, bad={probs[1]:.3f}")
```

---

### 📄 Step 2: Create a test Dockerfile

Save the following as `Dockerfile`:

```dockerfile
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
```

---

### ▶️ Step 3: Run the prediction

```bash
python test_binary_predict.py Dockerfile
```

Expected output:

```
Prediction: GOOD - Probabilities: good=0.998, bad=0.002
```

---

## 📘 License

MIT

---

## 🙌 Credits

- Model powered by [Hugging Face Transformers](https://huggingface.co/transformers)
- Tokenizer: CodeBERT
- Rule definitions: [Hadolint](https://github.com/hadolint/hadolint)