# 🧱 Dockerfile Quality Classifier – Multilabel Model

This model predicts **which rules are violated** in a given Dockerfile. It is a multilabel classifier trained to detect violations of the 30 most frequent Hadolint rules.

---
## 🧠 Model Overview

- **Architecture:** Fine-tuned `microsoft/codebert-base`
- **Task:** Multi-label classification (30 labels)
- **Input:** Full Dockerfile content as plain text
- **Output:** A violation probability for each of the 30 rules
- **Max input length:** 512 tokens (longer files are truncated)
- **Threshold:** 0.5 (configurable)

---
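Because the head is multilabel, the 30 logits go through an element-wise sigmoid rather than a softmax, and every rule whose probability clears the threshold is flagged independently. A minimal sketch of that decision step (the rule IDs and logits below are made up for illustration):

```python
import math

THRESHOLD = 0.5

# Hypothetical rule IDs and raw logits, for illustration only —
# the real label order comes from top_rules.json shipped with the model.
labels = ["DL3007", "DL3008", "DL3020"]
logits = [2.1, -0.4, 0.9]

# Sigmoid turns each logit into an independent probability.
probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Every rule above the threshold is reported — zero, one, or many.
triggered = [(lab, p) for lab, p in zip(labels, probs) if p > THRESHOLD]
print(triggered)
```

Note that, unlike a softmax classifier, the probabilities do not sum to one, so any subset of rules can fire at once.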
## 📚 Training Details

- **Total training files:** ~15,000 Dockerfiles with at least one rule violation
- **Per-rule cap:** Max 2,000 files per rule to limit class imbalance
- **Clean files:** ~1,500 examples with no Hadolint violations
- **Label source:** Hadolint output (top 30 rules only)
- **Targets:** 30-dimensional multi-hot vectors, e.g. `[1, 0, 0, 1, ...]`

---
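The target vector for each file marks which of the 30 tracked rules Hadolint flagged, so it is multi-hot rather than strictly one-hot. A sketch with a shortened, hypothetical rule list:

```python
# Hypothetical, shortened rule list — the real one has 30 entries
# and its order must match the model's output indices.
top_rules = ["DL3007", "DL3008", "DL3013", "DL3020"]

# Rule IDs Hadolint reported for one training file (example values).
violations = {"DL3008", "DL3020"}

# Multi-hot vector: 1 where the file violates the rule, else 0.
label = [1 if rule in violations else 0 for rule in top_rules]
print(label)  # → [0, 1, 0, 1]
```

A clean file simply gets an all-zero vector.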
## 🧪 Evaluation Snapshot

Evaluation on 6,873 labeled examples:

| Metric          | Value |
|-----------------|-------|
| Micro avg F1    | 0.97  |
| Macro avg F1    | 0.95  |
| Weighted avg F1 | 0.97  |
| Samples avg F1  | 0.97  |

More metrics are available in `classification_report.csv`.

---
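Micro-averaged F1 pools true/false positives across all rules, while macro-averaged F1 averages per-rule scores, so rare rules count as much as common ones; the gap between 0.97 and 0.95 above suggests slightly weaker performance on infrequent rules. A toy two-rule illustration (the counts are invented, not the model's):

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall; 0 when there are no true positives.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Per-rule (tp, fp, fn) counts for two hypothetical rules:
# a common rule with strong scores and a rare rule with weak ones.
counts = {"rule_a": (90, 5, 5), "rule_b": (2, 1, 7)}

# Macro: average the per-rule F1 scores.
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, then compute one F1.
micro = f1(*(sum(col) for col in zip(*counts.values())))
print(f"macro={macro:.3f} micro={micro:.3f}")
```

The rare rule drags the macro score down while barely moving the micro score, which is the pattern the table above hints at.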
## 🚀 Quick Start

### 🧪 Step 1 — Create the test script

Save the following as `test_multilabel_predict.py`:
```python
from pathlib import Path
import json
import sys

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
TOP_RULES_PATH = "top_rules.json"
THRESHOLD = 0.5


def main():
    if len(sys.argv) < 2:
        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
        return

    debug = "--debug" in sys.argv
    file_path = Path(sys.argv[1])
    if not file_path.exists():
        print(f"File {file_path} not found.")
        return

    # Rule names, in the same order as the model's output labels.
    with open(TOP_RULES_PATH, encoding="utf-8") as f:
        labels = json.load(f)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    text = file_path.read_text(encoding="utf-8")
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       padding="max_length", max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits).squeeze().cpu().numpy()

    triggered = [(labels[i], probs[i]) for i in range(len(labels)) if probs[i] > THRESHOLD]
    top5 = np.argsort(probs)[-5:][::-1]

    print(f"\n🧪 Prediction for file: {file_path.name}")
    print(f"📄 Lines in file: {len(text.splitlines())}")

    if triggered:
        print(f"\n🚨 Detected violations (p > {THRESHOLD}):")
        for rule, p in triggered:
            print(f" - {rule}: {p:.3f}")
    else:
        print("✅ No violations detected.")

    if debug:
        print("\n🛠 DEBUG INFO:")
        print(f"📝 Text snippet:\n{text[:300]}")
        print(f"🔢 Token count: {len(inputs['input_ids'][0])}")
        print(f"📈 Logits: {logits.squeeze().tolist()}")
        print("\n🔥 Top 5 predictions:")
        for idx in top5:
            print(f" - {labels[idx]}: {probs[idx]:.3f}")


if __name__ == "__main__":
    main()
```
Make sure `top_rules.json` is available next to the script.

---
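The file is assumed to be a JSON array of rule IDs in exactly the order used during training, so index `i` of the model output maps to `labels[i]`. A small format sketch (the three IDs below are placeholders; the real file has 30 entries):

```python
import json

# Placeholder content mirroring the expected top_rules.json format —
# the real file ships alongside the model and lists 30 Hadolint rule IDs.
raw = '["DL3007", "DL3008", "DL3020"]'
labels = json.loads(raw)

# Sanity checks: rule IDs must be unique, and the length must match
# the model's number of output labels (30 for this model).
assert len(labels) == len(set(labels)), "rule IDs must be unique"
print(labels[0])
```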
### 🧪 Step 2 — Create a good and a bad Dockerfile

Good:

```docker
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
```
Bad (unpinned base image, `ADD` instead of `COPY`, unpinned packages, shell-form `CMD`):

```docker
FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
```
### ▶️ Step 3 — Run the script

```bash
python test_multilabel_predict.py Dockerfile --debug
```

---
## 🗂 Extras

The full training and evaluation pipeline — including data preparation, training, validation, prediction, and threshold calibration — is available in the **`scripts/`** folder.

> 💬 **Note:** The scripts use **Polish comments and variable names**, written for clarity during local development; the logic is fully portable.

---
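Threshold calibration, as mentioned above, typically means sweeping candidate cut-offs on validation probabilities for each rule and keeping the F1-maximizing one; the shipped scripts may differ in detail. A self-contained sketch with toy data:

```python
def f1_at(threshold, probs, truth):
    # F1 of thresholded predictions against gold labels for one rule.
    preds = [p > threshold for p in probs]
    tp = sum(p and t for p, t in zip(preds, truth))
    fp = sum(p and not t for p, t in zip(preds, truth))
    fn = sum(t and not p for p, t in zip(preds, truth))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy validation probabilities and gold labels for a single rule.
probs = [0.9, 0.8, 0.65, 0.4, 0.3, 0.2]
truth = [1, 1, 1, 1, 0, 0]

# Sweep a coarse grid and keep the F1-maximizing threshold.
grid = [i / 20 for i in range(1, 20)]
best = max(grid, key=lambda t: f1_at(t, probs, truth))
print(best, round(f1_at(best, probs, truth), 3))
```

Calibrating one threshold per rule instead of a global 0.5 can noticeably help rules whose probability distributions are skewed.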
## 📘 License

MIT

---
## 🙌 Credits

- Labels generated with [Hadolint](https://github.com/hadolint/hadolint)
- Powered by [Hugging Face Transformers](https://huggingface.co)