---
language: "code"
license: "mit"
tags:
- dockerfile
- hadolint
- multilabel-classification
- codebert
model-index:
- name: Multilabel Dockerfile Classifier
results: []
---
# Dockerfile Quality Classifier: Multilabel Model
This model predicts **which rules are violated** in a given Dockerfile. It is a multilabel classifier trained to detect violations of the top 30 most frequent rules from Hadolint.
---
## Model Overview
- **Architecture:** Fine-tuned `microsoft/codebert-base`
- **Task:** Multi-label classification (30 labels)
- **Input:** Full Dockerfile content as plain text
- **Output:** For each rule, the probability of violation
- **Max input length:** 512 tokens
- **Threshold:** 0.5 (configurable)
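
To make the output and threshold settings concrete, here is a minimal sketch of how per-rule probabilities can be turned into a list of flagged rules. It uses only the standard library; the label names and logits are hypothetical placeholders, not real model outputs.

```python
import math

THRESHOLD = 0.5  # default from this model card; configurable


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def triggered_rules(logits, labels, threshold=THRESHOLD):
    """Return (label, probability) pairs whose probability exceeds the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [(label, p) for label, p in zip(labels, probs) if p > threshold]


# Hypothetical label names and logits, for illustration only.
example_labels = ["RULE_A", "RULE_B", "RULE_C"]
example_logits = [2.0, -1.0, 0.0]

# Only RULE_A is flagged: a logit of 0.0 maps to exactly 0.5, which is not > 0.5.
print(triggered_rules(example_logits, example_labels))
```

Each rule is scored independently, so lowering the threshold trades precision for recall on every rule at once. Per-rule thresholds would be a natural extension, though this card does not say whether the model uses them.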
---
## Training Details
- **Total training files:** ~15,000 Dockerfiles with at least one rule violation
- **Per-rule cap:** At most 2,000 files per rule, to limit class imbalance
- **Perfect (clean) files:** ~1,500 examples with no Hadolint violations
- **Label source:** Hadolint output (top 30 rules only)
- **Label encoding:** Multi-hot vectors such as `[1, 0, 0, 1, ...]`, one position per rule (30 in total)
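
As an illustration of the label format, the sketch below builds a 0/1 vector from a list of violated rules. The rule names are hypothetical placeholders, and dropping rules outside the tracked set is an assumption based on the "top 30 rules only" note above, not the actual training code.

```python
def encode_labels(violated_rules, rule_order):
    """Build a 0/1 vector aligned with rule_order.

    Rules that are not in rule_order are ignored (assumed behavior).
    """
    violated = set(violated_rules)
    return [1 if rule in violated else 0 for rule in rule_order]


# Hypothetical rule order; the real list would have 30 entries.
rule_order = ["RULE_A", "RULE_B", "RULE_C", "RULE_D"]

# RULE_X is outside the tracked set and does not appear in the vector.
print(encode_labels(["RULE_D", "RULE_A", "RULE_X"], rule_order))  # [1, 0, 0, 1]

# A clean file maps to the all-zero vector.
print(encode_labels([], rule_order))  # [0, 0, 0, 0]
```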
---
## Evaluation Snapshot
Evaluation on 6,873 labeled examples:
| Metric | Value |
|----------------|--------|
| Micro avg F1 | 0.97 |
| Macro avg F1 | 0.95 |
| Weighted avg F1| 0.97 |
| Samples avg F1 | 0.97 |
More metrics are available in `classification_report.csv`.
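
For readers unfamiliar with the averaging modes in the table, the following sketch computes micro and macro F1 on a toy multilabel example. It is a plain-Python illustration of the standard definitions, not the evaluation script behind the reported numbers; the toy data is made up, and returning 0.0 for an undefined F1 is a convention chosen here.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when undefined (convention chosen here)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Compute (micro F1, macro F1) for multi-hot label matrices."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Macro: average the per-label F1 scores, so every label counts equally.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts first, so frequent labels dominate.
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    return micro, macro


# Toy data: label 0 is frequent and always right, label 1 is rare and always wrong.
y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 1]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")  # micro=0.75 macro=0.50
```

A macro score below the micro score, as in the table above, typically indicates that some less frequent rules are predicted less accurately than the common ones.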
---
## Quick Start
### Step 1: Create the test script
Save as `test_multilabel_predict.py`:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
import numpy as np
import json
import sys

MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
TOP_RULES_PATH = "top_rules.json"
THRESHOLD = 0.5


def main():
    if len(sys.argv) < 2:
        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
        return

    debug = "--debug" in sys.argv
    file_path = Path(sys.argv[1])
    if not file_path.exists():
        print(f"File {file_path} not found.")
        return

    # Rule names, indexed in parallel with the model's output positions.
    with open(TOP_RULES_PATH, encoding="utf-8") as f:
        labels = json.load(f)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    text = file_path.read_text(encoding="utf-8")
    # Anything beyond 512 tokens is truncated; shorter inputs are padded.
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       padding="max_length", max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits
    # Sigmoid rather than softmax: each rule is scored independently.
    probs = torch.sigmoid(logits).squeeze().cpu().numpy()

    triggered = [(labels[i], probs[i]) for i in range(len(labels)) if probs[i] > THRESHOLD]
    top5 = np.argsort(probs)[-5:][::-1]

    print(f"\nPrediction for file: {file_path.name}")
    print(f"Lines in file: {len(text.splitlines())}")

    if triggered:
        print(f"\nDetected violations (p > {THRESHOLD}):")
        for rule, p in triggered:
            print(f"  - {rule}: {p:.3f}")
    else:
        print("No violations detected.")

    if debug:
        print("\nDEBUG INFO:")
        print(f"Text snippet:\n{text[:300]}")
        # Count real tokens only; the padded length is always 512.
        print(f"Token count: {int(inputs['attention_mask'].sum())}")
        print(f"Logits: {logits.squeeze().tolist()}")
        print("\nTop 5 predictions:")
        for idx in top5:
            print(f"  - {labels[idx]}: {probs[idx]:.3f}")


if __name__ == "__main__":
    main()
```
Make sure `top_rules.json` is available next to the script.
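
The script indexes `top_rules.json` by integer position, which suggests the file holds a JSON array of rule names. Under that assumption (the file format is not documented here), a small check like the following can catch a missing or malformed file before the model is loaded. The helper name `load_rule_labels` is hypothetical.

```python
import json
from pathlib import Path


def load_rule_labels(path, expected_count=30):
    """Load rule names and verify the shape the prediction script relies on.

    Assumes the file is a JSON array of strings, one per model output.
    """
    with Path(path).open(encoding="utf-8") as fh:
        labels = json.load(fh)

    if not isinstance(labels, list) or not all(isinstance(x, str) for x in labels):
        raise ValueError(f"{path}: expected a JSON array of rule names")
    if len(labels) != expected_count:
        raise ValueError(f"{path}: expected {expected_count} rules, found {len(labels)}")
    return labels
```

Calling `load_rule_labels("top_rules.json")` in place of the bare `json.load` would fail early with a readable message if the label count does not match the model's 30 outputs.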
---
### Step 2: Create a good and a bad Dockerfile
Good:
```docker
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
```
Bad:
```docker
FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
```
### Step 3: Run the script
```bash
python test_multilabel_predict.py Dockerfile --debug
```
---
## Extras
The full training and evaluation pipeline (data preparation, training, validation, prediction, and threshold calibration) is available in the **`scripts/`** folder.
> **Note:** Scripts are written with **Polish comments and variable names** for clarity during local development. The logic is fully portable.
---
## License
MIT
---
## Credits
- Based on [Hadolint](https://github.com/hadolint/hadolint)
- Powered by [Hugging Face Transformers](https://huggingface.co)