---
language: "code"
license: "mit"
tags:
- dockerfile
- hadolint
- multilabel-classification
- codebert
model-index:
- name: Multilabel Dockerfile Classifier
  results: []
---

# Dockerfile Quality Classifier: Multilabel Model

This model predicts **which rules are violated** in a given Dockerfile. It is a multilabel classifier trained to detect violations of the 30 most frequent Hadolint rules.

---

## Model Overview

- **Architecture:** Fine-tuned `microsoft/codebert-base`
- **Task:** Multi-label classification (30 labels)
- **Input:** Full Dockerfile content as plain text
- **Output:** Probability of violation for each rule
- **Max input length:** 512 tokens
- **Threshold:** 0.5 (configurable)
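
To make the output and threshold settings concrete, here is a minimal sketch of how per-rule logits could be turned into reported violations. The rule codes are real Hadolint identifiers, but their presence among this model's 30 labels, the logit values, the helper name `predict_violations`, and the `per_rule_thresholds` option are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_violations(logits, rule_names, threshold=0.5, per_rule_thresholds=None):
    """Return (rule, probability) pairs whose probability exceeds the threshold.

    Each label gets its own sigmoid, so any number of rules can fire at once.
    `per_rule_thresholds` is a hypothetical override map, not part of the model.
    """
    per_rule_thresholds = per_rule_thresholds or {}
    hits = []
    for rule, logit in zip(rule_names, logits):
        p = sigmoid(logit)
        if p > per_rule_thresholds.get(rule, threshold):
            hits.append((rule, p))
    return hits


# Illustrative logits for three rules (the real model emits 30 values)
rules = ["DL3008", "DL3007", "DL3020"]
logits = [2.0, -1.0, 0.2]

print(predict_violations(logits, rules))
print(predict_violations(logits, rules, per_rule_thresholds={"DL3020": 0.6}))
```

With the default threshold the first and third rules are reported; raising the threshold for `DL3020` to 0.6 leaves only the first.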

---

## Training Details

- **Total training files:** ~15,000 Dockerfiles with at least one rule violation
- **Per-rule cap:** At most 2,000 files per rule, to limit class imbalance
- **Perfect (clean) files:** ~1,500 examples with no Hadolint violations
- **Label source:** Hadolint output (top 30 rules only)
- **Label encoding:** Multi-hot vector such as `[1, 0, 0, 1, ...]`, one entry per rule (30 in total)
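
The label vectors described above can be sketched as follows. This is an assumed reconstruction, not the code from the training pipeline: `encode_labels` is a hypothetical helper, and the four-rule list stands in for the real 30-rule list.

```python
def encode_labels(violated_rules, top_rules):
    """Build a 0/1 vector with a 1 at the index of every reported rule.

    Rules that Hadolint reports but that are not in `top_rules` are ignored,
    matching the "top 30 rules only" label source.
    """
    violated = set(violated_rules)
    return [1 if rule in violated else 0 for rule in top_rules]


# Illustrative subset; the real list has 30 entries
top_rules = ["DL3008", "DL3007", "DL3020", "DL3013"]

# DL4006 is outside the list, so it does not appear in the vector
print(encode_labels(["DL3020", "DL3008", "DL4006"], top_rules))  # [1, 0, 1, 0]

# A clean file maps to the all-zero vector
print(encode_labels([], top_rules))  # [0, 0, 0, 0]
```

The position of each entry is fixed by the order of the rule list, so the same list has to be used at training and prediction time.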

---

## Evaluation Snapshot

Evaluation on 6,873 labeled examples:

| Metric          | Value |
|-----------------|-------|
| Micro avg F1    | 0.97  |
| Macro avg F1    | 0.95  |
| Weighted avg F1 | 0.97  |
| Samples avg F1  | 0.97  |

More metrics are available in `classification_report.csv`.
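
For readers unfamiliar with the averaging modes in the table, the sketch below shows how micro and macro F1 differ on multi-hot predictions. It is a plain-Python illustration on made-up data, not the evaluation script that produced the numbers above; the function names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for lists of equal-length 0/1 rows."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # tp, fp, fn per label
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Macro: average the per-label F1 scores, so rare labels weigh as much as common ones
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute a single F1
    tp, fp, fn = (sum(c[k] for c in counts) for k in range(3))
    return f1(tp, fp, fn), macro


# Toy data: label 0 is always right, label 1 is always wrong
y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 1]]
print(micro_macro_f1(y_true, y_pred))  # (0.75, 0.5)
```

A macro score below the micro score, as in the table, suggests that some less frequent rules are predicted less accurately than the common ones.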

---

## Quick Start

### Step 1: Create the test script

Save as `test_multilabel_predict.py`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
import numpy as np
import json
import sys

MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
TOP_RULES_PATH = "top_rules.json"
THRESHOLD = 0.5


def main():
    if len(sys.argv) < 2:
        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
        return

    debug = "--debug" in sys.argv
    file_path = Path(sys.argv[1])
    if not file_path.exists():
        print(f"File {file_path} not found.")
        return

    # Rule names, indexed in the same order as the model's output labels
    with open(TOP_RULES_PATH, encoding="utf-8") as f:
        labels = json.load(f)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    text = file_path.read_text(encoding="utf-8")
    # Anything beyond 512 tokens is truncated
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits
    # One independent sigmoid per label (multilabel), not a softmax
    probs = torch.sigmoid(logits).squeeze(0).cpu().numpy()

    triggered = [(labels[i], probs[i]) for i in range(len(labels)) if probs[i] > THRESHOLD]
    top5 = np.argsort(probs)[-5:][::-1]

    print(f"\nPrediction for file: {file_path.name}")
    print(f"Lines in file: {len(text.splitlines())}")

    if triggered:
        print(f"\nDetected violations (p > {THRESHOLD}):")
        for rule, p in triggered:
            print(f"  - {rule}: {p:.3f}")
    else:
        print("No violations detected.")

    if debug:
        print("\nDEBUG INFO:")
        print(f"Text snippet:\n{text[:300]}")
        # Count real tokens only; input_ids is always padded to 512
        print(f"Token count: {int(inputs['attention_mask'].sum())}")
        print(f"Logits: {logits.squeeze(0).tolist()}")
        print("\nTop 5 predictions:")
        for idx in top5:
            print(f"  - {labels[idx]}: {probs[idx]:.3f}")


if __name__ == "__main__":
    main()
```

Make sure `top_rules.json` is in the directory you run the script from; the script opens it by relative path.

---

### Step 2: Create a good and a bad Dockerfile

Save one of the following examples as `Dockerfile`.

Good:

```docker
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
```

Bad:

```docker
FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
```

### Step 3: Run the script

```bash
python test_multilabel_predict.py Dockerfile --debug
```

---

## Extras

The full training and evaluation pipeline, including data preparation, training, validation, prediction, and threshold calibration, is available in the **`scripts/`** folder.

> **Note:** The scripts use **Polish comments and variable names**, a leftover from local development. The logic itself is fully portable.
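
Threshold calibration is listed among the pipeline steps. As a rough idea of what such a step can look like, the sketch below picks, for a single rule, the threshold that maximizes F1 on held-out predictions. This is one common approach and an assumption on our part; the scripts in `scripts/` may calibrate differently. The function names, the grid, and the data are hypothetical.

```python
def f1_at(threshold, probs, truths):
    """F1 for one label when predicting positive whenever p > threshold."""
    tp = sum(1 for p, t in zip(probs, truths) if p > threshold and t)
    fp = sum(1 for p, t in zip(probs, truths) if p > threshold and not t)
    fn = sum(1 for p, t in zip(probs, truths) if p <= threshold and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(probs, truths, grid=None):
    """Return the grid value with the highest F1 (the lowest such value on ties)."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(grid, key=lambda th: f1_at(th, probs, truths))


# Made-up validation scores for one rule: positives score 0.47 and above
val_probs = [0.90, 0.80, 0.47, 0.32, 0.10]
val_truths = [1, 1, 1, 0, 0]

best = calibrate_threshold(val_probs, val_truths)
print(f"best threshold: {best:.2f}")  # 0.35 separates 0.47 from 0.32
```

Running this once per rule yields 30 thresholds that can replace the single global `THRESHOLD` in the Quick Start script.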

---

## License

MIT

---

## Credits

- Based on [Hadolint](https://github.com/hadolint/hadolint)
- Powered by [Hugging Face Transformers](https://huggingface.co)