	---
	license: mit
	datasets:
	- mahdin70/cwe_enriched_balanced_bigvul_primevul
	base_model:
	- microsoft/unixcoder-base
	library_name: transformers
	---

	# UnixCoder-VulnCWE - Fine-Tuned UnixCoder for Vulnerability and CWE Classification

	## Model Overview
	This model is a fine-tuned version of microsoft/unixcoder-base, trained on a curated, CWE-enriched dataset for vulnerability detection and CWE classification. It predicts whether a given code snippet is vulnerable and, if so, which CWE category the vulnerability falls under.

	## Dataset
	The model was fine-tuned using the dataset [mahdin70/cwe_enriched_balanced_bigvul_primevul](https://huggingface.co/datasets/mahdin70/cwe_enriched_balanced_bigvul_primevul). The dataset contains both vulnerable and non-vulnerable code samples and is enriched with CWE metadata.

	### CWE IDs Covered:
	1. CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
	2. CWE-20: Improper Input Validation
	3. CWE-125: Out-of-bounds Read
	4. CWE-399: Resource Management Errors
	5. CWE-200: Information Exposure
	6. CWE-787: Out-of-bounds Write
	7. CWE-264: Permissions, Privileges, and Access Controls
	8. CWE-416: Use After Free
	9. CWE-476: NULL Pointer Dereference
	10. CWE-190: Integer Overflow or Wraparound
	11. CWE-189: Numeric Errors
	12. CWE-362: Concurrent Execution using Shared Resource with Improper Synchronization

	---

	## Model Training
	The model was trained for 3 epochs with the following configuration:
	- Learning Rate: 2e-5
	- Weight Decay: 0.01
	- Batch Size: 8
	- Optimizer: AdamW
	- Scheduler: Linear
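
The linear schedule listed above can be sketched in plain Python. This is an illustration only: it assumes zero warmup steps and decay all the way to zero, neither of which the card states. `linear_lr` is a hypothetical helper, and the step count comes from the training summary.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step`: optional linear warmup, then linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 2958  # total training steps reported in the training summary

# ASSUMPTION: no warmup; the card does not list a warmup setting.
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step}: lr = {linear_lr(step, TOTAL_STEPS):.2e}")
```

Under these assumptions the rate starts at 2e-5, is halved at the midpoint, and reaches zero at the final step.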

	### Loss and Validation Metrics Per Epoch:
	| Epoch | Training Loss | Validation Loss | Vul Accuracy | Vul Precision | Vul Recall | Vul F1 | CWE Accuracy |
	|-------|---------------|-----------------|--------------|---------------|------------|--------|--------------|
	| 1 | 1.3732 | 1.2689 | 0.8220 | 0.8831 | 0.6231 | 0.7307 | 0.4032 |
	| 2 | 1.0318 | 1.1613 | 0.8229 | 0.8238 | 0.6907 | 0.7514 | 0.4903 |
	| 3 | 0.8192 | 1.1871 | 0.8158 | 0.7997 | 0.6999 | 0.7465 | 0.5326 |

	#### Training Summary:
	- Total Training Steps: 2958
	- Training Loss: 1.1267
	- Training Time: 2687.8 seconds (~45 minutes)
	- Training Speed: 17.6 samples per second
	- Steps Per Second: 1.1
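
The summary figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers reported above; the derived samples-per-step value is an inference from those numbers, not something the card states.

```python
total_steps = 2958
train_seconds = 2687.8
samples_per_second = 17.6

steps_per_second = total_steps / train_seconds
minutes = train_seconds / 60
# Implied number of samples consumed per optimizer step.
samples_per_step = samples_per_second / steps_per_second

print(f"steps/s:      {steps_per_second:.2f}")
print(f"minutes:      {minutes:.1f}")
print(f"samples/step: {samples_per_step:.1f}")
```

The implied value of about 16 samples per step is twice the listed batch size of 8. That could reflect a per-device batch size on two devices or gradient accumulation, but the card does not say which.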

	---

	## Model Evaluation (Test Set Results)
	The model was evaluated on the test set with the following metrics:

	### Vulnerability Detection Metrics:
	- Accuracy: 82.73%
	- Precision: 82.15%
	- Recall: 70.86%
	- F1-Score: 76.09%
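
As a quick sanity check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two values above. `f1_score` here is a small illustrative helper, not part of the model's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set precision and recall for vulnerability detection.
print(f"F1 = {f1_score(0.8215, 0.7086):.4f}")  # matches the reported 76.09%
```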

	### CWE Classification Metrics:
	- Accuracy: 51.46%
	- Precision: 51.11%
	- Recall: 51.46%
	- F1-Score: 50.65%
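
CWE recall and CWE accuracy are identical here (51.46%). That is what support-weighted averaging over classes produces, because weighted recall reduces to the overall fraction of correct predictions. The card does not say which averaging was used, so treat this as a plausible reading. A minimal pure-Python sketch with made-up toy labels:

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights proportional to class support."""
    n = len(y_true)
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum((support[c] / n) * (correct[c] / support[c]) for c in support)


# Toy labels for illustration only; these are not model outputs.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 0, 1, 2, 2, 2, 0, 2, 1]

print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```

Both functions return the same value for any label set, since the support terms cancel in the weighted sum.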

	---

	## How to Use the Model
	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	# trust_remote_code=True lets transformers run the custom model code in the repository.
	model = AutoModel.from_pretrained("mahdin70/UnixCoder-VulnCWE", trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
	model.eval()  # disable dropout for inference

	code_snippet = "int main() { int arr[10]; arr[11] = 5; return 0; }"
	inputs = tokenizer(code_snippet, return_tensors="pt")
	with torch.no_grad():  # gradients are not needed at inference time
	    outputs = model(**inputs)

	# The model returns one set of logits per task.
	vul_logits = outputs["vul_logits"]
	cwe_logits = outputs["cwe_logits"]

	vul_pred = vul_logits.argmax(dim=1).item()
	cwe_pred = cwe_logits.argmax(dim=1).item()  # class index, not a CWE number

	print(f"Vulnerability: {'Vulnerable' if vul_pred == 1 else 'Non-vulnerable'}")
	print(f"CWE class index: {cwe_pred if vul_pred == 1 else 'N/A'}")
	```
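
The CWE head's argmax is a class index, not a CWE number, so it has to be mapped back to a label. The card does not document that mapping. The sketch below assumes the index order follows the "CWE IDs Covered" list, which is an unverified assumption; `CWE_LABELS` and `decode` are hypothetical names. Check the ordering against the training code or dataset before relying on it.

```python
import math

# ASSUMPTION: class indices follow the order of the "CWE IDs Covered" list.
CWE_LABELS = [
    "CWE-119", "CWE-20", "CWE-125", "CWE-399", "CWE-200", "CWE-787",
    "CWE-264", "CWE-416", "CWE-476", "CWE-190", "CWE-189", "CWE-362",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(vul_logits, cwe_logits, labels=CWE_LABELS):
    """Turn raw logits for one sample into a readable prediction."""
    vul_probs = softmax(vul_logits)
    if vul_probs[1] <= vul_probs[0]:
        return {"vulnerable": False, "confidence": vul_probs[0], "cwe": None}
    cwe_probs = softmax(cwe_logits)
    idx = max(range(len(cwe_probs)), key=cwe_probs.__getitem__)
    # Fall back to a generic name if the head has more classes than labels.
    label = labels[idx] if idx < len(labels) else f"class_{idx}"
    return {"vulnerable": True, "confidence": vul_probs[1],
            "cwe": label, "cwe_confidence": cwe_probs[idx]}
```

With the tensors from the usage example, a call would look like `decode(vul_logits[0].tolist(), cwe_logits[0].tolist())`.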

	## Limitations and Future Improvements
	- CWE classification accuracy is limited (51.46% on the test set). More advanced architectures or better class balancing in the training data could improve it.
	- The model may perform poorly on edge cases and on CWEs outside the 12 categories it was trained on.