	---
	library_name: transformers
	tags:
	- Vulnerability
	- C/C++
	- Detection
	datasets:
	- DetectVul/devign
	language:
	- en
	base_model:
	- microsoft/unixcoder-base
	---

	# Model Card: UniXcoder for Code Vulnerability Detection

	## Model Summary
	This model is a fine-tuned version of Microsoft's UniXcoder for detecting vulnerabilities in C/C++ code. It was trained on the `DetectVul/devign` dataset and reaches 68.34% accuracy and a 62.14% F1 score on the evaluation split. Given a code snippet, the model classifies it as either safe (0) or vulnerable (1).

	## Model Details

	- Developed by: mahdin70 (Mukit Mahdin)
	- Finetuned from: `microsoft/unixcoder-base`
	- Language(s): English (for code comments & metadata), C/C++
	- License: MIT
	- Task: Code vulnerability detection
	- Dataset Used: `DetectVul/devign`
	- Architecture: Transformer-based sequence classification

	## Model Sources
	- Repository: [Add Hugging Face Model Link Here]
	- Paper (UniXcoder): [https://arxiv.org/abs/2203.03850](https://arxiv.org/abs/2203.03850)
	- Demo (Optional): [Add Gradio/Streamlit Link Here]

	## Uses

	### Direct Use
	This model can be used for static code analysis, security audits, and automatic vulnerability detection in software repositories. It is useful for:
	- Developers: To analyze their code for potential security flaws.
	- Security Teams: To scan repositories for code that is likely to be vulnerable.
	- Researchers: To study vulnerability detection in AI-powered systems.

	### Downstream Use
	This model can be integrated into IDE plugins, CI/CD pipelines, or security scanners to provide real-time vulnerability detection.

	### Out-of-Scope Use
	- The model is not meant to replace human security experts.
	- It may not generalize well to languages other than C/C++.
	- False positives/negatives may occur due to dataset limitations.

	## Bias, Risks, and Limitations
	- False Positives & False Negatives: The model may flag safe code as vulnerable or miss actual vulnerabilities.
	- Limited to C/C++: The model was trained on a dataset primarily composed of C and C++ code. It may not perform well on other languages.
	- Dataset Bias: The training data may not cover all possible vulnerabilities.

	### Recommendations
	Users should not rely solely on the model for security assessments. Instead, it should be used alongside manual code review and static analysis tools.

	## How to Get Started with the Model
	Use the code below to load the model and run inference on a sample code snippet:

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load the fine-tuned model
	tokenizer = AutoTokenizer.from_pretrained("your_username/unixcoder-code-vulnerability-detector")
	model = AutoModelForSequenceClassification.from_pretrained("your_username/unixcoder-code-vulnerability-detector")

	# Sample code snippet
	code_snippet = """
	void process(char *input) {
	    char buffer[50];
	    strcpy(buffer, input); // Potential buffer overflow
	}
	"""

	# Tokenize the input
	inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

	# Run inference
	with torch.no_grad():
	    outputs = model(**inputs)
	    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
	    predicted_label = torch.argmax(predictions, dim=1).item()

	# Output the result
	print("⚠️ Vulnerable Code" if predicted_label == 1 else "✅ Safe Code")
	```
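The snippet above picks the class with the highest score. Because this card reports higher precision than recall, some users may prefer to flag code at a lower probability. The sketch below shows that arithmetic on a pair of raw logits using only the standard library, so it can be read without downloading the model. The function names and the `threshold` parameter are illustrative assumptions, not part of the model's API.

```python
import math
from typing import Sequence


def vulnerable_probability(logits: Sequence[float]) -> float:
    """Softmax over two logits [safe, vulnerable]; returns P(vulnerable)."""
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [safe, vulnerable]")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def label_with_threshold(logits: Sequence[float], threshold: float = 0.5) -> int:
    """Return 1 (vulnerable) when P(vulnerable) >= threshold, else 0 (safe)."""
    return 1 if vulnerable_probability(logits) >= threshold else 0
```

With the objects from the snippet above, this could be called as `label_with_threshold(outputs.logits[0].tolist(), threshold=0.3)`. At the default threshold of 0.5 the result matches the arg-max label except for exact ties. Any other threshold should be chosen on validation data rather than guessed.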

	## Training Details

	### Training Data
	- Dataset: `DetectVul/devign`
	- Classes: `0 (Safe)`, `1 (Vulnerable)`
	- Size: 50,000+ code snippets

	### Training Procedure
	- Optimizer: AdamW
	- Loss Function: Cross-Entropy Loss
	- Batch Size: 8
	- Learning Rate: 2e-5
	- Epochs: 3
	- Hardware Used: 2x T4 GPU
	- Mixed Precision: FP16
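The training script itself is not included in this card. As a sketch only, the listed hyperparameters could be expressed as Hugging Face `TrainingArguments` as shown below. This assumes the `Trainer` API was used and that the batch size of 8 is per device, neither of which the card confirms. `HYPERPARAMS` and `build_training_args` are hypothetical names.

```python
from typing import Any, Dict

# Values copied from the list above; the keys are TrainingArguments parameters.
HYPERPARAMS: Dict[str, Any] = {
    "per_device_train_batch_size": 8,  # assumption: "Batch Size: 8" is per device
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "optim": "adamw_torch",  # AdamW
    "fp16": True,  # mixed precision; requires a CUDA-capable GPU
}


def build_training_args(output_dir: str):
    """Build TrainingArguments from HYPERPARAMS (imports transformers lazily)."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

No loss function is set here because `AutoModelForSequenceClassification` with two labels and integer class labels computes cross-entropy loss by default.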

	### Training Metrics
	| Metric | Score |
	|---------|--------|
	| Train Loss | 0.4835 |
	| Evaluation Loss | 0.6855 |
	| Accuracy | 68.34% |
	| F1 Score | 62.14% |
	| Precision | 69.18% |
	| Recall | 56.40% |

	## Evaluation

	### Testing Data & Metrics
	The model was evaluated using 20% of the dataset, with the following results:

	- Evaluation Accuracy: 68.34%
	- F1 Score: 62.14%
	- Precision: 69.18%
	- Recall: 56.40%
	- Evaluation Runtime: 41.16 sec
	- Evaluation Speed: 53.1 samples/sec
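The reported figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes F1 as the harmonic mean of the reported precision and recall, and estimates how many samples were evaluated from the runtime and throughput. The sample estimate assumes both numbers describe the same evaluation pass.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall, as fractions.
precision, recall = 0.6918, 0.5640
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.6214

# Reported evaluation runtime (seconds) and speed (samples per second).
runtime_sec, samples_per_sec = 41.16, 53.1
approx_samples = round(runtime_sec * samples_per_sec)
print(f"approx. evaluation samples = {approx_samples}")  # 2186
```

The recomputed F1 of 62.14% agrees with the reported value.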

	## Environmental Impact

	| Factor | Value |
	|---------|--------|
	| GPU Used | 2x T4 GPU |
	| Training Time | ~1 hour |

	## Citation
	If you use this model in your research or applications, please cite:

	```
	@article{unixcoder,
	  title={UniXcoder: Unified Cross-Modal Pre-training for Code Representation},
	  author={Guo, Daya and Lu, Shuai and Duan, Nan and Wang, Yanlin and Zhou, Ming and Yin, Jian},
	  year={2022},
	  journal={arXiv preprint arXiv:2203.03850}
	}
	```

	## Model Card Authors
	- Mukit Mahdin
	- Contact: [[email protected]]

