---
library_name: transformers
tags:
- Vulnerability
- C/C++
- Detection
datasets:
- DetectVul/devign
language:
- en
base_model:
- microsoft/unixcoder-base
---

# Model Card: UniXcoder for Code Vulnerability Detection

## Model Summary

This model is a fine-tuned version of **Microsoft's UniXcoder** (`microsoft/unixcoder-base`) for detecting vulnerabilities in C/C++ code. It was trained on the **DetectVul/devign** dataset and reaches **68.34% accuracy** with an **F1 score of 62.14%** on the held-out evaluation split. Given a code snippet, the model classifies it as either **safe (0)** or **vulnerable (1)**.

## Model Details

- **Developed by:** mahdin70 (Mukit Mahdin)
- **Finetuned from:** `microsoft/unixcoder-base`
- **Language(s):** English (for code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection
- **Dataset Used:** `DetectVul/devign`
- **Architecture:** Transformer-based sequence classification

## Model Sources

- **Repository:** [Add Hugging Face Model Link Here]
- **Paper (UniXcoder):** [https://arxiv.org/abs/2203.03850](https://arxiv.org/abs/2203.03850)
- **Demo (Optional):** [Add Gradio/Streamlit Link Here]

## Uses

### Direct Use

This model can be used for **static code analysis**, security audits, and automated vulnerability detection in software repositories. It is useful for:

- **Developers**: to check their code for potential security flaws.
- **Security teams**: to scan repositories for potentially vulnerable code.
- **Researchers**: to study learning-based vulnerability detection.

### Downstream Use

This model can be integrated into **IDE plugins**, **CI/CD pipelines**, or **security scanners** to provide real-time vulnerability detection.
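
As a rough illustration of the CI/CD case, the sketch below shows how a pipeline step might turn per-snippet predictions into a pass/fail exit code. It is a minimal sketch under stated assumptions: `find_flagged`, `ci_exit_code`, and `toy_classify` are hypothetical names, and `toy_classify` is a trivial stand-in for a call to this model, not its actual behavior.

```python
from typing import Callable, Dict, List


def find_flagged(snippets: Dict[str, str], classify: Callable[[str], int]) -> List[str]:
    """Return the names of snippets that the classifier labels as vulnerable (1)."""
    return sorted(name for name, code in snippets.items() if classify(code) == 1)


def ci_exit_code(flagged: List[str]) -> int:
    """Non-zero exit code when anything was flagged, so the CI step fails."""
    return 1 if flagged else 0


def toy_classify(code: str) -> int:
    """Hypothetical stand-in for the model: flags any use of strcpy."""
    return 1 if "strcpy(" in code else 0


snippets = {
    "parser.c": "void f(char *s) { char b[8]; strcpy(b, s); }",
    "util.c": "int add(int a, int b) { return a + b; }",
}
flagged = find_flagged(snippets, toy_classify)
print("Flagged for review:", flagged)
print("Exit code:", ci_exit_code(flagged))
```

In a real pipeline, `classify` would wrap the tokenizer and model call, and flagged files would more likely be routed to human review than used to block a merge outright.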

### Out-of-Scope Use

- The model is **not meant to replace human security experts**.
- It may not generalize well to **languages other than C/C++**.
- Its predictions should not be treated as definitive: false positives and false negatives may occur due to dataset limitations.

## Bias, Risks, and Limitations

- **False Positives & False Negatives:** The model may flag safe code as vulnerable or miss actual vulnerabilities.
- **Limited to C/C++:** The model was trained on a dataset primarily composed of **C and C++ code**. It may not perform well on other languages.
- **Dataset Bias:** The training data may not cover all possible vulnerabilities.

### Recommendations

Users should **not rely solely on the model** for security assessments. Instead, it should be used alongside **manual code review and static analysis tools**.
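
One way to fit the model into a review workflow is to treat its output as a triage signal rather than a verdict. The sketch below converts two-class logits into a probability for the vulnerable class and applies an adjustable threshold. The function names and threshold values are illustrative assumptions, not part of the released model. A lower threshold generally raises recall at the cost of precision, and any threshold should be tuned on validation data.

```python
import math
from typing import Sequence


def vulnerable_probability(logits: Sequence[float]) -> float:
    """Softmax probability of class 1, given logits ordered [safe, vulnerable]."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def triage(logits: Sequence[float], threshold: float = 0.5) -> str:
    """Route a snippet to manual review when the vulnerable probability is high enough."""
    return "needs review" if vulnerable_probability(logits) >= threshold else "no flag"


# Hypothetical logits, as would be read from `outputs.logits[0].tolist()`
print(triage([0.0, 2.0]))                 # confident "vulnerable" prediction
print(triage([2.0, 0.0], threshold=0.3))  # confident "safe" prediction
```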

## How to Get Started with the Model

Use the code below to load the model and run inference on a sample code snippet:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model (replace the placeholder with the actual repository ID)
tokenizer = AutoTokenizer.from_pretrained("your_username/unixcoder-code-vulnerability-detector")
model = AutoModelForSequenceClassification.from_pretrained("your_username/unixcoder-code-vulnerability-detector")

# Sample code snippet
code_snippet = """
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input);  // Potential buffer overflow
}
"""

# Tokenize the input; snippets longer than 512 tokens are truncated
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result: label 1 means vulnerable, label 0 means safe
print("⚠️ Vulnerable Code" if predicted_label == 1 else "✅ Safe Code")
```
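
Because the tokenizer call above truncates at 512 tokens, anything past that point in a long function is never seen by the model. One possible workaround, sketched below, is to split the token IDs into overlapping windows, score each window, and keep the highest vulnerable probability. This is an assumption-laden sketch rather than the procedure used to train or evaluate this model: `sliding_windows`, `max_vulnerable_probability`, and `score_window` are hypothetical names, and the default window of 510 assumes the tokenizer adds two special tokens per input.

```python
from typing import Callable, List, Sequence


def sliding_windows(token_ids: Sequence[int], window: int = 510, stride: int = 255) -> List[List[int]]:
    """Split token IDs into overlapping windows of at most `window` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the input
        start += stride
    return windows


def max_vulnerable_probability(
    token_ids: Sequence[int],
    score_window: Callable[[List[int]], float],
    window: int = 510,
    stride: int = 255,
) -> float:
    """Score every window with `score_window` and return the highest probability."""
    return max(score_window(w) for w in sliding_windows(token_ids, window, stride))
```

Here `score_window` would wrap the model call for a single window and return the softmax probability of label 1. Taking the maximum is a conservative choice that flags a function if any part of it looks vulnerable, and it will tend to increase false positives.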

## Training Details

### Training Data

- **Dataset:** `DetectVul/devign`
- **Classes:** `0 (Safe)`, `1 (Vulnerable)`
- **Size:** 50,000+ code snippets

### Training Procedure

- **Optimizer:** AdamW
- **Loss Function:** Cross-Entropy Loss
- **Batch Size:** 8
- **Learning Rate:** 2e-5
- **Epochs:** 3
- **Hardware Used:** 2x T4 GPU
- **Mixed Precision:** FP16

### Training Metrics

| Metric | Score |
|---------|--------|
| **Train Loss** | 0.4835 |
| **Evaluation Loss** | 0.6855 |
| **Accuracy** | 68.34% |
| **F1 Score** | 62.14% |
| **Precision** | 69.18% |
| **Recall** | 56.40% |

## Evaluation

### Testing Data & Metrics

The model was evaluated on a held-out **20% of the dataset**, with the following results:

- **Evaluation Accuracy:** 68.34%
- **F1 Score:** 62.14%
- **Precision:** 69.18%
- **Recall:** 56.40%
- **Evaluation Runtime:** 41.16 sec
- **Evaluation Speed:** 53.1 samples/sec
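
As a quick sanity check on the figures above, the F1 score can be recomputed from the reported precision and recall, and the approximate number of evaluation samples can be estimated from runtime and throughput. The snippet uses only the numbers listed in this section. The sample count is an estimate because the reported throughput is rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision = 0.6918  # reported precision
recall = 0.5640     # reported recall

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.6214, matching the reported 62.14%

# Runtime (seconds) times throughput (samples per second)
eval_samples = 41.16 * 53.1
print(f"Approximate evaluation samples: {round(eval_samples)}")  # 2186
```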

## Environmental Impact

| Factor | Value |
|---------|--------|
| **GPU Used** | 2x T4 GPU |
| **Training Time** | ~1 hour |

## Citation

If you use this model in your research or applications, please cite:

```
@article{unixcoder,
  title={UniXcoder: Unified Cross-Modal Pre-training for Code Representation},
  author={Guo, Daya and Lu, Shuai and Duan, Nan and Wang, Yanlin and Zhou, Ming and Yin, Jian},
  year={2022},
  journal={arXiv preprint arXiv:2203.03850}
}
```

## Model Card Authors

- **Mukit Mahdin**
- Contact: [[email protected]]