---
library_name: transformers
tags:
- Vulnerability
- C/C++
- Detection
datasets:
- DetectVul/devign
language:
- en
base_model:
- microsoft/unixcoder-base
---
# Model Card: UniXcoder for Code Vulnerability Detection
## Model Summary
This model is a fine-tuned version of **Microsoft's UniXcoder** for detecting vulnerabilities in C/C++ code. It was trained on the **DetectVul/devign** dataset and reaches **68.34% accuracy** with an **F1 score of 62.14%** on the evaluation split. Given a code snippet, the model classifies it as either **safe (0)** or **vulnerable (1)**.
## Model Details
- **Developed by:** mahdin70 (Mukit Mahdin)
- **Finetuned from:** `microsoft/unixcoder-base`
- **Language(s):** English (for code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection
- **Dataset Used:** `DetectVul/devign`
- **Architecture:** Transformer-based sequence classification
## Model Sources
- **Repository:** [Add Hugging Face Model Link Here]
- **Paper (UniXcoder):** [https://arxiv.org/abs/2203.03850](https://arxiv.org/abs/2203.03850)
- **Demo (Optional):** [Add Gradio/Streamlit Link Here]
## Uses
### Direct Use
This model can be used for **static code analysis**, security audits, and automatic vulnerability detection in software repositories. It is useful for:
- **Developers**: To analyze their code for potential security flaws.
- **Security Teams**: To scan repositories for known vulnerabilities.
- **Researchers**: To study vulnerability detection in AI-powered systems.
### Downstream Use
This model can be integrated into **IDE plugins**, **CI/CD pipelines**, or **security scanners** to provide real-time vulnerability detection.
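As a rough illustration of the CI/CD case, the sketch below walks a repository, passes each C/C++ file to a classifier, and returns a non-zero exit status when anything is flagged. The `classify` callable, the extension set, and the function names are hypothetical and not part of this model's API; `classify` stands in for a wrapper around the model that returns `0` or `1`.

```python
from pathlib import Path
from typing import Callable, Iterator, List

# Extensions treated as C/C++ sources; this set is an illustrative assumption.
C_EXTENSIONS = {".c", ".h", ".cc", ".cpp", ".hpp"}


def find_sources(root: Path) -> Iterator[Path]:
    """Yield C/C++ files under root, sorted so reports are stable."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in C_EXTENSIONS:
            yield path


def scan_repository(root: Path, classify: Callable[[str], int]) -> List[Path]:
    """Return the files that `classify` labels as vulnerable (1).

    `classify` is a hypothetical wrapper around the model: it takes
    source text and returns 0 (safe) or 1 (vulnerable).
    """
    flagged = []
    for path in find_sources(root):
        code = path.read_text(encoding="utf-8", errors="replace")
        if classify(code) == 1:
            flagged.append(path)
    return flagged


def exit_code(flagged: List[Path]) -> int:
    """Non-zero status fails the CI job when any file is flagged."""
    return 1 if flagged else 0
```

Whole files can exceed the 512-token limit used in the inference example in this card and would be truncated, so a real integration would probably split files into functions before classification.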
### Out-of-Scope Use
- The model is **not meant to replace human security experts**.
- It may not generalize well to **languages other than C/C++**.
- It should not be the sole gate for security-critical decisions, since false positives and false negatives occur due to dataset limitations.
## Bias, Risks, and Limitations
- **False Positives & False Negatives:** The model may flag safe code as vulnerable or miss actual vulnerabilities.
- **Limited to C/C++:** The model was trained on a dataset primarily composed of **C and C++ code**. It may not perform well on other languages.
- **Dataset Bias:** The training data may not cover all possible vulnerabilities.
### Recommendations
Users should **not rely solely on the model** for security assessments. Instead, it should be used alongside **manual code review and static analysis tools**.
## How to Get Started with the Model
Use the code below to load the model and run inference on a sample code snippet:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model (replace the placeholder repo ID with the real one)
tokenizer = AutoTokenizer.from_pretrained("your_username/unixcoder-code-vulnerability-detector")
model = AutoModelForSequenceClassification.from_pretrained("your_username/unixcoder-code-vulnerability-detector")
model.eval()  # make sure dropout is disabled for inference

# Sample code snippet
code_snippet = """
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
"""

# Tokenize the input (inputs longer than 512 tokens are truncated)
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and pick the most likely class
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result
print("⚠️ Vulnerable Code" if predicted_label == 1 else "✅ Safe Code")
```
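The snippet above reports only the most likely class. Because the reported recall (56.40%) is lower than the precision (69.18%), some users may prefer to flag code whenever the probability of the vulnerable class crosses a lower cutoff. The following is a minimal, dependency-free sketch of that idea; the helper names and the threshold values are illustrative assumptions, not something this card specifies.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_threshold(logits: Sequence[float], threshold: float = 0.5) -> int:
    """Return 1 (vulnerable) when P(vulnerable) >= threshold, else 0.

    Assumes index 1 is the "vulnerable" class, as stated in this card.
    The threshold is a tuning knob chosen by the user.
    """
    p_vulnerable = softmax(logits)[1]
    return 1 if p_vulnerable >= threshold else 0
```

With the inference code above, this could be called as `label_with_threshold(outputs.logits[0].tolist(), threshold=0.3)`. A lower threshold trades precision for recall, so the value is best chosen on a validation set.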
## Training Details
### Training Data
- **Dataset:** `DetectVul/devign`
- **Classes:** `0 (Safe)`, `1 (Vulnerable)`
- **Size:** 50,000+ code snippets
### Training Procedure
- **Optimizer:** AdamW
- **Loss Function:** Cross-Entropy Loss
- **Batch Size:** 8
- **Learning Rate:** 2e-5
- **Epochs:** 3
- **Hardware Used:** 2x T4 GPU
- **Mixed Precision:** FP16
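For readers who want to reproduce a similar setup, the hyperparameters above can be written as keyword arguments for `transformers.TrainingArguments`. This is a sketch under assumptions: the card does not say whether the Hugging Face `Trainer` was used, whether the batch size of 8 is per device or global, or what the output directory was, so those choices are hypothetical.

```python
# Hyperparameters from the list above, keyed by the matching
# `transformers.TrainingArguments` parameter names.
training_kwargs = {
    "output_dir": "unixcoder-devign",   # hypothetical path, not from the card
    "optim": "adamw_torch",             # AdamW
    "per_device_train_batch_size": 8,   # assumption: the card's 8 is per device
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "fp16": True,                       # mixed precision; requires a CUDA GPU
}

# With transformers installed, these could be passed as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
```

No loss function appears in the arguments because sequence classification models in `transformers` apply cross-entropy loss by default when given integer class labels and more than one output label.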
### Training Metrics
| Metric | Score |
|---------|--------|
| **Train Loss** | 0.4835 |
| **Evaluation Loss** | 0.6855 |
| **Accuracy** | 68.34% |
| **F1 Score** | 62.14% |
| **Precision** | 69.18% |
| **Recall** | 56.40% |
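The F1 score in the table is the harmonic mean of precision and recall. As a quick consistency check, the short sketch below recomputes it from the reported precision and recall; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
f1 = f1_score(0.6918, 0.5640)
print(f"F1 = {f1:.2%}")  # F1 = 62.14%
```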
## Evaluation
### Testing Data & Metrics
The model was evaluated using **20% of the dataset**, with the following results:
- **Evaluation Accuracy:** 68.34%
- **F1 Score:** 62.14%
- **Precision:** 69.18%
- **Recall:** 56.40%
- **Evaluation Runtime:** 41.16 sec
- **Evaluation Speed:** 53.1 samples/sec
## Environmental Impact
| Factor | Value |
|---------|--------|
| **GPU Used** | 2x T4 GPU |
| **Training Time** | ~1 hour |
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@article{unixcoder,
  title={UniXcoder: Unified Cross-Modal Pre-training for Code Representation},
  author={Guo, Daya and Lu, Shuai and Duan, Nan and Wang, Yanlin and Zhou, Ming and Yin, Jian},
  year={2022},
  journal={arXiv preprint arXiv:2203.03850}
}
```
## Model Card Authors
- **Mukit Mahdin**
- Contact: [[email protected]]