# Code Similarity Visualization with GraphCodeBERT
This interactive application visualizes token-level embeddings generated by [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for classical sorting algorithms. You select two algorithms, and the app compares how their tokens are represented in the model’s embedding space, using PCA to reduce the embeddings to two dimensions for plotting.
## ✒️ Reference
Martinez-Gil, J. (2025).
**Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks**.
*International Journal of Software Engineering and Knowledge Engineering*, 35(05), 657–678.
## 🚀 Features
- Selection of two classical sorting algorithms.
- Automatic tokenization and embedding via GraphCodeBERT.
- PCA-based projection into 2D space for visualization.
- Clean, static matplotlib plots showing token overlap and divergence.
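
The tokenization and embedding step listed above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function names `embed_tokens` and `display_label` are hypothetical, and the sketch feeds only the code tokens to the model (it does not build the data-flow inputs that GraphCodeBERT was pretrained with).

```python
def embed_tokens(code, model_name="microsoft/graphcodebert-base"):
    """Return (tokens, embeddings) for one code snippet.

    `embeddings` is a NumPy array of shape (n_tokens, 768) taken from the
    model's last hidden layer. Hypothetical helper, not the app's own code.
    """
    # Imported lazily so display_label() below works without the heavy dependencies.
    import torch
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(model_name)
    model = RobertaModel.from_pretrained(model_name)
    model.eval()

    # Truncate to the model's maximum input length.
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, outputs.last_hidden_state[0].numpy()


def display_label(token):
    """Strip byte-level BPE markers ("Ġ" = leading space, "Ċ" = newline) for plot labels."""
    return token.replace("Ġ", "").replace("Ċ", "")
```

Calling `embed_tokens(source_code)` downloads the model on first use, so it is best run once per snippet and cached. The returned token list includes the special `<s>` and `</s>` tokens, which you may want to drop before plotting.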
## 🧠 Technical Overview
- **Model**: [`microsoft/graphcodebert-base`](https://huggingface.co/microsoft/graphcodebert-base)
- **Tokenizer**: `RobertaTokenizer`
- **Embeddings**: Token-level outputs of GraphCodeBERT's last hidden layer
- **Reduction Technique**: Principal Component Analysis (PCA)
- **Interface**: Gradio
- **Language**: Python 3.10+
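
To make the reduction step concrete, here is a small sketch of projecting two sets of token embeddings into a shared 2D space. It assumes the PCA is fitted jointly on both algorithms so their points land on the same axes (the app may do this differently), and it uses a plain NumPy SVD to show the steps; `sklearn.decomposition.PCA(n_components=2)` should give the same coordinates up to the sign of each axis. The function name `project_pair_2d` and the random stand-in data are hypothetical.

```python
import numpy as np


def project_pair_2d(emb_a, emb_b):
    """Project two (n_tokens, dim) embedding matrices into one shared 2D PCA space."""
    stacked = np.vstack([emb_a, emb_b])
    # PCA operates on mean-centered data.
    centered = stacked - stacked.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Split the projected points back into the two algorithms.
    return coords[: len(emb_a)], coords[len(emb_a):]


# Random stand-ins for real GraphCodeBERT embeddings (768-dimensional).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(12, 768))
emb_b = rng.normal(size=(15, 768))

pts_a, pts_b = project_pair_2d(emb_a, emb_b)
print(pts_a.shape, pts_b.shape)  # (12, 2) (15, 2)
```

Fitting once on the stacked matrix keeps the two point clouds directly comparable; fitting a separate PCA per algorithm would give each its own axes and make overlap in the plot meaningless.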
## 🔬 Research Context
This tool supports research on code similarity, clone detection, and representation learning for source code. It offers insight into how GraphCodeBERT encodes common algorithmic patterns, providing a visual companion to embedding-based analysis.
## 🛠 Dependencies
All required libraries are listed in `requirements.txt`:
```
transformers
torch
scikit-learn
numpy
matplotlib
gradio
Pillow
```
## 🖥️ Intended Use
- Academic teaching and demonstration of code embeddings
- Qualitative evaluation of pretrained models for source code
- Supplementary visualization for software engineering publications
## 📬 Contact
**Jorge Martinez-Gil**
Senior Research Scientist in Computer Science