# Code Similarity Visualization with GraphCodeBERT

This interactive application visualizes token-level embeddings generated by [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for classical sorting algorithms. It supports pairwise comparison of algorithms based on their representation in the model’s embedding space, using PCA for dimensionality reduction.

## ✒️ Reference

Martinez-Gil, J. (2025). **Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks**. *International Journal of Software Engineering and Knowledge Engineering*, 35(05), 657–678.

## 🚀 Features

- Selection of two classical sorting algorithms.
- Automatic tokenization and embedding via GraphCodeBERT.
- PCA-based projection into 2D space for visualization.
- Clean, static matplotlib plots showing token overlap and divergence.

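
The projection step could look like the sketch below. It assumes that both snippets are projected with a single PCA fitted on their stacked token embeddings so that they share the same axes; the source does not say how the PCA is fitted. The function name `project_jointly` is hypothetical, random arrays stand in for real embeddings, and PCA is computed with NumPy's SVD in place of scikit-learn's `PCA` class, which should yield an equivalent projection up to axis sign and numerical tolerance.

```python
import numpy as np


def project_jointly(emb_a, emb_b, n_components=2):
    """Fit PCA on both token-embedding matrices together and split the result.

    emb_a, emb_b: arrays of shape (n_tokens, hidden_size).
    Returns two arrays of shape (n_tokens, n_components).
    """
    stacked = np.vstack([emb_a, emb_b])
    # PCA operates on mean-centered data.
    centered = stacked - stacked.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    return projected[: len(emb_a)], projected[len(emb_a):]


# Placeholder data (assumption): 12 and 15 tokens with 768-dimensional vectors.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(12, 768))
emb_b = rng.normal(size=(15, 768))
points_a, points_b = project_jointly(emb_a, emb_b)
```

Fitting on the stacked matrix keeps the two point clouds comparable; fitting a separate PCA per algorithm would give each its own axes and make overlap in the plot hard to interpret.
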
## 🧠 Technical Overview

- **Model**: [`microsoft/graphcodebert-base`](https://huggingface.co/microsoft/graphcodebert-base)
- **Tokenizer**: `RobertaTokenizer`
- **Embeddings**: Last hidden layer of GraphCodeBERT
- **Reduction Technique**: Principal Component Analysis (PCA)
- **Interface**: Gradio
- **Languages**: Python 3.10+

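
The components listed above might fit together roughly as follows. This is a minimal sketch, not the application's actual code: the function names `load_graphcodebert`, `token_embeddings`, and `readable_label` are hypothetical, the 512-token truncation limit is an assumption, and the sketch feeds only source tokens to the model (no data-flow graph). The heavy libraries are imported inside the functions so that the label helper can be used on its own.

```python
def load_graphcodebert(name="microsoft/graphcodebert-base"):
    """Load the tokenizer and model from the Hugging Face Hub (needs network access)."""
    from transformers import AutoModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    model.eval()  # inference only: disables dropout
    return tokenizer, model


def token_embeddings(code, tokenizer, model, max_length=512):
    """Return (tokens, vectors) taken from the model's last hidden layer."""
    import torch

    inputs = tokenizer(
        code, return_tensors="pt", truncation=True, max_length=max_length
    )
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    # last_hidden_state has shape (batch, n_tokens, hidden_size); take item 0.
    vectors = outputs.last_hidden_state[0].numpy()
    return tokens, vectors


def readable_label(token):
    """Make a byte-level BPE token easier to read as a plot label.

    RoBERTa's vocabulary marks a leading space with "Ġ" and a newline with "Ċ".
    """
    return token.replace("Ġ", "").replace("Ċ", "\\n")
```

Calling `tokenizer, model = load_graphcodebert()` followed by `tokens, vectors = token_embeddings(source, tokenizer, model)` should return one vector per token, including the `<s>` and `</s>` special tokens, which a plot may want to drop.
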
## 🔬 Research Context

This tool supports research on code similarity, clone detection, and representation learning for source code. It offers insight into how GraphCodeBERT encodes common algorithmic patterns, providing a visual companion to embedding-based analysis.

## 🛠 Dependencies

All required libraries are listed in `requirements.txt`:

```
transformers
torch
scikit-learn
numpy
matplotlib
gradio
Pillow
```

## 🖥️ Intended Use

- Academic teaching and demonstration of code embeddings
- Qualitative evaluation of pretrained models for source code
- Supplementary visualization for software engineering publications

## 📬 Contact

**Jorge Martinez-Gil**

Senior Research Scientist in Computer Science