Code Similarity Visualization with GraphCodeBERT
This interactive application visualizes token-level embeddings generated by GraphCodeBERT for classical sorting algorithms. It supports pairwise comparison of algorithms based on their representation in the model's embedding space, using PCA for dimensionality reduction.
Reference
Martinez-Gil, J. (2025).
Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks.
International Journal of Software Engineering and Knowledge Engineering, 35(05), 657–678.
Features
- Selection of two classical sorting algorithms.
- Automatic tokenization and embedding via GraphCodeBERT.
- PCA-based projection into 2D space for visualization.
- Clean, static matplotlib plots showing token overlap and divergence.
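The static plot described in the last feature could look roughly like the sketch below. It assumes the tokens of both algorithms have already been projected to 2D coordinates; the function name `plot_pair`, the labels, and the styling are illustrative choices, not taken from the application's source.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file or buffer, no window
import matplotlib.pyplot as plt
import numpy as np


def plot_pair(coords_a, coords_b, label_a="Algorithm A", label_b="Algorithm B"):
    """Scatter two sets of 2D token coordinates on shared axes.

    coords_a, coords_b: arrays of shape (n_tokens, 2).
    Hypothetical helper; the real application may style its plots differently.
    """
    fig, ax = plt.subplots(figsize=(7, 6))
    # One scatter per algorithm so overlap and divergence are visible at a glance.
    ax.scatter(coords_a[:, 0], coords_a[:, 1], alpha=0.6, label=label_a)
    ax.scatter(coords_b[:, 0], coords_b[:, 1], alpha=0.6, marker="x", label=label_b)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_title(f"{label_a} vs. {label_b}")
    ax.legend()
    fig.tight_layout()
    return fig
```

Calling `plot_pair(a, b, "Bubble Sort", "Quick Sort").savefig("comparison.png")` would write the figure to disk; regions where the two point clouds overlap suggest tokens the model represents similarly.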
Technical Overview
- Model: microsoft/graphcodebert-base
- Tokenizer: RobertaTokenizer
- Embeddings: Last hidden layer of GraphCodeBERT
- Reduction Technique: Principal Component Analysis (PCA)
- Interface: Gradio
- Languages: Python 3.10+
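The components listed above can be combined roughly as follows. This is a minimal sketch under several assumptions: the model is loaded with `RobertaModel`, only the code tokens are fed in (no data-flow graph input), and PCA is fitted jointly on the tokens of both snippets so they share one 2D space. The helper names `embed_tokens` and `project_pair` are hypothetical and may not match the application's actual code.

```python
import numpy as np
from sklearn.decomposition import PCA

MODEL_NAME = "microsoft/graphcodebert-base"


def embed_tokens(code: str):
    """Return (tokens, embeddings) for one snippet.

    embeddings has shape (n_tokens, hidden_size) and comes from the
    last hidden layer of the model.
    """
    # Imported lazily so project_pair can be used without the model stack.
    import torch
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME)
    model = RobertaModel.from_pretrained(MODEL_NAME)
    model.eval()

    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, outputs.last_hidden_state[0].numpy()


def project_pair(emb_a, emb_b):
    """Project two embedding matrices into a shared 2D PCA space.

    Assumption: PCA is fitted on the concatenation of both matrices,
    so the resulting coordinates are directly comparable.
    """
    stacked = np.vstack([emb_a, emb_b])
    coords = PCA(n_components=2).fit_transform(stacked)
    return coords[: len(emb_a)], coords[len(emb_a):]
```

For two source strings `src_a` and `src_b`, `project_pair(embed_tokens(src_a)[1], embed_tokens(src_b)[1])` yields one `(n_tokens, 2)` array per snippet. The first call downloads the model weights, so loading the tokenizer and model once and reusing them would be preferable in an interactive app.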
Research Context
This tool supports research on code similarity, clone detection, and representation learning for source code. It offers insight into how GraphCodeBERT encodes common algorithmic patterns, providing a visual companion to embedding-based analysis.
Dependencies
All required libraries are listed in requirements.txt:
- transformers
- torch
- scikit-learn
- numpy
- matplotlib
- gradio
- Pillow
Intended Use
- Academic teaching and demonstration of code embeddings
- Qualitative evaluation of pretrained models for source code
- Supplementary visualization for software engineering publications
Contact
Jorge Martinez-Gil
Senior Research Scientist in Computer Science