Code Similarity Visualization with GraphCodeBERT

This interactive application visualizes token-level embeddings generated by GraphCodeBERT for classical sorting algorithms. It supports pairwise comparison of algorithms based on their representation in the model’s embedding space, using PCA for dimensionality reduction.

✒️ Reference

Martinez-Gil, J. (2025).
Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks.
International Journal of Software Engineering and Knowledge Engineering, 35(05), 657–678.

🚀 Features

  • Selection of two classical sorting algorithms.
  • Automatic tokenization and embedding via GraphCodeBERT.
  • PCA-based projection into 2D space for visualization.
  • Clean, static matplotlib plots showing token overlap and divergence.
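
The tokenization and embedding step can be sketched roughly as follows. This is a minimal illustration rather than the application's actual code: the helper names `load_graphcodebert` and `embed_tokens` are hypothetical, the checkpoint and tokenizer class are the ones named in the Technical Overview, and the sketch assumes that only the source text is passed to the model (no data-flow graph) and that special tokens such as `<s>` and `</s>` are kept.

```python
MODEL_NAME = "microsoft/graphcodebert-base"


def load_graphcodebert(model_name=MODEL_NAME):
    """Load the tokenizer and encoder (downloads weights on first call)."""
    # Imported lazily so embed_tokens() can be used with any compatible
    # tokenizer/model pair without importing transformers up front.
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(model_name)
    model = RobertaModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed_tokens(code, tokenizer, model, max_length=512):
    """Return (tokens, embeddings) for one code snippet.

    embeddings holds one row per token, taken from the last hidden layer.
    """
    inputs = tokenizer(
        code, return_tensors="pt", truncation=True, max_length=max_length
    )
    outputs = model(**inputs)
    # Batch size is 1, so index 0 selects the (n_tokens, hidden_size) matrix.
    hidden = outputs.last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # detach() drops gradient tracking before converting to a NumPy array.
    return tokens, hidden.detach().cpu().numpy()
```

Calling `load_graphcodebert()` once and then `embed_tokens(source, tokenizer, model)` for each of the two selected algorithms yields one token list and one embedding matrix per algorithm. For long inputs, wrapping the forward pass in `torch.no_grad()` would reduce memory use.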

🧠 Technical Overview

  • Model: microsoft/graphcodebert-base
  • Tokenizer: RobertaTokenizer
  • Embeddings: Last hidden layer of GraphCodeBERT
  • Reduction Technique: Principal Component Analysis (PCA)
  • Interface: Gradio
  • Language: Python 3.10+
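
As an illustration of the reduction step, the sketch below projects the token embeddings of two code snippets into a shared 2D space. It rests on two assumptions that the description above does not confirm: that a single PCA is fitted on the stacked tokens of both algorithms, and that the function name `pca_2d` is purely illustrative. PCA is computed here with a NumPy SVD to keep the example light; `sklearn.decomposition.PCA(n_components=2)` from the dependency list would give the same projection up to the sign of each axis.

```python
import numpy as np


def pca_2d(embeddings_a, embeddings_b):
    """Project two (n_tokens, dim) matrices into one shared 2D PCA space."""
    stacked = np.vstack([embeddings_a, embeddings_b])
    # PCA operates on mean-centered data.
    centered = stacked - stacked.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:2].T
    # Split the joint projection back into the two original token sets.
    n_a = len(embeddings_a)
    return projected[:n_a], projected[n_a:]
```

The two returned arrays have shape `(n_tokens, 2)` and can be passed to `matplotlib.pyplot.scatter` with different colors, so that overlapping and diverging tokens of the two algorithms appear in the same plot.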

🔬 Research Context

This tool supports research on code similarity, clone detection, and representation learning for source code. It offers insight into how GraphCodeBERT encodes common algorithmic patterns, providing a visual companion to embedding-based analysis.

🛠 Dependencies

All required libraries are listed in requirements.txt:


transformers
torch
scikit-learn
numpy
matplotlib
gradio
Pillow

🖥️ Intended Use

  • Academic teaching and demonstration of code embeddings
  • Qualitative evaluation of pretrained models for source code
  • Supplementary visualization for software engineering publications

📬 Contact

Jorge Martinez-Gil
Senior Research Scientist in Computer Science