Commit 3dfa2e3 (verified) by jorgemarcc · 1 parent: 311be36

Update README.md

Files changed (1)
  1. README.md +56 -0
README.md CHANGED
@@ -0,0 +1,56 @@
+ # Code Similarity Visualization with GraphCodeBERT
+
+ This interactive application visualizes token-level embeddings generated by [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for classical sorting algorithms. It supports pairwise comparison of algorithms based on their representation in the model’s embedding space, using PCA for dimensionality reduction.
+
+ ## ✒️ Reference
+
+ Martinez-Gil, J. (2025).
+ **Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks**.
+ *International Journal of Software Engineering and Knowledge Engineering*, 35(05), 657–678.
+
+ ## 🚀 Features
+
+ - Selection of two classical sorting algorithms.
+ - Automatic tokenization and embedding via GraphCodeBERT.
+ - PCA-based projection into 2D space for visualization.
+ - Clean, static matplotlib plots showing token overlap and divergence.
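
The PCA projection step listed above can be sketched as follows. This is a minimal illustration, not the application's actual code: the helper name `project_pair` is hypothetical, fitting a single PCA on the stacked embeddings of both algorithms is an assumption (it puts both token sets on the same axes), and random arrays stand in for real GraphCodeBERT embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA


def project_pair(emb_a, emb_b):
    """Project two (n_tokens, hidden_dim) arrays into one shared 2D PCA space.

    Hypothetical helper: fitting PCA on the stacked embeddings is an
    assumption, chosen so both algorithms are drawn on the same axes.
    """
    stacked = np.vstack([emb_a, emb_b])
    points = PCA(n_components=2).fit_transform(stacked)
    # Split the projected points back into the two original groups.
    return points[: len(emb_a)], points[len(emb_a):]


# Random stand-ins for real token embeddings (768 is the hidden size
# of graphcodebert-base); the token counts 12 and 9 are arbitrary.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(12, 768))
emb_b = rng.normal(size=(9, 768))

points_a, points_b = project_pair(emb_a, emb_b)
```

Each returned array holds one 2D point per token, so the two groups can be passed directly to a matplotlib scatter plot in different colors.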
+
+ ## 🧠 Technical Overview
+
+ - **Model**: [`microsoft/graphcodebert-base`](https://huggingface.co/microsoft/graphcodebert-base)
+ - **Tokenizer**: RobertaTokenizer
+ - **Embeddings**: Last hidden layer of GraphCodeBERT
+ - **Reduction Technique**: Principal Component Analysis (PCA)
+ - **Interface**: Gradio
+ - **Language**: Python 3.10+
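
The model, tokenizer, and embedding choices listed above could be combined roughly as in the sketch below. It is an illustration under stated assumptions rather than the application's actual code: the function name `embed_tokens` and the `max_length` default are hypothetical, and only the code tokens are fed to the model (no data-flow graph input is constructed).

```python
MODEL_NAME = "microsoft/graphcodebert-base"


def embed_tokens(code: str, max_length: int = 512):
    """Return (tokens, embeddings) for one code snippet.

    Hypothetical helper. The embeddings are the last hidden layer,
    one 768-dimensional vector per token.
    """
    # Imported lazily so this module can be imported without loading
    # the heavy dependencies until the function is actually called.
    import torch
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME)
    model = RobertaModel.from_pretrained(MODEL_NAME)
    model.eval()

    inputs = tokenizer(
        code, return_tensors="pt", truncation=True, max_length=max_length
    )
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    embeddings = outputs.last_hidden_state[0].numpy()
    return tokens, embeddings
```

Calling `embed_tokens` on the source of a sorting algorithm downloads the pretrained weights on first use. A real application would likely load the tokenizer and model once and reuse them across calls instead of reloading them each time.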
+
+ ## 🔬 Research Context
+
+ This tool supports research on code similarity, clone detection, and representation learning for source code. It offers insight into how GraphCodeBERT encodes common algorithmic patterns, providing a visual companion to embedding-based analysis.
+
+ ## 🛠 Dependencies
+
+ All required libraries are listed in `requirements.txt`:
+
+ ```
+ transformers
+ torch
+ scikit-learn
+ numpy
+ matplotlib
+ gradio
+ Pillow
+ ```
+
+ ## 🖥️ Intended Use
+
+ - Academic teaching and demonstration of code embeddings
+ - Qualitative evaluation of pretrained models for source code
+ - Supplementary visualization for software engineering publications
+
+ ## 📬 Contact
+
+ **Jorge Martinez-Gil**
+ Senior Research Scientist in Computer Science