jorgemarcc committed · Commit 64478e1 · verified · 1 Parent(s): cadb900

Create README.md

  1. README.md +62 -0
README.md ADDED
@@ -0,0 +1,62 @@
+ ---
+ title: Code Similarity Visualization with GraphCodeBERT
+ emoji: 🧠
+ colorFrom: gray
+ colorTo: blue
+ sdk: gradio
+ sdk_version: "4.30.0"
+ app_file: app.py
+ pinned: false
+ ---
+
+ # Code Similarity Visualization with GraphCodeBERT
+
+ This interactive application visualizes token-level embeddings generated by [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for classical sorting algorithms. It supports pairwise comparison of algorithms based on their representation in the model’s embedding space, using PCA for dimensionality reduction.
+
+ ## ✒️ Reference
+
+ Martinez-Gil, J. (2025).
+ **Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks**.
+ *International Journal of Software Engineering and Knowledge Engineering*, 35(05), 657–678.
+
+ ## 🚀 Features
+
+ - Select two classical sorting algorithms.
+ - Automatic tokenization and embedding via GraphCodeBERT.
+ - PCA-based projection into 2D space for visualization.
+ - Clear matplotlib plots showing token-level distribution differences.
+
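The PCA projection step listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: it computes PCA with a NumPy SVD rather than `sklearn.decomposition.PCA`, it assumes both snippets are projected with a single jointly fitted transform so that their points share axes, and the names `pca_2d` and `project_pair` as well as the random input arrays are hypothetical stand-ins.

```python
import numpy as np


def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project the rows of `embeddings` onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def project_pair(emb_a: np.ndarray, emb_b: np.ndarray):
    """Fit one projection on both snippets so their points share the same axes."""
    stacked = np.vstack([emb_a, emb_b])
    points = pca_2d(stacked)
    return points[: len(emb_a)], points[len(emb_a):]


# Synthetic stand-ins for token embeddings (tokens x hidden size); not model output.
# The width 768 is an assumption about the base model's hidden size.
rng = np.random.default_rng(0)
tokens_a = rng.normal(size=(12, 768))
tokens_b = rng.normal(loc=0.5, size=(9, 768))
points_a, points_b = project_pair(tokens_a, tokens_b)
```

Fitting one transform on the stacked matrix is a design choice made for this sketch: fitting each snippet separately would give two unrelated coordinate systems, which makes a side-by-side scatter plot hard to interpret.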
+ ## 🧠 Technical Overview
+
+ - **Model**: [`microsoft/graphcodebert-base`](https://huggingface.co/microsoft/graphcodebert-base)
+ - **Embedding Layer**: Last hidden state
+ - **Reduction**: Principal Component Analysis (PCA)
+ - **Interface**: Gradio
+ - **Language**: Python 3.10+
+
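As a rough illustration of the "last hidden state" embedding named above, the sketch below loads the model through the `transformers` Auto classes and returns one vector per token. It is an assumption about how such a step could look, not the code in `app.py`: the function name `token_embeddings` and the `max_length=512` truncation are hypothetical, and only the code tokens are passed to the model, without any data-flow graph input.

```python
def token_embeddings(code: str, model_name: str = "microsoft/graphcodebert-base"):
    """Return (tokens, embeddings) for one code snippet."""
    # Imported lazily so that defining this function does not need the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Truncation length is an assumption made for this sketch.
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, sequence_length, hidden_size); take the only batch item.
    embeddings = outputs.last_hidden_state[0].numpy()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, embeddings
```

Calling `token_embeddings("def f(xs): return sorted(xs)")` downloads the model weights on first use, so it needs network access and the dependencies listed in this README.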
+ ## 🛠 Dependencies
+
+ All required libraries are listed in `requirements.txt`:
+
+ ```
+ transformers
+ torch
+ scikit-learn
+ numpy
+ matplotlib
+ gradio
+ Pillow
+ ```
+
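Two of the packages above are imported under a different name than they are installed with (`scikit-learn` as `sklearn`, `Pillow` as `PIL`). The following standard-library sketch checks which of the listed packages are importable in the current environment; the helper name `missing_dependencies` is hypothetical and not part of the repository.

```python
from importlib.util import find_spec

# Distribution name (as listed in requirements.txt) -> importable module name.
IMPORT_NAMES = {
    "transformers": "transformers",
    "torch": "torch",
    "scikit-learn": "sklearn",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "gradio": "gradio",
    "Pillow": "PIL",
}


def missing_dependencies() -> list[str]:
    """Return the listed packages whose modules cannot be found in this environment."""
    return [pkg for pkg, module in IMPORT_NAMES.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    print("Missing:", ", ".join(missing) if missing else "none")
```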
+ ## 🖥️ Intended Use
+
+ - Academic teaching and demonstration of code embeddings
+ - Qualitative evaluation of pretrained models for source code
+ - Supplementary visualization for software engineering publications
+
+ ## 📬 Contact
+
+ **Jorge Martinez-Gil**
+ Senior Research Scientist in Computer Science