Judithvdw committed
Commit 5058b92 · verified · parent: d0dc5ec

Update README.md (#1)


- Update README.md (c1d9f6142bb432a1a14d37cfc273b9cbead57008)

Files changed (1)
  1. README.md +102 -1
README.md CHANGED

---
license: eupl-1.1
language: code
---

Model Card - ARM64BERT
----------

_Who to contact:_ fbda [at] nfi [dot] nl \
_Version / Date:_ v1, 15/05/2025 \
TODO: add link to github repo once known

## General
### What is the purpose of the model
The model is a semantic search BERT model of ARM64 assembly code that can be used to find ARM64 functions that are
similar to a given ARM64 function. This specific model has NOT been finetuned for semantic similarity; you most likely
want to use our [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert-embedding) instead. The main
purpose of this model is to be a baseline to compare the finetuned model against.

### What does the model architecture look like?
The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022). It is a BERT model
(Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction,
as proposed by Wang et al.

### What is the output of the model?
The model returns a vector of 768 dimensions for each function it is given. These vectors can be compared to
get an indication of which functions are similar to each other.

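For illustration, the sketch below shows how two such vectors could be obtained and compared with the Hugging Face
`transformers` library. The repository id and the use of the `[CLS]` vector as the function embedding are assumptions
made for this example, not documented behaviour of this model.

```python
# Minimal sketch: embed two ARM64 functions and compare them with cosine similarity.
# MODEL_ID is a hypothetical repository id; the [CLS] pooling below is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "NetherlandsForensicInstitute/ARM64bert"  # hypothetical id for this base model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(function_text: str) -> torch.Tensor:
    """Return a 768-dimensional vector for one ARM64 function."""
    inputs = tokenizer(function_text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token of the last layer

func_a = "stp x29, x30, [sp, #-16]! mov x29, sp bl foo ldp x29, x30, [sp], #16 ret"
func_b = "stp x29, x30, [sp, #-32]! mov x29, sp bl bar ldp x29, x30, [sp], #32 ret"

score = torch.nn.functional.cosine_similarity(embed(func_a), embed(func_b), dim=0)
print(f"cosine similarity: {score.item():.3f}")
```
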
### How does the model perform?
The model has been evaluated on [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) and
[Recall@1](https://en.wikipedia.org/wiki/Precision_and_recall).
When the model has to pick the positive example out of a pool of 32, it ranks it first most of the time. When
the pool is significantly enlarged to 10,000 functions, it still ranks the positive example highest in over half of
the cases.

| Model   | Pool size | MRR  | Recall@1 |
|---------|-----------|------|----------|
| ASMBert | 32        | 0.78 | 0.72     |
| ASMBert | 10,000    | 0.58 | 0.56     |

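For reference, a small sketch of how these two metrics can be computed for such a pool-based setup; it illustrates
the metrics themselves, not the exact evaluation code that produced the table above.

```python
# Sketch: MRR and Recall@1 for pool-based retrieval. For each query, the positive
# example sits at index 0 of its candidate pool; all other entries are negatives.
import numpy as np

def mrr_and_recall_at_1(query_vecs: np.ndarray, pool_vecs: np.ndarray) -> tuple[float, float]:
    """query_vecs: (n, d); pool_vecs: (n, pool_size, d) with the positive at index 0."""
    reciprocal_ranks, hits = [], []
    for q, pool in zip(query_vecs, pool_vecs):
        # Cosine similarity between the query and every candidate in its pool.
        sims = pool @ q / (np.linalg.norm(pool, axis=1) * np.linalg.norm(q) + 1e-12)
        rank = 1 + np.sum(sims > sims[0])  # 1-based rank of the positive example
        reciprocal_ranks.append(1.0 / rank)
        hits.append(1.0 if rank == 1 else 0.0)
    return float(np.mean(reciprocal_ranks)), float(np.mean(hits))
```
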
## Purpose and use of the model

### For which problem has the model been designed?
The model has been designed to act as a base model for ARM64 assembly code.

### What else could the model be used for?
The model can also be used to find similar ARM64 functions in a database of known ARM64 functions.

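One way such a lookup could be implemented is sketched below, assuming the embeddings of the known functions have
already been computed (for instance with a helper like `embed` above); the array layout is purely illustrative.

```python
# Illustrative sketch: rank a database of pre-computed 768-dimensional function
# embeddings against a query embedding and return the closest matches.
import numpy as np

def top_k_similar(query: np.ndarray, database: np.ndarray, names: list[str], k: int = 5):
    """database: (n, 768) matrix of known-function embeddings; names: their labels."""
    sims = database @ query / (np.linalg.norm(database, axis=1) * np.linalg.norm(query) + 1e-12)
    best = np.argsort(-sims)[:k]
    return [(names[i], float(sims[i])) for i in best]
```
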
### To what problems is the model not applicable?
Although the model performs reasonably well on the semantic search task, this model has NOT been finetuned on that task.
For a finetuned ARM64-BERT model, please refer to the [other
model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert-embedding) we have published.

## Data
### What data was used for training and evaluation?
The dataset is created in the same way as Wang et al. create BinaryCorp. A large set of binary code comes from the
[ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/).
All this code is split into functions that are compiled with different optimisation levels
(O0, O1, O2, O3 and Os) and security settings (fortify or no-fortify). This results in a maximum of 10 (5*2) different
functions which are semantically similar, i.e. they represent the same functionality but are written differently.
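
As a schematic illustration (not the actual build pipeline), the variants per source file could be produced roughly as
follows; the cross-compiler name and exact flag spellings are assumptions made for this sketch.

```python
# Schematic only: cross five optimisation levels with the two fortify settings,
# giving up to ten compiled variants of the same source. Compiler and flags are
# assumptions, not a description of how the dataset was actually built.
import itertools
import subprocess

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
FORTIFY_SETTINGS = [["-D_FORTIFY_SOURCE=2"], []]  # fortify / no-fortify

def build_variants(source_file: str) -> None:
    for opt, fortify in itertools.product(OPT_LEVELS, FORTIFY_SETTINGS):
        out = f"{source_file}{opt}{'.fortify' if fortify else ''}.o"
        cmd = ["aarch64-linux-gnu-gcc", "-c", source_file, opt, *fortify, "-o", out]
        subprocess.run(cmd, check=True)  # each object file contributes one set of ARM64 functions
```
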
The dataset is split into a train and a test set. This is done at the project level, so all binaries and functions
belonging to one project are part of either the train or the test set, not both. We have not performed any
deduplication on the dataset for training.

| set   | # functions |
|-------|------------:|
| train |  18,083,285 |
| test  |   3,375,741 |

### By whom was the dataset collected and annotated?
The dataset was collected by our team. The annotation of similar/non-similar functions comes from the different
compilation settings, i.e. what we consider "similar functions" is in fact the same function that has been compiled in
a different way.

### Any remarks on data quality and bias?
The way we classify functions as similar may have implications. For example, sometimes two different ways of compiling
the same function do not result in a different piece of code. We did not remove duplicates from the data during
training, but we did implement checks in the evaluation stage, and it seems that the model has not suffered from these
easy training examples.

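As an illustration of the kind of check meant here (the exact procedure is not described in this card), identical
variants could be detected by hashing the normalised disassembly of each function:

```python
# Illustrative only: group variants whose (whitespace-normalised) disassembly is
# identical, so they can be excluded from evaluation pools as trivial positives.
from collections import defaultdict
from hashlib import sha256

def find_duplicate_variants(functions: dict[str, str]) -> list[list[str]]:
    """functions: mapping from variant name (e.g. 'foo@O2') to its disassembly text."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for name, disassembly in functions.items():
        digest = sha256(" ".join(disassembly.split()).encode()).hexdigest()
        buckets[digest].append(name)
    return [group for group in buckets.values() if len(group) > 1]
```
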
After training this base model, we found out that something had gone wrong when creating our dataset: the last
instruction of the previous function was included in the next one. Due to the long training process, and the
good performance of the model despite this mistake, we have decided not to retrain the model.

## Fairness Metrics

### Which metrics have been used to measure bias in the data/model and why?
n.a.

### What do those metrics show?
n.a.

### Any other notable issues?
n.a.

## Analyses (optional)
n.a.