Update README.md (#1)
README.md

---
license: eupl-1.1
language: code
---

Model Card - ARM64BERT
----------

_Who to contact:_ fbda [at] nfi [dot] nl \
_Version / Date:_ v1, 15/05/2025 \
TODO: add link to github repo once known

## General
### What is the purpose of the model
The model is a semantic search BERT model for ARM64 assembly code that can be used to find ARM64 functions similar to a
given ARM64 function. This specific model has NOT been finetuned for semantic similarity; you most likely want
to use our [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert-embedding) instead. The main purpose
of this model is to be a baseline to compare the finetuned model against.

### What does the model architecture look like?
The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022). It is a BERT model
(Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction,
as proposed in Wang et al.

### What is the output of the model?
The model returns a vector of 768 dimensions for each function that it's given. These vectors can be compared to
get an indication of which functions are similar to each other.

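As an illustration, the snippet below sketches how two functions could be embedded and compared, assuming the model loads
with the standard `transformers` `AutoTokenizer`/`AutoModel` classes. The repository id, the way the assembly text is fed
to the tokenizer and the use of the `[CLS]` position as the function embedding are assumptions for this example, not a
description of our exact pipeline.

```python
# Minimal sketch only: the repository id, the input format and the [CLS]
# pooling are assumptions for illustration, not our exact pipeline.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "NetherlandsForensicInstitute/ARM64bert"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

def embed(function_text: str) -> torch.Tensor:
    """Return a 768-dimensional embedding for one ARM64 function."""
    inputs = tokenizer(function_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Assumption: take the [CLS] position of the last hidden state as the embedding.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

emb_a = embed("mov x0, #1\nret")
emb_b = embed("mov x0, #1\nret")
print(float(torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)))
```
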
### How does the model perform?
The model has been evaluated on [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) and
[Recall@1](https://en.wikipedia.org/wiki/Precision_and_recall).
When the model has to pick the positive example out of a pool of 32, it ranks it first in most cases. When
the pool is significantly enlarged to 10,000 functions, it still ranks the positive example highest more often than not.

| Model   | Pool size | MRR  | Recall@1 |
|---------|-----------|------|----------|
| ASMBert | 32        | 0.78 | 0.72     |
| ASMBert | 10,000    | 0.58 | 0.56     |

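To make the two metrics concrete, the sketch below (illustrative only, not our evaluation code) derives MRR and Recall@1
from the rank that the known positive example obtains within each pool of candidates:

```python
# Illustrative only: shows how MRR and Recall@1 follow from the rank of the
# positive example in a candidate pool; not our actual evaluation pipeline.
import numpy as np

def mrr_and_recall_at_1(query_embs: np.ndarray, pool_embs: np.ndarray):
    """query_embs[i] and pool_embs[i] embed the same source function; all other
    rows of pool_embs act as negatives for query i."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = q @ p.T                                        # cosine similarities
    ranks = (sims > np.diag(sims)[:, None]).sum(1) + 1    # 1-based rank of the positive
    return float(np.mean(1.0 / ranks)), float(np.mean(ranks == 1))

# Toy example with a pool of 32 random 768-dimensional embeddings.
rng = np.random.default_rng(0)
mrr, recall_at_1 = mrr_and_recall_at_1(rng.normal(size=(32, 768)), rng.normal(size=(32, 768)))
print(mrr, recall_at_1)
```
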
## Purpose and use of the model

### For which problem has the model been designed?
The model has been designed to act as a base model for the ARM64 language.

### What else could the model be used for?
The model can also be used to find similar ARM64 functions in a database of known ARM64 functions.

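For illustration, such a lookup amounts to a nearest-neighbour search over precomputed function embeddings. The sketch
below uses plain numpy and random data as a stand-in; for a large database a dedicated vector index (e.g. FAISS) would be
the more practical choice:

```python
# Illustration only: brute-force nearest-neighbour lookup over a database of
# precomputed 768-dimensional function embeddings.
import numpy as np

def top_k_similar(query_emb: np.ndarray, database_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k database functions most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    db = database_embs / np.linalg.norm(database_embs, axis=1, keepdims=True)
    scores = db @ q                    # cosine similarity with every database entry
    return np.argsort(-scores)[:k]     # indices of the k highest scores

database = np.random.default_rng(1).normal(size=(10_000, 768))  # stand-in embeddings
query = database[42] + 0.01                                     # near-copy of entry 42
print(top_k_similar(query, database))                           # entry 42 should rank first
```
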
### To what problems is the model not applicable?
Although the model performs reasonably well on the semantic search task, this model has NOT been finetuned on that task.
For a finetuned ARM64-BERT model, please refer to the [other
model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert-embedding) we have published.

## Data
### What data was used for training and evaluation?
The dataset is created in the same way as Wang et al. create BinaryCorp. A large set of binary code comes from the
[ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/).
All this code is split into functions that are compiled with different optimisation levels
(O0, O1, O2, O3 and Os) and security settings (fortify or no-fortify). This results
in a maximum of 10 (5*2) different functions which are semantically similar, i.e. they represent the same functionality but are written differently.
The dataset is split into a train and a test set. This is done at project level, so all binaries and functions belonging to one project are part of
either the train or the test set, not both (a sketch of such a project-level split follows the table below). We have not performed any deduplication on the dataset for training.

| set   | # functions |
|-------|------------:|
| train |  18,083,285 |
| test  |   3,375,741 |

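The following is a minimal sketch of such a project-level split, assuming each function record carries the name of the
package/project it was built from; the field names and the split fraction are hypothetical:

```python
# Sketch of a project-level train/test split: whole projects are assigned to one
# side so that no project contributes functions to both. Field names are hypothetical.
import random

def split_by_project(functions: list, test_fraction: float = 0.15, seed: int = 0):
    projects = sorted({f["project"] for f in functions})
    rng = random.Random(seed)
    rng.shuffle(projects)
    test_projects = set(projects[: int(len(projects) * test_fraction)])
    train = [f for f in functions if f["project"] not in test_projects]
    test = [f for f in functions if f["project"] in test_projects]
    return train, test

functions = [
    {"project": "coreutils", "name": "main", "code": "..."},
    {"project": "openssl", "name": "sha256_update", "code": "..."},
]
train_set, test_set = split_by_project(functions, test_fraction=0.5)
```
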
### By whom was the dataset collected and annotated?
The dataset was collected by our team. The annotation of similar/non-similar functions comes from the different compilation
levels, i.e. what we consider "similar functions" is in fact the same function that has been compiled in a different way.

### Any remarks on data quality and bias?
The way we classify functions as similar may have implications. For example, two different ways of compiling
the same function sometimes do not result in a different piece of code. We did not remove duplicates from the data during training,
but we did implement checks in the evaluation stage, and it seems that the model has not suffered from these simple training
examples.

After training this base model, we found out that something had gone wrong when compiling our dataset: the last
instruction of the previous function was included in the next one. Due to the long training process, and the
good performance of the model despite this mistake, we decided not to retrain the model.

## Fairness Metrics

### Which metrics have been used to measure bias in the data/model and why?
n.a.

### What do those metrics show?
n.a.

### Any other notable issues?
n.a.

## Analyses (optional)
n.a.