# ARM64BERT-embedding 🦾

## General

### What is the purpose of the model?
The model is a BERT model of ARM64 assembly code that can be used to find similar ARM64 functions to a given ARM64 function.
This task is known as _binary code similarity detection_, which is similar to the _sentence similarity_ task in natural language processing.

### What does the model architecture look like?
The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022).
It is a BERT model (Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed by Wang et al.
This architecture has subsequently been finetuned for semantic search purposes. We have followed the procedure proposed by [S-BERT](https://www.sbert.net/examples/applications/semantic-search/README.html).
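
As an illustration of that procedure, a minimal S-BERT-style finetuning sketch is shown below. This is not our training code; the toy pairs, the batch size, and the choice of `MultipleNegativesRankingLoss` (a common loss for semantic-search finetuning) are assumptions.

```python
# A minimal S-BERT-style finetuning sketch; the pairs, batch size and loss
# are illustrative assumptions, not our actual training configuration.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Positive pairs: the same function compiled with two different settings
# (toy stand-ins for real disassembled ARM64 functions).
pairs = [("mov w0, #0\nret", "mov w0, wzr\nret")]

model = SentenceTransformer("NetherlandsForensicInstitute/ARM64BERT")  # base model
examples = [InputExample(texts=[a, b]) for a, b in pairs]
loader = DataLoader(examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```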

### What is the output of the model?
The model returns an embedding vector of 768 dimensions for each function it is given. These embeddings can be compared to
get an indication of which functions are similar to each other.
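
For example, embeddings can be computed and compared with the `sentence-transformers` library. In this sketch the two assembly snippets are toy stand-ins for real disassembled ARM64 functions, and the model id (following the naming of the base model linked below) is an assumption.

```python
# Usage sketch; the snippets are stand-ins for real disassembled functions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NetherlandsForensicInstitute/ARM64BERT-embedding")

func_a = "stp x29, x30, [sp, #-16]!\nbl strlen\nldp x29, x30, [sp], #16\nret"
func_b = "stp x29, x30, [sp, #-32]!\nbl strlen\nldp x29, x30, [sp], #32\nret"

embeddings = model.encode([func_a, func_b])        # array of shape (2, 768)
print(util.cos_sim(embeddings[0], embeddings[1]))  # higher = more similar
```

For a whole database of known functions, `util.semantic_search` performs the same comparison of a query embedding against a corpus of embeddings.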

### How does the model perform?

### To what problems is the model applicable?
The model has been designed to find similar ARM64 functions in a database of known functions.
We do not see other applications for this model.

### To what problems is the model not applicable?
This model has been finetuned on the semantic search task.
For the base ARM64BERT model, please refer to the [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64BERT) we have published.

## Data

### What data was used for training and evaluation?
The dataset is created in the same way as Wang et al. created BinaryCorp.
A large set of source code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/packages/).
All this code is split into functions that are compiled into binary code with different optimization levels
(`O0`, `O1`, `O2`, `O3` and `Os`) and security settings (fortify or no-fortify).
This results in a maximum of 10 (5 × 2) different functions that are semantically similar, i.e. they represent the same functionality but have different machine code.
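
To make the 5 × 2 combination concrete, the sketch below compiles one toy function under every optimization level and fortify setting and disassembles the results. This is only an illustration, not our dataset pipeline; the cross-toolchain names and the `-D_FORTIFY_SOURCE=2` flag are assumptions.

```python
# Illustration only, not the dataset pipeline: build one toy C function with
# every optimization level x fortify combination (10 variants in total).
import pathlib
import subprocess

pathlib.Path("add.c").write_text("int add(int a, int b) { return a + b; }\n")

for level in ["O0", "O1", "O2", "O3", "Os"]:
    for fortify in (False, True):
        obj = f"add-{level}{'-fortify' if fortify else ''}.o"
        flags = ["-D_FORTIFY_SOURCE=2"] if fortify else []  # assumed fortify flag
        subprocess.run(
            ["aarch64-linux-gnu-gcc", f"-{level}", *flags, "-c", "add.c", "-o", obj],
            check=True,
        )
        # Each object holds a semantically identical, differently compiled
        # version of `add`; objdump -d recovers its ARM64 assembly.
        asm = subprocess.run(
            ["aarch64-linux-gnu-objdump", "-d", obj],
            capture_output=True, text=True, check=True,
        ).stdout
```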

The dataset is split into a train and a test set. This is done at the project level, so all binaries and functions belonging to one project are part of
either the train or the test set, not both. We have not performed any deduplication.

| split | number of functions |
|-------|--------------------:|
| train | 18,083,285 |
| test | 3,375,741 |

For our training and evaluation code, see our [GitHub repository](https://github.com/NetherlandsForensicInstitute/asmtransformers).

### By whom was the dataset collected and annotated?
The dataset was collected by our team.