Fix small errors in the wording of the dataset section
README.md (CHANGED)
@@ -5,6 +5,7 @@ base_model:
 - NetherlandsForensicInstitute/ARM64BERT
 library_name: sentence-transformers
 ---
+
 Model Card
 ==========
 
@@ -13,13 +14,11 @@ TODO: add link to github repo
 
 ## General
 ### What is the purpose of the model
-The model is a BERT model of ARM64 assembly code that can be used to find similar ARM64 functions to a
-given ARM64 function.
+The model is a BERT model of ARM64 assembly code that can be used to find similar ARM64 functions to a given ARM64 function.
 
 ### What does the model architecture look like?
 The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022).
-(Devlin et al. 2019)
-although the typical Next Sentence Prediction has been replaced with Jump Target Prediction, as proposed in Wang et al.
+It is a BERT model (Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed in Wang et al.
 This architecture has subsequently been finetuned for semantic search purposes. We have followed the procedure proposed by [S-BERT](https://www.sbert.net/examples/applications/semantic-search/README.html).
 
 ### What is the output of the model?
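Since the front matter names `sentence-transformers` as the library and the card describes a semantic-search use, a minimal sketch of how the finetuned model could be queried is shown below. The model id follows the `base_model` entry above, and the whitespace-separated instruction format is an assumption for illustration, not taken from the card:

```python
from sentence_transformers import SentenceTransformer, util

# Model id taken from the card's base_model entry; loading it like this
# assumes the sentence-transformers weights are published under that id.
model = SentenceTransformer("NetherlandsForensicInstitute/ARM64BERT")

# Two ARM64 functions as whitespace-separated instruction strings
# (the exact input format expected by the model is an assumption here).
func_a = "stp x29, x30, [sp, #-16]! mov x29, sp bl printf ldp x29, x30, [sp], #16 ret"
func_b = "stp x29, x30, [sp, #-32]! mov x29, sp bl puts ldp x29, x30, [sp], #32 ret"

embeddings = model.encode([func_a, func_b], convert_to_tensor=True)

# Cosine similarity close to 1 indicates semantically similar functions.
print(util.cos_sim(embeddings[0], embeddings[1]))
```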
@@ -51,11 +50,11 @@ model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert) we have pu
 
 ## Data
 ### What data was used for training and evaluation?
-The dataset is created in the same way as Wang et al.
-[ArchLinux official repositories](https://
-All this code is split into functions that are compiled with different
-(O0
-in a maximum of 10 (5
+The dataset is created in the same way as Wang et al. created Binary Corp.
+A large set of binary code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/).
+All this code is split into functions that are compiled with different optimization levels
+(`O0`, `O1`, `O2`, `O3` and `Os`) and security settings (fortify or no-fortify).
+This results in a maximum of 10 (5×2) different functions which are semantically similar, i.e., they represent the same functionality, but have different machine code.
 The dataset is split into a train and a test set. This is done on project level, so all binaries and functions belonging to one project are part of
 either the train or the test set, not both. We have not performed any deduplication on the dataset for training.
 
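As an illustration of the 5×2 compile matrix this hunk describes, a minimal sketch is given below. The cross-compiler name, the source file, and the exact flag spellings are assumptions for illustration, not taken from the card:

```python
import itertools
import subprocess

# The five optimization levels and two security settings named above.
opt_levels = ["O0", "O1", "O2", "O3", "Os"]
fortify_settings = {
    "fortify": ["-D_FORTIFY_SOURCE=2"],
    "no-fortify": ["-U_FORTIFY_SOURCE"],
}

# Compile one source file ten ways; each object file then holds
# semantically equivalent but differently optimized machine code
# for the same functions.
for opt, (name, flags) in itertools.product(opt_levels, fortify_settings.items()):
    subprocess.run(
        ["aarch64-linux-gnu-gcc", f"-{opt}", *flags,
         "-c", "example.c", "-o", f"example_{opt}_{name}.o"],
        check=True,
    )
```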
@@ -68,9 +67,10 @@ either the train or the test set, not both. We have not performed any deduplicat
 The dataset was collected by our team.
 
 ### Any remarks on data quality and bias?
 After training our models, we found out that something had gone wrong when compiling our dataset.
-the
-
+Consequently, the first line of the next function was included in the previous one.
+This has been fixed for the finetuning, but due to the long training process
+and the good performance of the model despite the mistake, we have decided not to retrain the base model.
 
 ## Fairness Metrics
 
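The project-level train/test split described in the Data section above can be sketched as follows; the helper name and the `(project, function)` record layout are hypothetical illustrations, not taken from the repository:

```python
import random

def project_level_split(records, test_fraction=0.2, seed=0):
    """Split (project, function) records so that no project contributes
    to both sets, mirroring the project-level split described in the card.
    The record layout is a hypothetical illustration."""
    projects = sorted({project for project, _ in records})
    random.Random(seed).shuffle(projects)
    test_projects = set(projects[: int(len(projects) * test_fraction)])
    train = [r for r in records if r[0] not in test_projects]
    test = [r for r in records if r[0] in test_projects]
    return train, test
```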