akaIDIOT committed
Commit dd4cd84 · verified · 1 Parent(s): 9ef540d

Fix small errors in wording of the dataset

Files changed (1)
  1. README.md +13 -13
README.md CHANGED
@@ -5,6 +5,7 @@ base_model:
 - NetherlandsForensicInstitute/ARM64BERT
 library_name: sentence-transformers
 ---
+
 Model Card
 ==========
 
@@ -13,13 +14,11 @@ TODO: add link to github repo
 
 ## General
 ### What is the purpose of the model
-The model is a BERT model of ARM64 assembly code that can be used to find similar ARM64 functions to a
-given ARM64 function.
+The model is a BERT model of ARM64 assembly code that can be used to find similar ARM64 functions to a given ARM64 function.
 
 ### What does the model architecture look like?
-The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022). It is a BERT model
-(Devlin et al. 2019)
-although the typical Next Sentence Prediction has been replaced with Jump Target Prediction, as proposed in Wang et al.
+The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022).
+It is a BERT model (Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed in Wang et al.
 This architecture has subsequently been finetuned for semantic search purposes. We have followed the procedure proposed by [S-BERT](https://www.sbert.net/examples/applications/semantic-search/README.html).
 
 ### What is the output of the model?
@@ -51,11 +50,11 @@ model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert) we have pu
 
 ## Data
 ### What data was used for training and evaluation?
-The dataset is created in the same way as Wang et al. create Binary Corp. A large set of binary code comes from the
-[ArchLinux official repositories](https://aur.archlinux.org/) and the [ArchLinux user repositories](https://archlinux.org/packages/).
-All this code is split into functions that are compiled with different optimalization
-(O0, O1, O2, O3 and O3) and security settings (fortify or no-fortify). This results
-in a maximum of 10 (5*2) different functions which are semantically similar i.e. they represent the same functionality but are written differently.
+The dataset was created in the same way as Wang et al. created Binary Corp.
+A large set of binary code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/).
+All this code is split into functions that are compiled with different optimization levels
+(`O0`, `O1`, `O2`, `O3` and `Os`) and security settings (fortify or no-fortify).
+This results in a maximum of 10 (5×2) different functions which are semantically similar, i.e. they represent the same functionality, but have different machine code.
 The dataset is split into a train and a test set. This is done on project level, so all binaries and functions belonging to one project are part of
 either the train or the test set, not both. We have not performed any deduplication on the dataset for training.
 
@@ -68,9 +67,10 @@ either the train or the test set, not both. We have not performed any deduplicat
 The dataset was collected by our team.
 
 ### Any remarks on data quality and bias?
-After training our models, we found out that something had gone wrong when compiling our dataset. Consequently,
-the last line (instruction) of the previous function was included in the next. This has been fixed for the finetuning, but due to the long training process, and the
-good performance of the model despite the mistake, we have decided not to retrain the base model.
+After training our models, we found out that something had gone wrong when compiling our dataset.
+Consequently, the first instruction of the next function was included in the previous one.
+This has been fixed for the finetuning, but due to the long training process
+and the good performance of the model despite the mistake, we have decided not to retrain the base model.
 
 ## Fairness Metrics
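
For reference, a minimal sketch of the semantic-search use the card describes, following the S-BERT recipe it links to. The checkpoint name and the whitespace-separated disassembly format are assumptions, not details taken from the card:

```python
# Sketch: find ARM64 functions similar to a query function (assumed model id
# and instruction formatting; adjust both to the actual finetuned checkpoint).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NetherlandsForensicInstitute/ARM64BERT")  # assumed id

# each "sentence" is one disassembled ARM64 function
corpus = [
    "stp x29 , x30 , [ sp , #-16 ]! mov x29 , sp bl JUMP_ADDR ldp x29 , x30 , [ sp ] , #16 ret",
    "sub sp , sp , #32 str w0 , [ sp , #12 ] ldr w0 , [ sp , #12 ] add sp , sp , #32 ret",
]
query = "stp x29 , x30 , [ sp , #-16 ]! mov x29 , sp bl JUMP_ADDR ldp x29 , x30 , [ sp ] , #16 ret"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# rank corpus functions by cosine similarity to the query function
for hit in util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]:
    print(hit["score"], corpus[hit["corpus_id"]])
```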
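The 5×2 build matrix behind the "maximum of 10" figure can be made concrete with a sketch; the cross-compiler name and the exact fortify flags are assumptions, since the card only names the optimization levels and the fortify/no-fortify setting:

```python
# Sketch: compile one source file under every optimization/fortify combination,
# yielding up to ten semantically similar variants of each function.
import itertools
import subprocess

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
FORTIFY = {"fortify": "-D_FORTIFY_SOURCE=2", "no-fortify": "-U_FORTIFY_SOURCE"}

for opt, (name, flag) in itertools.product(OPT_LEVELS, FORTIFY.items()):
    out = f"func.{opt}.{name}.o"  # e.g. func.-O2.fortify.o
    subprocess.run(
        ["aarch64-linux-gnu-gcc", opt, flag, "-c", "func.c", "-o", out],
        check=True,
    )
```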
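Finally, a sketch of the project-level split: the train/test assignment depends only on the project a function came from, so no project ends up in both sets. The hash-based rule, the field names, and the 10% test fraction are illustrative assumptions:

```python
# Sketch: deterministic project-level train/test split.
import hashlib

def bucket(project: str, test_fraction: float = 0.1) -> str:
    """Map a project name to 'train' or 'test'; same project, same bucket."""
    first_byte = hashlib.sha256(project.encode("utf-8")).digest()[0]
    return "test" if first_byte / 256.0 < test_fraction else "train"

functions = [
    {"project": "coreutils", "name": "main", "asm": "..."},
    {"project": "openssl", "name": "sha256_block_data_order", "asm": "..."},
]

train = [f for f in functions if bucket(f["project"]) == "train"]
test = [f for f in functions if bucket(f["project"]) == "test"]
```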