Copy textual updates from embedding to base
README.md
CHANGED
@@ -6,9 +6,7 @@ language: code
 Model Card - ARM64BERT
 ----------
 
-
-_Who to contact:_ fbda [at] nfi [dot] nl \
-TODO: add link to github repo once known
+[GitHub repository](https://github.com/NetherlandsForensicInstitute/asmtransformers)
 
 ## General
 ### What is the purpose of the model
@@ -33,14 +31,11 @@ The model was then evaluated on [Mean Reciprocal Rank (MRR)](https://en.wikipedi
 When the model has to pick the positive example out of a pool of 32, it almost always ranks it first. When
 the pool is significantly enlarged to 10.000 functions, it still ranks the positive example highest most of the time.
 
-
 | Model   | Pool size | MRR  | Recall@1 |
 |---------|-----------|------|----------|
 | ASMBert | 32        | 0.78 | 0.72     |
 | ASMBert | 10.000    | 0.58 | 0.56     |
 
-
-
 ## Purpose and use of the model
 
 ### For which problem has the model been designed?
@@ -51,20 +46,17 @@ The model can also be used to find similar ARM64 functions in a database of know
 
 ### To what problems is the model not applicable?
 Although the model performs reasonably well on the semantic search task, this model has NOT been finetuned on that task.
-For a finetuned
-model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert-embedding) we have published.
-
+For a finetuned ARM64BERT model, please refer to the [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert-embedding) published alongside this one.
 
 ## Data
 ### What data was used for training and evaluation?
-The dataset is created in the same way as Wang et al.
-[ArchLinux official repositories](https://
-All this code is split into functions that are compiled with different
-(O0
-in a maximum of 10 (5
-The dataset is split into a train and a test set. This
-either the train or the test set, not both. We have not performed any deduplication on the dataset for training.
-
+The dataset is created in the same way as Wang et al. created Binary Corp.
+A large set of binary code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/packages/).
+All this code is split into functions that are compiled with different optimizations
+(`O0`, `O1`, `O2`, `O3` and `Os`) and security settings (fortify or no-fortify).
+This results in a maximum of 10 (5×2) different functions which are semantically similar, i.e. they represent the same functionality, but have different machine code.
+The dataset is split into a train and a test set. This is done on project level, so all binaries and functions belonging to one project are part of
+either the train or the test set, not both. We have not performed any deduplication on the dataset for training.
 
 | set   | # functions |
 |-------|------------:|
@@ -85,19 +77,3 @@ examples.
 After training this base model, we found out that something had gone wrong when compiling our dataset. Consequently,
 the last instruction of the previous function was included in the next. Due to the long training process, and the
 good performance of the model despite the mistake, we have decided not to retrain our model.
-
-
-
-## Fairness Metrics
-
-### Which metrics have been used to measure bias in the data/model and why?
-n.a.
-
-### What do those metrics show?
-n.a.
-
-### Any other notable issues?
-n.a.
-
-## Analyses (optional)
-n.a.
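The MRR and Recall@1 figures quoted in the card can be illustrated with a minimal sketch. The helper below is not code from the repository, and the example ranks are made up; it only shows how the two metrics are defined, given the 1-based rank of the positive example in each candidate pool.

```python
def mrr_and_recall_at_1(ranks):
    """Compute Mean Reciprocal Rank and Recall@1.

    `ranks` holds, for each query, the 1-based rank of the positive
    example among the candidate pool (hypothetical inputs, not from
    the released evaluation code).
    """
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    recall_at_1 = sum(1 for r in ranks if r == 1) / len(ranks)
    return mrr, recall_at_1


# Made-up example: positive ranked 1st, 1st, 2nd, 4th in four pools.
mrr, r1 = mrr_and_recall_at_1([1, 1, 2, 4])
# mrr = (1 + 1 + 0.5 + 0.25) / 4 = 0.6875, r1 = 2/4 = 0.5
```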