Judithvdw committed
Commit 5058b92 · verified · parent: d0dc5ec

Update README.md (#1)


- Update README.md (c1d9f6142bb432a1a14d37cfc273b9cbead57008)

Files changed (1)
  1. README.md +102 -1
README.md CHANGED

---
license: eupl-1.1
language: code
---

Model Card - ARM64BERT
----------

_Who to contact:_ fbda [at] nfi [dot] nl \
_Version / Date:_ v1, 15/05/2025 \
TODO: add link to github repo once known

## General
### What is the purpose of the model
The model is a semantic search BERT model of ARM64 assembly code that can be used to find ARM64 functions that are
similar to a given ARM64 function. This specific model has NOT been finetuned for semantic similarity; you most likely
want to use our [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert-embedding) instead. The main
purpose of this model is to be a baseline to compare the finetuned model against.

### What does the model architecture look like?
The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022). It is a BERT model
(Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction,
as proposed by Wang et al.

### What is the output of the model?
The model returns a vector of 768 dimensions for each function it is given. These vectors can be compared to
get an indication of which functions are similar to each other.

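For illustration, the sketch below shows how two such vectors could be obtained and compared with the Hugging Face
`transformers` library. The repository id and the use of the `[CLS]` vector as the function embedding are assumptions
made for this example, not documented behaviour of this model.

```python
# Minimal sketch: embed two ARM64 functions and compare them with cosine similarity.
# MODEL_ID is a hypothetical repository id; the [CLS] pooling below is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "NetherlandsForensicInstitute/ARM64bert"  # hypothetical id for this base model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(function_text: str) -> torch.Tensor:
    """Return a 768-dimensional vector for one ARM64 function."""
    inputs = tokenizer(function_text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token of the last layer

func_a = "stp x29, x30, [sp, #-16]! mov x29, sp bl foo ldp x29, x30, [sp], #16 ret"
func_b = "stp x29, x30, [sp, #-32]! mov x29, sp bl bar ldp x29, x30, [sp], #32 ret"

score = torch.nn.functional.cosine_similarity(embed(func_a), embed(func_b), dim=0)
print(f"cosine similarity: {score.item():.3f}")
```
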
### How does the model perform?
The model has been evaluated on [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) and
[Recall@1](https://en.wikipedia.org/wiki/Precision_and_recall).
When the model has to pick the positive example out of a pool of 32, it ranks it first most of the time. When
the pool is significantly enlarged to 10,000 functions, it still ranks the positive example highest in over half of
the cases.

| Model   | Pool size | MRR  | Recall@1 |
|---------|-----------|------|----------|
| ASMBert | 32        | 0.78 | 0.72     |
| ASMBert | 10,000    | 0.58 | 0.56     |

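For reference, a small sketch of how these two metrics can be computed for such a pool-based setup; it illustrates
the metrics themselves, not the exact evaluation code that produced the table above.

```python
# Sketch: MRR and Recall@1 for pool-based retrieval. For each query, the positive
# example sits at index 0 of its candidate pool; all other entries are negatives.
import numpy as np

def mrr_and_recall_at_1(query_vecs: np.ndarray, pool_vecs: np.ndarray) -> tuple[float, float]:
    """query_vecs: (n, d); pool_vecs: (n, pool_size, d) with the positive at index 0."""
    reciprocal_ranks, hits = [], []
    for q, pool in zip(query_vecs, pool_vecs):
        # Cosine similarity between the query and every candidate in its pool.
        sims = pool @ q / (np.linalg.norm(pool, axis=1) * np.linalg.norm(q) + 1e-12)
        rank = 1 + np.sum(sims > sims[0])  # 1-based rank of the positive example
        reciprocal_ranks.append(1.0 / rank)
        hits.append(1.0 if rank == 1 else 0.0)
    return float(np.mean(reciprocal_ranks)), float(np.mean(hits))
```
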
## Purpose and use of the model

### For which problem has the model been designed?
The model has been designed to act as a base model for ARM64 assembly code.

### What else could the model be used for?
The model can also be used to find similar ARM64 functions in a database of known ARM64 functions.

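One way such a lookup could be implemented is sketched below, assuming the embeddings of the known functions have
already been computed (for instance with a helper like `embed` above); the array layout is purely illustrative.

```python
# Illustrative sketch: rank a database of pre-computed 768-dimensional function
# embeddings against a query embedding and return the closest matches.
import numpy as np

def top_k_similar(query: np.ndarray, database: np.ndarray, names: list[str], k: int = 5):
    """database: (n, 768) matrix of known-function embeddings; names: their labels."""
    sims = database @ query / (np.linalg.norm(database, axis=1) * np.linalg.norm(query) + 1e-12)
    best = np.argsort(-sims)[:k]
    return [(names[i], float(sims[i])) for i in best]
```
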
### To what problems is the model not applicable?
Although the model performs reasonably well on the semantic search task, this model has NOT been finetuned on that task.
For a finetuned ARM64-BERT model, please refer to the [other
model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert-embedding) we have published.

## Data
### What data was used for training and evaluation?
The dataset is created in the same way as Wang et al. create BinaryCorp. A large set of binary code comes from the
[ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/).
All this code is split into functions that are compiled with different optimisation levels
(O0, O1, O2, O3 and Os) and security settings (fortify or no-fortify). This results in a maximum of 10 (5*2) different
functions which are semantically similar, i.e. they represent the same functionality but are written differently.
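
As a schematic illustration (not the actual build pipeline), the variants per source file could be produced roughly as
follows; the cross-compiler name and exact flag spellings are assumptions made for this sketch.

```python
# Schematic only: cross five optimisation levels with the two fortify settings,
# giving up to ten compiled variants of the same source. Compiler and flags are
# assumptions, not a description of how the dataset was actually built.
import itertools
import subprocess

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
FORTIFY_SETTINGS = [["-D_FORTIFY_SOURCE=2"], []]  # fortify / no-fortify

def build_variants(source_file: str) -> None:
    for opt, fortify in itertools.product(OPT_LEVELS, FORTIFY_SETTINGS):
        out = f"{source_file}{opt}{'.fortify' if fortify else ''}.o"
        cmd = ["aarch64-linux-gnu-gcc", "-c", source_file, opt, *fortify, "-o", out]
        subprocess.run(cmd, check=True)  # each object file contributes one set of ARM64 functions
```
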
The dataset is split into a train and a test set. This is done at the project level, so all binaries and functions
belonging to one project are part of either the train or the test set, not both. We have not performed any
deduplication on the dataset for training.

| set   | # functions |
|-------|------------:|
| train |  18,083,285 |
| test  |   3,375,741 |

### By whom was the dataset collected and annotated?
The dataset was collected by our team. The annotation of similar/non-similar functions comes from the different
compilation settings, i.e. what we consider "similar functions" is in fact the same function that has been compiled in
a different way.

### Any remarks on data quality and bias?
The way we classify functions as similar may have implications. For example, sometimes two different ways of compiling
the same function do not result in a different piece of code. We did not remove duplicates from the data during
training, but we did implement checks in the evaluation stage, and it seems that the model has not suffered from these
easy training examples.

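As an illustration of the kind of check meant here (the exact procedure is not described in this card), identical
variants could be detected by hashing the normalised disassembly of each function:

```python
# Illustrative only: group variants whose (whitespace-normalised) disassembly is
# identical, so they can be excluded from evaluation pools as trivial positives.
from collections import defaultdict
from hashlib import sha256

def find_duplicate_variants(functions: dict[str, str]) -> list[list[str]]:
    """functions: mapping from variant name (e.g. 'foo@O2') to its disassembly text."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for name, disassembly in functions.items():
        digest = sha256(" ".join(disassembly.split()).encode()).hexdigest()
        buckets[digest].append(name)
    return [group for group in buckets.values() if len(group) > 1]
```
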
After training this base model, we found out that something had gone wrong when creating our dataset: the last
instruction of the previous function was included in the next one. Due to the long training process, and the
good performance of the model despite this mistake, we have decided not to retrain the model.

## Fairness Metrics

### Which metrics have been used to measure bias in the data/model and why?
n.a.

### What do those metrics show?
n.a.

### Any other notable issues?
n.a.

## Analyses (optional)
n.a.