Model Card
----------

_Who to contact:_ fbda [at] nfi [dot] nl \
_Version / Date:_ v1, 15/05/2025

TODO: add link to GitHub repo

## General

### What is the purpose of the model?

The model is a BERT model of ARM64 assembly code that can be used to find ARM64 functions that are similar to a given ARM64 function.

### What does the model architecture look like?

The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022). It is a BERT model (Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed by Wang et al. This architecture has subsequently been finetuned for semantic search purposes, following the procedure proposed by [S-BERT](https://www.sbert.net/examples/applications/semantic-search/README.html).

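The finetuning code itself is not part of this card. Purely as an illustration, the sketch below shows what an S-BERT-style training loop for such a bi-encoder could look like; the checkpoint path, the example pairs, the choice of `MultipleNegativesRankingLoss` and all hyperparameters are placeholders, not the settings that were actually used.

```python
# Illustrative sketch only: checkpoint path, data and loss are placeholders,
# not the configuration that was actually used to train this model.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Wrap a pretrained (ARM64) BERT encoder with mean pooling to get one vector per function.
encoder = models.Transformer("path/to/arm64-bert-base", max_seq_length=512)  # placeholder path
pooling = models.Pooling(encoder.get_word_embedding_dimension())
model = SentenceTransformer(modules=[encoder, pooling])

# Positive pairs: the same source function compiled with different settings.
train_examples = [
    InputExample(texts=["<function foo compiled with -O0>", "<function foo compiled with -O3>"]),
    InputExample(texts=["<function bar compiled with -O1>", "<function bar compiled with -Os>"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other function in the batch acts as a negative example.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```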
### What is the output of the model?

The model returns a 768-dimensional vector for each function it is given. These vectors can be compared to get an indication of which functions are similar to each other.

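As an illustration of how these vectors can be obtained and compared, the minimal sketch below encodes two functions and computes their cosine similarity. The model identifier and the way a function is serialised into a single string are placeholder assumptions.

```python
# Minimal sketch: the model id and the input format are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NetherlandsForensicInstitute/ARM64bert-embedding")  # placeholder id

# Each input is one ARM64 function, serialised as a single instruction string.
func_a = "stp x29, x30, [sp, #-16]! mov x29, sp bl printf ..."
func_b = "sub sp, sp, #32 str w0, [sp, #12] bl puts ..."

emb_a, emb_b = model.encode([func_a, func_b])  # two 768-dimensional vectors
print(util.cos_sim(emb_a, emb_b))              # similarity score, higher means more alike
```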
### How does the model perform?

The model has been evaluated on [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) and [Recall@1](https://en.wikipedia.org/wiki/Precision_and_recall).
When the model has to pick the positive example out of a pool of 32 functions, it ranks the positive example highest in 99% of cases.
When the pool is significantly enlarged to 10,000 functions, it still ranks the positive example first in 83% of cases, and high in the ranking on average (MRR of 0.87).

| Model   | Pool size | MRR  | Recall@1 |
|---------|-----------|------|----------|
| ASMBert | 32        | 0.99 | 0.99     |
| ASMBert | 10,000    | 0.87 | 0.83     |

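For reference, the sketch below shows one way MRR and Recall@1 can be computed for this kind of pooled evaluation; it is a generic implementation, not the evaluation code that was actually used.

```python
import numpy as np

def mrr_and_recall_at_1(query_embs, pool_embs, positive_idx):
    """Rank each query's candidate pool by cosine similarity and score the true match.

    query_embs:   (n_queries, dim) embeddings of the query functions
    pool_embs:    (n_queries, pool_size, dim) candidate pools, one per query
    positive_idx: (n_queries,) index of the true match inside each pool
    """
    q = query_embs / np.linalg.norm(query_embs, axis=-1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=-1, keepdims=True)
    sims = np.einsum("qd,qpd->qp", q, p)           # cosine similarity per candidate
    order = np.argsort(-sims, axis=1)              # candidates sorted best-first
    ranks = np.argmax(order == positive_idx[:, None], axis=1) + 1  # rank of the true match
    return float((1.0 / ranks).mean()), float((ranks == 1).mean())
```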
## Purpose and use of the model

### For which problem has the model been designed?

The model has been designed to find similar ARM64 functions in a database of known ARM64 functions.

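As an illustration of this intended use, the sketch below embeds a database of known functions once and then retrieves the closest matches for a query function. The model identifier and the assembly strings are placeholder assumptions.

```python
# Sketch of the intended use; model id and inputs are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NetherlandsForensicInstitute/ARM64bert-embedding")  # placeholder id

# Database of known ARM64 functions (placeholder assembly strings).
known_functions = [
    "stp x29, x30, [sp, #-16]! ...",
    "sub sp, sp, #32 ...",
]
database = model.encode(known_functions, convert_to_tensor=True)

# Embed the query function and retrieve the most similar known functions.
query = "stp x29, x30, [sp, #-32]! ..."
query_emb = model.encode(query, convert_to_tensor=True)

for hit in util.semantic_search(query_emb, database, top_k=5)[0]:
    print(hit["corpus_id"], round(hit["score"], 3))
```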
### What else could the model be used for?

We do not see other applications for this model.

### To what problems is the model not applicable?

This model has been finetuned for the semantic search task; for a generic ARM64-BERT model, please refer to the [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert) we have published.

## Data

### What data was used for training and evaluation?

The dataset is created in the same way as Wang et al. create BinaryCorp. A large set of binary code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repository](https://aur.archlinux.org/).
All this code is split into functions, which are compiled with different optimization levels (O0, O1, O2, O3 and Os) and security settings (fortify or no-fortify). This results in a maximum of 10 (5 * 2) different versions of each function that are semantically similar, i.e. they represent the same functionality but are written differently.
The dataset is split into a train and a test set. This is done at the project level, so all binaries and functions belonging to one project are part of either the train or the test set, never both (a sketch of such a split is shown below the table). We have not performed any deduplication on the dataset for training.

| set   | # functions |
|-------|------------:|
| train |  18,083,285 |
| test  |   3,375,741 |

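The sketch below illustrates what such a project-level split could look like; the split fraction, the record schema and the `project` field name are placeholder assumptions, not a description of the exact procedure that was used.

```python
import random

def split_by_project(functions, test_fraction=0.15, seed=0):
    """Assign whole projects to train or test, so no project ends up in both sets."""
    # `functions` is assumed to be a list of dicts like {"project": ..., "asm": ...}.
    projects = sorted({f["project"] for f in functions})
    random.Random(seed).shuffle(projects)
    test_projects = set(projects[: int(len(projects) * test_fraction)])
    train = [f for f in functions if f["project"] not in test_projects]
    test = [f for f in functions if f["project"] in test_projects]
    return train, test
```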
### By whom was the dataset collected and annotated?

The dataset was collected by our team.

### Any remarks on data quality and bias?

After training our models, we found out that something had gone wrong when compiling our dataset: the last line (instruction) of the previous function was included in the next function. This has been fixed for the finetuning, but due to the long training process and the good performance of the model despite the mistake, we have decided not to retrain the base model.

## Fairness Metrics

### Which metrics have been used to measure bias in the data/model and why?

n.a.

### What do those metrics show?

n.a.

### Any other notable issues?

n.a.

## Analyses (optional)

n.a.