Model Card
----------

_Who to contact:_ fbda [at] nfi [dot] nl \
_Version / Date:_ v1, 15/05/2025

TODO: add link to GitHub repo

## General

### What is the purpose of the model?

The model is a BERT model of ARM64 assembly code that can be used to find ARM64 functions that are similar to a given ARM64 function.

### What does the model architecture look like?

The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022). It is a BERT model (Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed by Wang et al. This architecture has subsequently been finetuned for semantic search purposes, following the procedure proposed by [S-BERT](https://www.sbert.net/examples/applications/semantic-search/README.html).

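The finetuning code itself is not part of this card. Purely as an illustration, the sketch below shows what an S-BERT-style training loop for such a bi-encoder could look like; the checkpoint path, the example pairs, the choice of `MultipleNegativesRankingLoss` and all hyperparameters are placeholders, not the settings that were actually used.

```python
# Illustrative sketch only: checkpoint path, data and loss are placeholders,
# not the configuration that was actually used to train this model.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Wrap a pretrained (ARM64) BERT encoder with mean pooling to get one vector per function.
encoder = models.Transformer("path/to/arm64-bert-base", max_seq_length=512)  # placeholder path
pooling = models.Pooling(encoder.get_word_embedding_dimension())
model = SentenceTransformer(modules=[encoder, pooling])

# Positive pairs: the same source function compiled with different settings.
train_examples = [
    InputExample(texts=["<function foo compiled with -O0>", "<function foo compiled with -O3>"]),
    InputExample(texts=["<function bar compiled with -O1>", "<function bar compiled with -Os>"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other function in the batch acts as a negative example.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```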
### What is the output of the model?

The model returns a 768-dimensional vector for each function it is given. These vectors can be compared to get an indication of which functions are similar to each other.

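As an illustration of how these vectors can be obtained and compared, the minimal sketch below encodes two functions and computes their cosine similarity. The model identifier and the way a function is serialised into a single string are placeholder assumptions.

```python
# Minimal sketch: the model id and the input format are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NetherlandsForensicInstitute/ARM64bert-embedding")  # placeholder id

# Each input is one ARM64 function, serialised as a single instruction string.
func_a = "stp x29, x30, [sp, #-16]! mov x29, sp bl printf ..."
func_b = "sub sp, sp, #32 str w0, [sp, #12] bl puts ..."

emb_a, emb_b = model.encode([func_a, func_b])  # two 768-dimensional vectors
print(util.cos_sim(emb_a, emb_b))              # similarity score, higher means more alike
```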
### How does the model perform?

The model has been evaluated on [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) and [Recall@1](https://en.wikipedia.org/wiki/Precision_and_recall).
When the model has to pick the positive example out of a pool of 32 functions, it ranks the positive example highest in 99% of cases.
When the pool is significantly enlarged to 10,000 functions, it still ranks the positive example first in 83% of cases, and high in the ranking on average (MRR of 0.87).

| Model   | Pool size | MRR  | Recall@1 |
|---------|-----------|------|----------|
| ASMBert | 32        | 0.99 | 0.99     |
| ASMBert | 10,000    | 0.87 | 0.83     |

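For reference, the sketch below shows one way MRR and Recall@1 can be computed for this kind of pooled evaluation; it is a generic implementation, not the evaluation code that was actually used.

```python
import numpy as np

def mrr_and_recall_at_1(query_embs, pool_embs, positive_idx):
    """Rank each query's candidate pool by cosine similarity and score the true match.

    query_embs:   (n_queries, dim) embeddings of the query functions
    pool_embs:    (n_queries, pool_size, dim) candidate pools, one per query
    positive_idx: (n_queries,) index of the true match inside each pool
    """
    q = query_embs / np.linalg.norm(query_embs, axis=-1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=-1, keepdims=True)
    sims = np.einsum("qd,qpd->qp", q, p)           # cosine similarity per candidate
    order = np.argsort(-sims, axis=1)              # candidates sorted best-first
    ranks = np.argmax(order == positive_idx[:, None], axis=1) + 1  # rank of the true match
    return float((1.0 / ranks).mean()), float((ranks == 1).mean())
```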
## Purpose and use of the model

### For which problem has the model been designed?

The model has been designed to find similar ARM64 functions in a database of known ARM64 functions.

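As an illustration of this intended use, the sketch below embeds a database of known functions once and then retrieves the closest matches for a query function. The model identifier and the assembly strings are placeholder assumptions.

```python
# Sketch of the intended use; model id and inputs are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NetherlandsForensicInstitute/ARM64bert-embedding")  # placeholder id

# Database of known ARM64 functions (placeholder assembly strings).
known_functions = [
    "stp x29, x30, [sp, #-16]! ...",
    "sub sp, sp, #32 ...",
]
database = model.encode(known_functions, convert_to_tensor=True)

# Embed the query function and retrieve the most similar known functions.
query = "stp x29, x30, [sp, #-32]! ..."
query_emb = model.encode(query, convert_to_tensor=True)

for hit in util.semantic_search(query_emb, database, top_k=5)[0]:
    print(hit["corpus_id"], round(hit["score"], 3))
```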
### What else could the model be used for?

We do not see other applications for this model.

### To what problems is the model not applicable?

This model has been finetuned for the semantic search task; for a generic ARM64-BERT model, please refer to the [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert) we have published.

## Data

### What data was used for training and evaluation?

The dataset is created in the same way as Wang et al. create BinaryCorp. A large set of binary code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repository](https://aur.archlinux.org/).
All this code is split into functions, which are compiled with different optimization levels (O0, O1, O2, O3 and Os) and security settings (fortify or no-fortify). This results in a maximum of 10 (5 * 2) different versions of each function that are semantically similar, i.e. they represent the same functionality but are written differently.
The dataset is split into a train and a test set. This is done at the project level, so all binaries and functions belonging to one project are part of either the train or the test set, never both (a sketch of such a split is shown below the table). We have not performed any deduplication on the dataset for training.

| set   | # functions |
|-------|------------:|
| train |  18,083,285 |
| test  |   3,375,741 |

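The sketch below illustrates what such a project-level split could look like; the split fraction, the record schema and the `project` field name are placeholder assumptions, not a description of the exact procedure that was used.

```python
import random

def split_by_project(functions, test_fraction=0.15, seed=0):
    """Assign whole projects to train or test, so no project ends up in both sets."""
    # `functions` is assumed to be a list of dicts like {"project": ..., "asm": ...}.
    projects = sorted({f["project"] for f in functions})
    random.Random(seed).shuffle(projects)
    test_projects = set(projects[: int(len(projects) * test_fraction)])
    train = [f for f in functions if f["project"] not in test_projects]
    test = [f for f in functions if f["project"] in test_projects]
    return train, test
```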
### By whom was the dataset collected and annotated?

The dataset was collected by our team.

### Any remarks on data quality and bias?

After training our models, we found out that something had gone wrong when compiling our dataset: the last line (instruction) of the previous function was included in the next function. This has been fixed for the finetuning, but due to the long training process and the good performance of the model despite the mistake, we have decided not to retrain the base model.

## Fairness Metrics

### Which metrics have been used to measure bias in the data/model and why?

n.a.

### What do those metrics show?

n.a.

### Any other notable issues?

n.a.

## Analyses (optional)

n.a.