Judithvdw committed d24ff29 (verified, parent: c345d84): Update README.md

Files changed (1): README.md (+89, -4)
---
license: eupl-1.1
language: code
---
Model Card
----------

_Who to contact:_ fbda [at] nfi [dot] nl \
_Version / Date:_ v1, 15/05/2025 \
TODO: add link to github repo

## General
### What is the purpose of the model?
The model is a BERT model of ARM64 assembly code that can be used to find ARM64 functions that are similar to a given ARM64 function.

### What does the model architecture look like?
The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022). It is a BERT model
(Devlin et al., 2019) in which the usual Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed by Wang et al.
This architecture has subsequently been finetuned for semantic search, following the procedure proposed by [S-BERT](https://www.sbert.net/examples/applications/semantic-search/README.html).

### What is the output of the model?
The model returns a 768-dimensional vector for each function it is given. These vectors can be compared to
get an indication of which functions are similar to each other.

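For instance, two of these vectors can be compared with cosine similarity; a minimal NumPy sketch (the random vectors below are stand-ins for actual model output, not real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for two 768-dimensional function embeddings.
rng = np.random.default_rng(0)
emb_query = rng.normal(size=768)
emb_candidate = rng.normal(size=768)

# Higher score (closer to 1) indicates more similar functions.
score = cosine_similarity(emb_query, emb_candidate)
```

Ranking a database of known functions by this score against a query embedding yields the "most similar functions" search described above.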
### How does the model perform?
The model has been evaluated on [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) and
[Recall@1](https://en.wikipedia.org/wiki/Precision_and_recall).
When the model has to pick the positive example out of a pool of 32, it ranks the positive example highest most of the time.
When the pool is significantly enlarged to 10,000 functions, it still ranks the positive example first or second in most cases.

| Model   | Pool size | MRR  | Recall@1 |
|---------|-----------|------|----------|
| ASMBert | 32        | 0.99 | 0.99     |
| ASMBert | 10,000    | 0.87 | 0.83     |

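For reference, both metrics can be computed from the 1-based rank that the positive example obtains in each query's pool; a minimal sketch (the ranks below are illustrative, not taken from the actual evaluation):

```python
def mrr(ranks: list[int]) -> float:
    """Mean Reciprocal Rank: average of 1/rank over all queries (1-based ranks)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_1(ranks: list[int]) -> float:
    """Fraction of queries where the positive example is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

# Illustrative ranks of the positive example for five queries.
ranks = [1, 1, 2, 1, 5]
print(mrr(ranks))         # 0.74
print(recall_at_1(ranks)) # 0.6
```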
## Purpose and use of the model

### For which problem has the model been designed?
The model has been designed to find similar ARM64 functions in a database of known ARM64 functions.

### What else could the model be used for?
We do not see other applications for this model.

### To what problems is the model not applicable?
This model has been finetuned on the semantic search task; for a generic ARM64-BERT model, please refer to the [other
model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert) we have published.


## Data
### What data was used for training and evaluation?
The dataset is created in the same way as Wang et al. create BinaryCorp. A large set of binary code comes from the
[ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/).
All this code is split into functions that are compiled with different optimization levels
(O0, O1, O2, O3 and Os) and security settings (fortify or no-fortify). This results
in a maximum of 10 (5 * 2) different functions that are semantically similar, i.e. they represent the same functionality but are written differently.
The dataset is split into a train and a test set. This is done at the project level, so all binaries and functions belonging to one project are part of
either the train or the test set, never both. We have not performed any deduplication on the dataset for training.

| set   | # functions |
|-------|------------:|
| train |  18,083,285 |
| test  |   3,375,741 |

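Such a project-level split can be sketched as a deterministic assignment based on the project name, so that all functions of a project are guaranteed to land in the same set; the project names and the test fraction below are illustrative assumptions, not the actual values used:

```python
import hashlib

def assign_split(project: str, test_fraction: float = 0.15) -> str:
    """Deterministically assign a project to 'train' or 'test'.

    Hashing the project name (rather than splitting per function) keeps
    all functions from one project in the same set, preventing leakage
    of near-identical functions across the train/test boundary.
    """
    digest = hashlib.sha256(project.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Illustrative (project, function) pairs.
functions = [("openssl", "fn_a"), ("openssl", "fn_b"), ("curl", "fn_c")]
splits = {(proj, fn): assign_split(proj) for proj, fn in functions}
```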
### By whom was the dataset collected and annotated?
The dataset was collected by our team.

### Any remarks on data quality and bias?
After training our models, we found out that something had gone wrong when compiling our dataset: the last line (instruction)
of the previous function was included in the next function. This has been fixed for the finetuning, but due to the long training process and the
good performance of the model despite the mistake, we have decided not to retrain the base model.


## Fairness Metrics

### Which metrics have been used to measure bias in the data/model and why?
n.a.

### What do those metrics show?
n.a.

### Any other notable issues?
n.a.

## Analyses (optional)
n.a.