File size: 2,499 Bytes
65743a6 0b1bba2 83074ed 2aa9ee7 734e51c 2aa9ee7 83074ed 6464326 83074ed 65743a6 2aa9ee7 c089279 2aa9ee7 b557f32 5207533 2aa9ee7 25e2b57 2aa9ee7 cf75e58 2aa9ee7 25e2b57 863178e d57e130 2aa9ee7 5a11720 b130037 a617dab b130037 aa0cc44 e0e11e3 b130037 2aa9ee7 da148a5 0b1bba2 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 |
---
license: apache-2.0
datasets:
- FredZhang7/malicious-website-features-2.4M
wget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- 'no'
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
---
It's very important to note that this model is not production-ready.
<br>
The classification task for v1 is split into two stages:
1. URL features model
- **96.5%+ accurate** on training and validation data
- 2,436,727 rows of labelled URLs
- evaluation from v2: slightly overfitted, by perhaps around 0.8%
2. Website features model
- **98.4% accurate** on training data, and **98.9% accurate** on validation data
- 911,180 rows of 42 features
- evaluation from v2: slightly biased towards the URL feature (bert_confidence) more than the other columns
## Training
I applied cross-validation with `cv=5` to the training dataset to search for the best hyperparameters.
Here's the dict passed to `sklearn`'s `GridSearchCV` function:
```python
params = {
'objective': 'binary',
'metric': 'binary_logloss',
'boosting_type': ['gbdt', 'dart'],
'num_leaves': [15, 23, 31, 63],
'learning_rate': [0.001, 0.002, 0.01, 0.02],
'feature_fraction': [0.5, 0.6, 0.7, 0.9],
'early_stopping_rounds': [10, 20],
'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
}
```
To reproduce the 98.4% accurate model, you can follow the data analysis on the [dataset page](https://huggingface.co/datasets/FredZhang7/malicious-website-features-2.4M) to filter out the unimportant features.
Then train a LightGBM model using the most suited hyperparamters for this task:
```python
params = {
'objective': 'binary',
'metric': 'binary_logloss',
'boosting_type': 'gbdt',
'num_leaves': 31,
'learning_rate': 0.01,
'feature_fraction': 0.6,
'early_stopping_rounds': 10,
'num_boost_round': 800
}
```
## URL Features
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```
## Website Features
```bash
pip install lightgbm
```
```python
import lightgbm as lgb
lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")
``` |