|
--- |
|
library_name: sentence-transformers |
|
tags: |
|
- cross-encoder |
|
- cyber |
|
- cybersecurity |
|
- code |
|
license: mit |
|
metrics: |
|
- accuracy |
|
- f1 |
|
- recall |
|
- precision |
|
base_model: |
|
- google/canine-c |
|
pipeline_tag: text-classification |
|
datasets: |
|
- Anvilogic/CE-Typosquat-Training-Dataset |
|
--- |
|
|
|
# Typosquat CE detector |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
This model is a cross-encoder fine-tuned for binary classification of typosquatting domain names, built on the character-level CANINE-c transformer.

The model classifies whether a candidate domain name is a typographical variant (typosquat) of a legitimate domain.
|
|
|
- **Developed by:** Anvilogic |
|
- **Model type:** Cross-encoder binary classification |
|
- **Maximum Sequence Length:** 512 tokens
|
- **Language(s) (NLP):** Multilingual |
|
- **License:** MIT |
|
- **Finetuned from model:** [google/canine-c](https://huggingface.co/google/canine-c)
|
|
|
|
|
## Usage |
|
|
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by analyzing a domain name's similarity to a legitimate one.
|
|
|
To get started, the following code loads the model and scores a domain pair:
|
```python |
|
from sentence_transformers import CrossEncoder |
|
|
|
# Load the fine-tuned cross-encoder from the Hugging Face Hub
model = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")

# Score a (legitimate domain, candidate domain) pair; a higher score indicates
# that the second domain is more likely a typosquat of the first
result = model.predict([("example.com", "exarnple.com")])
print(result)
|
``` |
|
|
|
### Downstream Usage |
|
This model can be paired with an embedding model to enhance typosquatting detection.

First, the embedding model retrieves the most similar entries from a database of legitimate domains.

Then, the cross-encoder scores each retrieved pair, confirming whether the candidate domain is a typosquat and identifying which legitimate domain it imitates.
|
|
|
For embedding, consider using: [Anvilogic/Embedder-typosquat-detect](https://huggingface.co/Anvilogic/Embedder-typosquat-detect) |
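
A minimal sketch of this retrieve-then-rerank flow is shown below. The domain list, `top_k` value, and pair ordering (legitimate domain first, candidate second, following the direct usage example above) are illustrative assumptions rather than part of a released pipeline.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Hypothetical database of legitimate domains; replace with your own list.
legit_domains = ["example.com", "google.com", "paypal.com"]
candidate = "paypa1.com"

# Step 1: retrieve the closest legitimate domains with the embedding model.
embedder = SentenceTransformer("Anvilogic/Embedder-typosquat-detect")
legit_embeddings = embedder.encode(legit_domains, convert_to_tensor=True)
candidate_embedding = embedder.encode(candidate, convert_to_tensor=True)
hits = util.semantic_search(candidate_embedding, legit_embeddings, top_k=3)[0]

# Step 2: confirm with the cross-encoder, which scores each retrieved pair.
cross_encoder = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")
pairs = [(legit_domains[hit["corpus_id"]], candidate) for hit in hits]
scores = cross_encoder.predict(pairs)

for (legit, cand), score in zip(pairs, scores):
    print(f"{cand} vs {legit}: {score:.3f}")
```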
|
|
|
## Bias, Risks, and Limitations |
|
|
|
Users are advised to use this model as a supportive tool rather than a sole indicator for domain security. |
|
Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing. |
|
|
|
## Training Details |
|
|
|
### Framework Versions |
|
- Python: 3.10.14 |
|
- Sentence Transformers: 3.2.1 |
|
- Transformers: 4.46.2 |
|
- PyTorch: 2.2.2 |
|
- Tokenizers: 0.20.3 |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned using [Anvilogic/CE-Typosquat-Training-Dataset](https://huggingface.co/datasets/Anvilogic/CE-Typosquat-Training-Dataset), which contains pairs of domain names and their similarity labels. |
|
The dataset was filtered and converted to Parquet format for efficient processing.
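
As a quick way to inspect the data, it can be loaded with the Hugging Face `datasets` library; the `train` split name below is an assumption.

```python
from datasets import load_dataset

# Load the domain-pair dataset from the Hugging Face Hub
dataset = load_dataset("Anvilogic/CE-Typosquat-Training-Dataset")
print(dataset)

# Assumes a "train" split; shows one domain pair and its label
print(dataset["train"][0])
```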
|
|
|
### Training Procedure |
|
The model was optimized with a binary cross-entropy with logits loss (`nn.BCEWithLogitsLoss()`).
|
|
|
#### Training Hyperparameters |
|
- **Model Architecture**: Cross-encoder fine-tuned from [canine-c](https://huggingface.co/google/canine-c) |
|
- **Batch Size**: 64 |
|
- **Epochs**: 3 |
|
- **Learning Rate**: 2e-5 |
|
- **Warmup Steps**: 100 |
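
Below is a minimal sketch of how this setup could be reproduced with the sentence-transformers `CrossEncoder` API, using the hyperparameters above; the training pairs shown are illustrative placeholders, not samples from the actual dataset.

```python
import torch.nn as nn
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Illustrative placeholder pairs; the real data comes from the training dataset above.
train_samples = [
    InputExample(texts=["example.com", "exarnple.com"], label=1.0),   # typosquat pair
    InputExample(texts=["example.com", "wikipedia.org"], label=0.0),  # unrelated pair
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64)

# CANINE-c base with a single output logit for binary classification
model = CrossEncoder("google/canine-c", num_labels=1, max_length=512)

model.fit(
    train_dataloader=train_dataloader,
    loss_fct=nn.BCEWithLogitsLoss(),  # binary cross-entropy with logits
    epochs=3,
    warmup_steps=100,
    optimizer_params={"lr": 2e-5},
)
```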
|
|
|
|
|
## Evaluation |
|
|
|
In the final evaluation after training, the model achieved the following metrics on the test set: |
|
|
|
**CE Binary Classification Evaluator** |
|
```text
Accuracy          : 0.9740
F1 Score          : 0.9737
Precision         : 0.9836
Recall            : 0.9640
Average Precision : 0.9969
|
``` |
|
These results indicate the model's high performance in identifying typosquatting domains, with strong precision and recall scores that make it well-suited for cybersecurity applications. |
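
For reference, a comparable evaluation can be run with sentence-transformers' `CEBinaryClassificationEvaluator`; the test pairs below are placeholders for a labeled held-out set.

```python
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator

# Placeholder held-out pairs; substitute a real labeled test set
test_samples = [
    InputExample(texts=["example.com", "exarnple.com"], label=1),
    InputExample(texts=["example.com", "wikipedia.org"], label=0),
]

model = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")
evaluator = CEBinaryClassificationEvaluator.from_input_examples(test_samples, name="typosquat-test")
print(evaluator(model))  # returns the average precision score
```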