---
library_name: sentence-transformers
tags:
- cross-encoder
- cyber
- cybersecurity
- code
license: mit
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- google/canine-c
pipeline_tag: text-classification
datasets:
- Anvilogic/CE-Typosquat-Training-Dataset
---
# Typosquat CE detector
## Model Details
### Model Description
This model is a cross-encoder fine-tuned from the character-level CANINE-c transformer for binary classification of domain name pairs.
Given two domain names, it predicts whether one is a typographical variant (typosquat) of the other.
- **Developed by:** Anvilogic
- **Model type:** Cross-encoder binary classification
- **Maximum Sequence Length:** 512 tokens
- **Language(s) (NLP):** Multilingual
- **License:** MIT
- **Finetuned from model:** [google/canine-c](https://huggingface.co/google/canine-c)
## Usage
### Direct Usage (Sentence Transformers)
This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by analyzing a domain name's similarity to a legitimate one.
The following code loads the model and scores a pair of domains:
```python
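from sentence_transformers import CrossEncoder
# Load the fine-tuned cross-encoder from the Hugging Face Hub
model = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")
# Score one pair of domains; predict() returns one score per input pair
result = model.predict([("example.com", "exarnple.com")])
print(result)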
```
### Downstream Usage
This model can be used with an embedding model to enhance typosquatting detection.
First, an embedding model retrieves similar domains from a legitimate database.
Then, the cross-encoder labels each retrieved pair, confirming whether the domain is a typosquat and identifying the legitimate domain it imitates.
For embedding, consider using: [Anvilogic/Embedder-typosquat-detect](https://huggingface.co/Anvilogic/Embedder-typosquat-detect)
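The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `find_typosquat_source`, `load_models`, `top_k`, and the `0.5` threshold are hypothetical names and values chosen for the example. It also assumes that the embedder loads as a `SentenceTransformer`, that pairs are ordered `(legitimate, candidate)` as in the snippet under Direct Usage, and that higher cross-encoder scores mean "more likely a typosquat".

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_typosquat_source(
    candidate: str,
    legit_domains: List[str],
    embedder,        # any object with encode(list[str]) -> sequence of vectors
    cross_encoder,   # any object with predict(list[tuple[str, str]]) -> scores
    top_k: int = 5,             # hypothetical shortlist size
    threshold: float = 0.5,     # hypothetical decision threshold
) -> Optional[Tuple[str, float]]:
    """Return (legitimate_domain, score) if the candidate looks like a typosquat."""
    if not legit_domains:
        return None

    # Stage 1: retrieve the legitimate domains closest to the candidate.
    vectors = embedder.encode([candidate] + legit_domains)
    query, corpus = vectors[0], vectors[1:]
    ranked = sorted(
        zip(legit_domains, corpus),
        key=lambda pair: cosine(query, pair[1]),
        reverse=True,
    )
    shortlist = [domain for domain, _ in ranked[:top_k]]

    # Stage 2: let the cross-encoder score each (legitimate, candidate) pair.
    scores = cross_encoder.predict([(legit, candidate) for legit in shortlist])
    best_score, best_domain = max(
        ((float(s), d) for s, d in zip(scores, shortlist)),
        key=lambda sd: sd[0],
    )
    return (best_domain, best_score) if best_score >= threshold else None


def load_models():
    """Load the two models named in this card (downloads weights on first use).

    Assumption: the embedder repository is loadable with SentenceTransformer.
    """
    from sentence_transformers import CrossEncoder, SentenceTransformer

    embedder = SentenceTransformer("Anvilogic/Embedder-typosquat-detect")
    cross_encoder = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")
    return embedder, cross_encoder
```

A call might look like `find_typosquat_source("exarnple.com", ["example.com", "google.com"], *load_models())`. The sketch re-embeds the legitimate list on every call to stay short; a real deployment would likely precompute and index those embeddings.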
## Bias, Risks, and Limitations
Users are advised to use this model as a supportive tool rather than a sole indicator for domain security.
Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.
## Training Details
### Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.2.1
- Transformers: 4.46.2
- PyTorch: 2.2.2
- Tokenizers: 0.20.3
### Training Data
The model was fine-tuned using [Anvilogic/CE-Typosquat-Training-Dataset](https://huggingface.co/datasets/Anvilogic/CE-Typosquat-Training-Dataset), which contains pairs of domain names and their similarity labels.
The dataset was filtered and converted to the parquet format for efficient processing.
### Training Procedure
The model was optimized using the binary cross-entropy loss function with logits, `nn.BCEWithLogitsLoss()`.
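To make the loss concrete, here is a small pure-Python sketch of what binary cross-entropy with logits computes for one raw model output and one label. It is an illustration only, not the training code; the function names are hypothetical, and the numerically stable form `max(x, 0) - x*y + log(1 + exp(-|x|))` is algebraically equal to `-[y*log(sigmoid(x)) + (1-y)*log(1 - sigmoid(x))]`.

```python
import math
from typing import Sequence


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy for a single raw logit and a 0/1 label.

    Uses the numerically stable rewrite of
    -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits: Sequence[float], labels: Sequence[float]) -> float:
    """Average the per-example loss over a batch (mean reduction)."""
    return sum(bce_with_logits(x, y) for x, y in zip(logits, labels)) / len(logits)


# A confident, correct prediction gives a small loss; a confident, wrong one a large loss.
print(bce_with_logits(4.0, 1.0))   # small
print(bce_with_logits(4.0, 0.0))   # large
```

Because the loss takes raw logits, the sigmoid is applied inside the loss rather than by the model's final layer during training.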
#### Training Hyperparameters
- **Model Architecture**: Cross-encoder fine-tuned from [canine-c](https://huggingface.co/google/canine-c)
- **Batch Size**: 64
- **Epochs**: 3
- **Learning Rate**: 2e-5
- **Warmup Steps**: 100
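The card does not state which learning-rate scheduler was used. As an illustration of how the learning rate and warmup steps interact, the sketch below assumes a common choice: linear warmup to the peak rate followed by linear decay to zero. The function name and the `total_steps` value are hypothetical, since the total number of training steps is not reported here.

```python
def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 2e-5,     # learning rate from the list above
    warmup_steps: int = 100,   # warmup steps from the list above
) -> float:
    """Learning rate under linear warmup then linear decay (assumed schedule)."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak rate.
        return peak_lr * step / warmup_steps
    # Decay linearly from the peak rate down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1,000 optimizer steps.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```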
## Evaluation
In the final evaluation after training, the model achieved the following metrics on the test set:
**CE Binary Classification Evaluator**
```text
Accuracy          : 0.9740
F1 Score          : 0.9737
Precision         : 0.9836
Recall            : 0.964
Average Precision : 0.9969
```
On this test set, precision (0.9836) is higher than recall (0.964): few of the pairs the model flags are false alarms, while about 3.6% of true typosquats are missed. This supports using the model as one signal in a cybersecurity workflow rather than as the sole indicator.
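The reported F1 score can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below uses only the figures listed above; the helper name is hypothetical.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.9836, 0.964   # values reported above
f1 = f1_from_precision_recall(precision, recall)
print(f"F1 from precision and recall: {f1:.4f}")   # prints 0.9737
print(f"Share of typosquats missed:   {1 - recall:.1%}")
```

The recomputed value agrees with the reported F1 of 0.9737 to four decimal places.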