|
--- |
|
library_name: sentence-transformers |
|
tags: |
|
- cross-encoder |
|
- cyber |
|
- cybersecurity |
|
- code |
|
license: mit |
|
metrics: |
|
- accuracy |
|
- f1 |
|
- recall |
|
- precision |
|
base_model: |
|
- google/canine-c |
|
pipeline_tag: text-classification |
|
datasets: |
|
- Anvilogic/CE-Typosquat-Training-Dataset |
|
--- |
|
|
|
# Typosquat CE detector |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
This model is a cross-encoder fine-tuned for binary classification of typosquatting domain names, built on the character-level CANINE-c transformer.

The model classifies whether a candidate domain name is a typographical variant (typosquat) of a legitimate domain.
|
|
|
- **Developed by:** Anvilogic |
|
- **Model type:** Cross-encoder binary classification |
|
- **Maximum Sequence Length:** 512 tokens
|
- **Language(s) (NLP):** Multilingual |
|
- **License:** MIT |
|
- **Finetuned from model:** [google/canine-c](https://huggingface.co/google/canine-c)
|
|
|
|
|
## Usage |
|
|
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by analyzing a domain name's similarity to a legitimate one.
|
|
|
To get started, the following code loads the model and scores a domain pair:
|
```python |
|
from sentence_transformers import CrossEncoder |
|
|
|
# Load the fine-tuned cross-encoder from the Hugging Face Hub
model = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")

# Score a (legitimate domain, candidate domain) pair; a higher score indicates
# that the second domain is more likely a typosquat of the first
result = model.predict([("example.com", "exarnple.com")])
print(result)
|
``` |
|
|
|
### Downstream Usage |
|
This model can be paired with an embedding model to enhance typosquatting detection.

First, the embedding model retrieves the most similar entries from a database of legitimate domains.

Then, the cross-encoder scores each retrieved pair, confirming whether the candidate domain is a typosquat and identifying which legitimate domain it imitates.
|
|
|
For embedding, consider using: [Anvilogic/Embedder-typosquat-detect](https://huggingface.co/Anvilogic/Embedder-typosquat-detect) |
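
A minimal sketch of this retrieve-then-rerank flow is shown below. The domain list, `top_k` value, and pair ordering (legitimate domain first, candidate second, following the direct usage example above) are illustrative assumptions rather than part of a released pipeline.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Hypothetical database of legitimate domains; replace with your own list.
legit_domains = ["example.com", "google.com", "paypal.com"]
candidate = "paypa1.com"

# Step 1: retrieve the closest legitimate domains with the embedding model.
embedder = SentenceTransformer("Anvilogic/Embedder-typosquat-detect")
legit_embeddings = embedder.encode(legit_domains, convert_to_tensor=True)
candidate_embedding = embedder.encode(candidate, convert_to_tensor=True)
hits = util.semantic_search(candidate_embedding, legit_embeddings, top_k=3)[0]

# Step 2: confirm with the cross-encoder, which scores each retrieved pair.
cross_encoder = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")
pairs = [(legit_domains[hit["corpus_id"]], candidate) for hit in hits]
scores = cross_encoder.predict(pairs)

for (legit, cand), score in zip(pairs, scores):
    print(f"{cand} vs {legit}: {score:.3f}")
```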
|
|
|
## Bias, Risks, and Limitations |
|
|
|
Users are advised to use this model as a supportive tool rather than a sole indicator for domain security. |
|
Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing. |
|
|
|
## Training Details |
|
|
|
### Framework Versions |
|
- Python: 3.10.14 |
|
- Sentence Transformers: 3.2.1 |
|
- Transformers: 4.46.2 |
|
- PyTorch: 2.2.2 |
|
- Tokenizers: 0.20.3 |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned using [Anvilogic/CE-Typosquat-Training-Dataset](https://huggingface.co/datasets/Anvilogic/CE-Typosquat-Training-Dataset), which contains pairs of domain names and their similarity labels. |
|
The dataset was filtered and converted to Parquet format for efficient processing.
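
As a quick way to inspect the data, it can be loaded with the Hugging Face `datasets` library; the `train` split name below is an assumption.

```python
from datasets import load_dataset

# Load the domain-pair dataset from the Hugging Face Hub
dataset = load_dataset("Anvilogic/CE-Typosquat-Training-Dataset")
print(dataset)

# Assumes a "train" split; shows one domain pair and its label
print(dataset["train"][0])
```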
|
|
|
### Training Procedure |
|
The model was optimized with a binary cross-entropy with logits loss (`nn.BCEWithLogitsLoss()`).
|
|
|
#### Training Hyperparameters |
|
- **Model Architecture**: Cross-encoder fine-tuned from [canine-c](https://huggingface.co/google/canine-c) |
|
- **Batch Size**: 64 |
|
- **Epochs**: 3 |
|
- **Learning Rate**: 2e-5 |
|
- **Warmup Steps**: 100 |
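
Below is a minimal sketch of how this setup could be reproduced with the sentence-transformers `CrossEncoder` API, using the hyperparameters above; the training pairs shown are illustrative placeholders, not samples from the actual dataset.

```python
import torch.nn as nn
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Illustrative placeholder pairs; the real data comes from the training dataset above.
train_samples = [
    InputExample(texts=["example.com", "exarnple.com"], label=1.0),   # typosquat pair
    InputExample(texts=["example.com", "wikipedia.org"], label=0.0),  # unrelated pair
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64)

# CANINE-c base with a single output logit for binary classification
model = CrossEncoder("google/canine-c", num_labels=1, max_length=512)

model.fit(
    train_dataloader=train_dataloader,
    loss_fct=nn.BCEWithLogitsLoss(),  # binary cross-entropy with logits
    epochs=3,
    warmup_steps=100,
    optimizer_params={"lr": 2e-5},
)
```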
|
|
|
|
|
## Evaluation |
|
|
|
In the final evaluation after training, the model achieved the following metrics on the test set: |
|
|
|
**CE Binary Classification Evaluator** |
|
```text
Accuracy          : 0.9740
F1 Score          : 0.9737
Precision         : 0.9836
Recall            : 0.9640
Average Precision : 0.9969
|
``` |
|
These results indicate the model's high performance in identifying typosquatting domains, with strong precision and recall scores that make it well-suited for cybersecurity applications. |
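
For reference, a comparable evaluation can be run with sentence-transformers' `CEBinaryClassificationEvaluator`; the test pairs below are placeholders for a labeled held-out set.

```python
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator

# Placeholder held-out pairs; substitute a real labeled test set
test_samples = [
    InputExample(texts=["example.com", "exarnple.com"], label=1),
    InputExample(texts=["example.com", "wikipedia.org"], label=0),
]

model = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")
evaluator = CEBinaryClassificationEvaluator.from_input_examples(test_samples, name="typosquat-test")
print(evaluator(model))  # returns the average precision score
```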