scampion committed on
Commit 4437a17 · verified · 1 Parent(s): dad13e2

End of training

Files changed (2):
  1. README.md +48 -116
  2. model.safetensors +1 -1
README.md CHANGED
@@ -1,138 +1,70 @@
  ---
- datasets:
- - ai4privacy/pii-masking-400k
  metrics:
- - accuracy
- - recall
  - precision
  - f1
- base_model:
- - answerdotai/ModernBERT-base
- pipeline_tag: token-classification
- tags:
- - pii
- - privacy
- - personal
- - identification
  ---
- # 🐟 PII-RANHA: Privacy-Preserving Token Classification Model
-
- ## Overview
- PII-RANHA is a fine-tuned token classification model based on **ModernBERT-base** from Answer.AI. It identifies and classifies Personally Identifiable Information (PII) in text data. The model is trained on the `ai4privacy/pii-masking-400k` dataset and can detect 17 different PII categories, such as account numbers, credit card numbers, and email addresses.
-
- This model is intended for privacy-preserving applications such as data anonymization, redaction, and compliance with data protection regulations.
-
- ## Model Details
-
- ### Model Architecture
- - **Base Model**: `answerdotai/ModernBERT-base`
- - **Task**: Token Classification
- - **Number of Labels**: 18 (17 PII categories + "O" for non-PII tokens; the full inventory can be listed as sketched below)
-
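- The label inventory itself can be read from the checkpoint configuration. A minimal sketch, not from the original card, assuming the standard `id2label` mapping in the model config:
-
- ```python
- from transformers import AutoConfig
-
- # Enumerate the 18 labels (17 PII categories plus "O") from the model config.
- config = AutoConfig.from_pretrained("scampion/piiranha")
- print([config.id2label[i] for i in range(config.num_labels)])
- ```
-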
- ## Usage
-
- ### Installation
- To use the model, ensure you have the `transformers` and `datasets` libraries installed:
-
- ```bash
- pip install transformers datasets
- ```
-
- ### Inference Example
- Here’s how to load and use the model for PII detection:
-
- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
-
- # Load the model and tokenizer
- model_name = "scampion/piiranha"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForTokenClassification.from_pretrained(model_name)
-
- # Create a token classification pipeline
- pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)
-
- # Example input
- text = "My email is john.doe@example.com and my phone number is 555-123-4567."
-
- # Detect PII
- results = pii_pipeline(text)
- for entity in results:
-     print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")
- ```
-
- The pipeline returns per-token predictions (the `Ġ` prefix marks a word-initial subword); a sketch for merging fragments follows the output below:
-
- ```text
- Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445
- Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
- Entity: ., Label: I-USERNAME, Score: 0.5871
- Entity: do, Label: I-USERNAME, Score: 0.5350
- Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
- Entity: -, Label: I-SOCIALNUM, Score: 0.5948
- Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
- Entity: -, Label: I-SOCIALNUM, Score: 0.6151
- Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
- Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
- ```
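-
- Subword fragments like these can be merged into whole spans by the pipeline itself. A minimal sketch using the standard `transformers` `aggregation_strategy` option (not shown in the original card; aggregated results expose `entity_group` instead of `entity`):
-
- ```python
- from transformers import pipeline
-
- # Merge contiguous subword tokens into single entity spans.
- merged_pipeline = pipeline(
-     "token-classification",
-     model="scampion/piiranha",
-     aggregation_strategy="simple",
- )
-
- for entity in merged_pipeline("My email is john.doe@example.com."):
-     # Aggregated results carry 'entity_group' rather than per-token 'entity'.
-     print(f"Span: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.4f}")
- ```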
-
- ## Training Details
-
- ### Dataset
- The model was trained on the `ai4privacy/pii-masking-400k` dataset, which contains 400,000 examples of text with annotated PII tokens.
-
- ### Training Configuration
- - **Batch Size:** 32
- - **Learning Rate:** 5e-6
- - **Epochs:** 4
- - **Optimizer:** AdamW
- - **Weight Decay:** 0.01
- - **Scheduler:** Linear learning rate scheduler
-
- ### Evaluation Metrics
- The model was evaluated using the following metrics, computed as sketched below:
- - Precision
- - Recall
- - F1 Score
- - Accuracy
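-
- A hedged sketch of how such metrics are typically computed for token classification with the `evaluate`/`seqeval` libraries (assumed tooling, not the author's exact script; label -100 marks special tokens in the usual tokenization scheme):
-
- ```python
- import numpy as np
- import evaluate
- from transformers import AutoConfig
-
- seqeval = evaluate.load("seqeval")
- config = AutoConfig.from_pretrained("scampion/piiranha")
- label_list = [config.id2label[i] for i in range(config.num_labels)]  # the 18 labels
-
- def compute_metrics(eval_pred):
-     logits, labels = eval_pred
-     predictions = np.argmax(logits, axis=-1)
-     # Special tokens carry label -100 and are excluded from scoring.
-     true_predictions = [[label_list[p] for p, l in zip(pred, lab) if l != -100]
-                         for pred, lab in zip(predictions, labels)]
-     true_labels = [[label_list[l] for p, l in zip(pred, lab) if l != -100]
-                    for pred, lab in zip(predictions, labels)]
-     results = seqeval.compute(predictions=true_predictions, references=true_labels)
-     return {"precision": results["overall_precision"],
-             "recall": results["overall_recall"],
-             "f1": results["overall_f1"],
-             "accuracy": results["overall_accuracy"]}
- ```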
-
- | Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       | Accuracy |
- |-------|---------------|-----------------|-----------|----------|----------|----------|
- | 1     | 0.026000      | 0.026693        | 0.808574  | 0.845563 | 0.826655 | 0.990215 |
- | 2     | 0.019300      | 0.020881        | 0.849764  | 0.879042 | 0.864155 | 0.992203 |
- | 3     | 0.016100      | 0.019111        | 0.859251  | 0.882796 | 0.870865 | 0.992912 |
- | 4     | 0.012200      | 0.019017        | 0.860648  | 0.888844 | 0.874519 | 0.993073 |
 
108
- Would you like me to help analyze any trends in these metrics?
109
 
-
- ## License
- This model is licensed under the Commons Clause Apache License 2.0. For more details, see the Commons Clause website.
- For other licensing arrangements, contact the author.
-
- ## Author
- - Name: Sébastien Campion
- - Date: 2025-01-30
- - Version: 0.1
-
- ## Citation
- If you use this model in your work, please cite it as follows:
-
- ```bibtex
- @misc{piiranha2025,
-   author = {Sébastien Campion},
-   title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
-   year = {2025},
-   version = {0.1},
-   url = {https://huggingface.co/scampion/piiranha},
- }
- ```
-
- ## Disclaimer
- This model is provided "as-is" without any guarantees of performance or suitability for specific use cases.
- Always evaluate the model's performance in your specific context before deployment.
 
  ---
+ library_name: transformers
+ license: apache-2.0
+ base_model: answerdotai/ModernBERT-base
+ tags:
+ - generated_from_trainer
  metrics:
  - precision
+ - recall
  - f1
+ - accuracy
+ model-index:
+ - name: piiranha
+   results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # piiranha

+ This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on the ai4privacy/pii-masking-400k dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0229
+ - Precision: 0.9212
+ - Recall: 0.9272
+ - F1: 0.9242
+ - Accuracy: 0.9953

+ ## Model description

+ PII-RANHA is a token-classification model fine-tuned from ModernBERT-base to detect Personally Identifiable Information (PII). It predicts 18 labels: 17 PII categories plus "O" for non-PII tokens.

+ ## Intended uses & limitations

+ Intended for privacy-preserving applications such as data anonymization, redaction, and compliance with data protection regulations. The model is provided "as-is"; evaluate its performance in your own context before deployment.

+ ## Training and evaluation data

+ The model was trained and evaluated on the ai4privacy/pii-masking-400k dataset, which contains 400,000 examples with annotated PII tokens.
+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training (sketched below with the `Trainer` API):
+ - learning_rate: 5e-05
+ - train_batch_size: 32
+ - eval_batch_size: 32
+ - seed: 42
+ - optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-06; no additional optimizer arguments
+ - lr_scheduler_type: linear
+ - num_epochs: 4
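
+ Expressed through `TrainingArguments`, these settings correspond roughly to the sketch below (an assumed reconstruction, not the author's script; `model`, `tokenized_ds`, and `compute_metrics` are placeholders):

+ ```python
+ from transformers import Trainer, TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="piiranha",
+     learning_rate=5e-5,               # learning_rate: 5e-05
+     per_device_train_batch_size=32,   # train_batch_size: 32
+     per_device_eval_batch_size=32,    # eval_batch_size: 32
+     seed=42,                          # seed: 42
+     lr_scheduler_type="linear",       # lr_scheduler_type: linear
+     num_train_epochs=4,               # num_epochs: 4
+     adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-6,  # AdamW (torch) settings
+ )
+
+ trainer = Trainer(
+     model=model,                              # placeholder: the token-classification model
+     args=args,
+     train_dataset=tokenized_ds["train"],      # placeholder: tokenized training split
+     eval_dataset=tokenized_ds["validation"],  # placeholder: tokenized validation split
+     compute_metrics=compute_metrics,          # placeholder: metrics function
+ )
+ trainer.train()
+ ```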

+ ### Training results

+ | Training Loss | Epoch | Step  | Validation Loss | Precision | Recall | F1     | Accuracy |
+ |:-------------:|:-----:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:|
+ | 0.0171        | 1.0   | 9156  | 0.0179          | 0.8976    | 0.9056 | 0.9016 | 0.9935   |
+ | 0.0113        | 2.0   | 18312 | 0.0141          | 0.9155    | 0.9233 | 0.9194 | 0.9948   |
+ | 0.005         | 3.0   | 27468 | 0.0157          | 0.9194    | 0.9284 | 0.9239 | 0.9951   |
+ | 0.001         | 4.0   | 36624 | 0.0229          | 0.9212    | 0.9272 | 0.9242 | 0.9953   |

+ ### Framework versions

+ - Transformers 4.48.2
+ - Pytorch 2.5.1+cu124
+ - Datasets 3.2.0
+ - Tokenizers 0.21.0
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:adffbd923c19d63b803c6b092b68bd0df391a54e0aff31319a2a5ea3c7488ddc
+ oid sha256:ba6f5d418ac086df93d2eb9ca07057eb250dee0b532bba744028a45cad55a4b7
  size 598489008