scampion committed on
Commit 4437a17 · verified · 1 Parent(s): dad13e2

End of training

Files changed (2):
  1. README.md +48 -116
  2. model.safetensors +1 -1
README.md CHANGED
@@ -1,138 +1,70 @@
  ---
- datasets:
- - ai4privacy/pii-masking-400k
  metrics:
- - accuracy
- - recall
  - precision
  - f1
- base_model:
- - answerdotai/ModernBERT-base
- pipeline_tag: token-classification
- tags:
- - pii
- - privacy
- - personal
- - identification
  ---
- # 🐟 PII-RANHA: Privacy-Preserving Token Classification Model
-
- ## Overview
- PII-RANHA is a fine-tuned token classification model based on **ModernBERT-base** from Answer.AI. It identifies and classifies Personally Identifiable Information (PII) in text data. The model is trained on the `ai4privacy/pii-masking-400k` dataset and can detect 17 different PII categories, such as account numbers, credit card numbers, and email addresses.
-
- This model is intended for privacy-preserving applications such as data anonymization, redaction, and compliance with data protection regulations.
-
- ## Model Details
-
- ### Model Architecture
- - **Base Model**: `answerdotai/ModernBERT-base`
- - **Task**: Token Classification
- - **Number of Labels**: 18 (17 PII categories + "O" for non-PII tokens; the full inventory can be listed as sketched below)
-
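- The label inventory itself can be read from the checkpoint configuration. A minimal sketch, not from the original card, assuming the standard `id2label` mapping in the model config:
-
- ```python
- from transformers import AutoConfig
-
- # Enumerate the 18 labels (17 PII categories plus "O") from the model config.
- config = AutoConfig.from_pretrained("scampion/piiranha")
- print([config.id2label[i] for i in range(config.num_labels)])
- ```
-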
- ## Usage
-
- ### Installation
- To use the model, ensure you have the `transformers` and `datasets` libraries installed:
-
- ```bash
- pip install transformers datasets
- ```
-
- ### Inference Example
- Here’s how to load and use the model for PII detection:
-
- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
-
- # Load the model and tokenizer
- model_name = "scampion/piiranha"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForTokenClassification.from_pretrained(model_name)
-
- # Create a token classification pipeline
- pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)
-
- # Example input
- text = "My email is john.doe@example.com and my phone number is 555-123-4567."
-
- # Detect PII
- results = pii_pipeline(text)
- for entity in results:
-     print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")
- ```
-
- The pipeline returns per-token predictions (the `Ġ` prefix marks a word-initial subword); a sketch for merging fragments follows the output below:
-
- ```text
- Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445
- Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
- Entity: ., Label: I-USERNAME, Score: 0.5871
- Entity: do, Label: I-USERNAME, Score: 0.5350
- Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
- Entity: -, Label: I-SOCIALNUM, Score: 0.5948
- Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
- Entity: -, Label: I-SOCIALNUM, Score: 0.6151
- Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
- Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
- ```
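-
- Subword fragments like these can be merged into whole spans by the pipeline itself. A minimal sketch using the standard `transformers` `aggregation_strategy` option (not shown in the original card; aggregated results expose `entity_group` instead of `entity`):
-
- ```python
- from transformers import pipeline
-
- # Merge contiguous subword tokens into single entity spans.
- merged_pipeline = pipeline(
-     "token-classification",
-     model="scampion/piiranha",
-     aggregation_strategy="simple",
- )
-
- for entity in merged_pipeline("My email is john.doe@example.com."):
-     # Aggregated results carry 'entity_group' rather than per-token 'entity'.
-     print(f"Span: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.4f}")
- ```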
-
- ## Training Details
-
- ### Dataset
- The model was trained on the `ai4privacy/pii-masking-400k` dataset, which contains 400,000 examples of text with annotated PII tokens.
-
- ### Training Configuration
- - **Batch Size:** 32
- - **Learning Rate:** 5e-6
- - **Epochs:** 4
- - **Optimizer:** AdamW
- - **Weight Decay:** 0.01
- - **Scheduler:** Linear learning rate scheduler
-
- ### Evaluation Metrics
- The model was evaluated using the following metrics, computed as sketched below:
- - Precision
- - Recall
- - F1 Score
- - Accuracy
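-
- A hedged sketch of how such metrics are typically computed for token classification with the `evaluate`/`seqeval` libraries (assumed tooling, not the author's exact script; label -100 marks special tokens in the usual tokenization scheme):
-
- ```python
- import numpy as np
- import evaluate
- from transformers import AutoConfig
-
- seqeval = evaluate.load("seqeval")
- config = AutoConfig.from_pretrained("scampion/piiranha")
- label_list = [config.id2label[i] for i in range(config.num_labels)]  # the 18 labels
-
- def compute_metrics(eval_pred):
-     logits, labels = eval_pred
-     predictions = np.argmax(logits, axis=-1)
-     # Special tokens carry label -100 and are excluded from scoring.
-     true_predictions = [[label_list[p] for p, l in zip(pred, lab) if l != -100]
-                         for pred, lab in zip(predictions, labels)]
-     true_labels = [[label_list[l] for p, l in zip(pred, lab) if l != -100]
-                    for pred, lab in zip(predictions, labels)]
-     results = seqeval.compute(predictions=true_predictions, references=true_labels)
-     return {"precision": results["overall_precision"],
-             "recall": results["overall_recall"],
-             "f1": results["overall_f1"],
-             "accuracy": results["overall_accuracy"]}
- ```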
-
- | Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       | Accuracy |
- |-------|---------------|-----------------|-----------|----------|----------|----------|
- | 1     | 0.026000      | 0.026693        | 0.808574  | 0.845563 | 0.826655 | 0.990215 |
- | 2     | 0.019300      | 0.020881        | 0.849764  | 0.879042 | 0.864155 | 0.992203 |
- | 3     | 0.016100      | 0.019111        | 0.859251  | 0.882796 | 0.870865 | 0.992912 |
- | 4     | 0.012200      | 0.019017        | 0.860648  | 0.888844 | 0.874519 | 0.993073 |
 
108
- Would you like me to help analyze any trends in these metrics?
109
 
-
- ## License
- This model is licensed under the Commons Clause Apache License 2.0. For more details, see the Commons Clause website.
- For other licensing arrangements, contact the author.
-
- ## Author
- - Name: Sébastien Campion
- - Date: 2025-01-30
- - Version: 0.1
-
- ## Citation
- If you use this model in your work, please cite it as follows:
-
- ```bibtex
- @misc{piiranha2025,
-   author = {Sébastien Campion},
-   title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
-   year = {2025},
-   version = {0.1},
-   url = {https://huggingface.co/scampion/piiranha},
- }
- ```
-
- ## Disclaimer
- This model is provided "as-is" without any guarantees of performance or suitability for specific use cases.
- Always evaluate the model's performance in your specific context before deployment.
 
  ---
+ library_name: transformers
+ license: apache-2.0
+ base_model: answerdotai/ModernBERT-base
+ tags:
+ - generated_from_trainer
  metrics:
  - precision
+ - recall
  - f1
+ - accuracy
+ model-index:
+ - name: piiranha
+   results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # piiranha

+ This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on the ai4privacy/pii-masking-400k dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0229
+ - Precision: 0.9212
+ - Recall: 0.9272
+ - F1: 0.9242
+ - Accuracy: 0.9953

+ ## Model description

+ PII-RANHA is a token-classification model fine-tuned from ModernBERT-base to detect Personally Identifiable Information (PII). It predicts 18 labels: 17 PII categories plus "O" for non-PII tokens.

+ ## Intended uses & limitations

+ Intended for privacy-preserving applications such as data anonymization, redaction, and compliance with data protection regulations. The model is provided "as-is"; evaluate its performance in your own context before deployment.

+ ## Training and evaluation data

+ The model was trained and evaluated on the ai4privacy/pii-masking-400k dataset, which contains 400,000 examples with annotated PII tokens.
+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training (sketched below with the `Trainer` API):
+ - learning_rate: 5e-05
+ - train_batch_size: 32
+ - eval_batch_size: 32
+ - seed: 42
+ - optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-06; no additional optimizer arguments
+ - lr_scheduler_type: linear
+ - num_epochs: 4
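
+ Expressed through `TrainingArguments`, these settings correspond roughly to the sketch below (an assumed reconstruction, not the author's script; `model`, `tokenized_ds`, and `compute_metrics` are placeholders):

+ ```python
+ from transformers import Trainer, TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="piiranha",
+     learning_rate=5e-5,               # learning_rate: 5e-05
+     per_device_train_batch_size=32,   # train_batch_size: 32
+     per_device_eval_batch_size=32,    # eval_batch_size: 32
+     seed=42,                          # seed: 42
+     lr_scheduler_type="linear",       # lr_scheduler_type: linear
+     num_train_epochs=4,               # num_epochs: 4
+     adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-6,  # AdamW (torch) settings
+ )
+
+ trainer = Trainer(
+     model=model,                              # placeholder: the token-classification model
+     args=args,
+     train_dataset=tokenized_ds["train"],      # placeholder: tokenized training split
+     eval_dataset=tokenized_ds["validation"],  # placeholder: tokenized validation split
+     compute_metrics=compute_metrics,          # placeholder: metrics function
+ )
+ trainer.train()
+ ```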

+ ### Training results

+ | Training Loss | Epoch | Step  | Validation Loss | Precision | Recall | F1     | Accuracy |
+ |:-------------:|:-----:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:|
+ | 0.0171        | 1.0   | 9156  | 0.0179          | 0.8976    | 0.9056 | 0.9016 | 0.9935   |
+ | 0.0113        | 2.0   | 18312 | 0.0141          | 0.9155    | 0.9233 | 0.9194 | 0.9948   |
+ | 0.005         | 3.0   | 27468 | 0.0157          | 0.9194    | 0.9284 | 0.9239 | 0.9951   |
+ | 0.001         | 4.0   | 36624 | 0.0229          | 0.9212    | 0.9272 | 0.9242 | 0.9953   |

+ ### Framework versions

+ - Transformers 4.48.2
+ - Pytorch 2.5.1+cu124
+ - Datasets 3.2.0
+ - Tokenizers 0.21.0
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:adffbd923c19d63b803c6b092b68bd0df391a54e0aff31319a2a5ea3c7488ddc
+ oid sha256:ba6f5d418ac086df93d2eb9ca07057eb250dee0b532bba744028a45cad55a4b7
  size 598489008