Commit da56a50 (verified) by mvansegbroeck
1 Parent(s): 6484293

Create README.md

  1. README.md +164 -0
README.md ADDED
@@ -0,0 +1,164 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ library_name: gliner
+ datasets:
+ - gretelai/gretel-pii-masking-en-v1
+ pipeline_tag: token-classification
+ tags:
+ - PII
+ - PHI
+ - GLiNER
+ - information extraction
+ - encoder
+ - entity recognition
+ - privacy
+ ---
+
+ # Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection
+ This **Gretel GLiNER** model is a fine-tuned version of the GLiNER base model `knowledgator/gliner-bi-small-v1.0`, trained specifically to detect Personally Identifiable Information (PII) and Protected Health Information (PHI).
+ Gretel GLiNER supports privacy-compliant entity recognition across industries and document types.
+ For more information about the base GLiNER model, including its architecture and general capabilities, see the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-small-v1.0).
+
+ The model was fine-tuned on the `gretelai/gretel-pii-masking-en-v1` dataset, a diverse collection of synthetic document snippets containing PII and PHI entities.
+
+ 1. **Training:** Used the training split of the synthetic dataset.
+ 2. **Validation:** Monitored performance on the validation split to adjust training parameters.
+ 3. **Evaluation:** Assessed final performance on the test split, using the PII/PHI entity annotations as ground truth.
+
+ For detailed statistics on the dataset, including domain and entity type distributions, visit the [dataset documentation on Hugging Face](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1).
+
+ ## Model Performance
+
+ The fine-tuned Gretel GLiNER models are reported to improve substantially on their base counterparts in accuracy, precision, recall, and F1 score. The table lists the results for the three fine-tuned model sizes:
+
+ | Model | Accuracy | Precision | Recall | F1 Score |
+ |---------------------------------------|----------|-----------|--------|----------|
+ | gretelai/gretel-gliner-bi-small-v1.0 | 0.89 | 0.98 | 0.91 | 0.94 |
+ | gretelai/gretel-gliner-bi-base-v1.0 | 0.91 | 0.98 | 0.92 | 0.95 |
+ | gretelai/gretel-gliner-bi-large-v1.0 | 0.91 | 0.99 | 0.93 | 0.95 |
+
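As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values; `f1_score` is an illustrative helper written for this example, not part of the `gliner` package. Small differences from the reported F1 column are expected because the inputs are already rounded to two decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the performance table.
reported = {
    "gretelai/gretel-gliner-bi-small-v1.0": (0.98, 0.91),
    "gretelai/gretel-gliner-bi-base-v1.0": (0.98, 0.92),
    "gretelai/gretel-gliner-bi-large-v1.0": (0.99, 0.93),
}

# Recompute F1 for each model from its rounded precision and recall.
recomputed = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, f1 in recomputed.items():
    print(f"{name}: recomputed F1 = {f1:.3f}")
```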
+ ## Installation & Usage
+
+ Ensure you have Python installed. Then install or update the `gliner` package:
+
+ ```bash
+ pip install gliner -U
+ ```
+
+ Load the fine-tuned Gretel GLiNER model with the `GLiNER` class and its `from_pretrained` method. The example below uses the `gretelai/gretel-gliner-bi-small-v1.0` model for PII/PHI detection:
+
+ ```python
+ from gliner import GLiNER
+
+ # Load the fine-tuned GLiNER model
+ model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-small-v1.0")
+
+ # Sample text containing PII/PHI entities
+ text = """
+ Purchase Order
+ ----------------
+ Date: 10/05/2023
+ ----------------
+ Customer Name: CID-982305
+ Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
+ Phone: (312) 555-7890 (555-876-5432)
+ """
+
+ # Define the labels for PII/PHI entities
+ labels = [
+     "medical_record_number",
+     "date_of_birth",
+     "ssn",
+     "date",
+     "first_name",
+     "email",
+     "last_name",
+     "customer_id",
+     "employee_id",
+     "name",
+     "street_address",
+     "phone_number",
+     "ipv4",
+     "credit_card_number",
+     "license_plate",
+     "address",
+     "user_name",
+     "device_identifier",
+     "bank_routing_number",
+     "date_time",
+     "company_name",
+     "unique_identifier",
+     "biometric_identifier",
+     "account_number",
+     "city",
+     "certificate_license_number",
+     "time",
+     "postcode",
+     "vehicle_identifier",
+     "coordinate",
+     "country",
+     "api_key",
+     "ipv6",
+     "password",
+     "health_plan_beneficiary_number",
+     "national_id",
+     "tax_id",
+     "url",
+     "state",
+     "swift_bic",
+     "cvv",
+     "pin"
+ ]
+
+ # Predict entities with a confidence threshold of 0.7
+ entities = model.predict_entities(text, labels, threshold=0.7)
+
+ # Display the detected entities
+ for entity in entities:
+     print(f"{entity['text']} => {entity['label']}")
+ ```
+
+ Expected output:
+
+ ```
+ CID-982305 => customer_id
+ 1234 Oak Street, Suite 400 => street_address
+ Springfield => city
+ IL => state
+ 62704 => postcode
+ (312) 555-7890 => phone_number
+ 555-876-5432 => phone_number
+ [email protected] => email
+ ```
+
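Beyond `text` and `label`, each entity dictionary returned by `predict_entities` is assumed here to carry a `score` field, as in the `gliner` package's usual output; verify this against your installed version. Under that assumption, the following sketch applies a stricter score cutoff after prediction and groups the surviving entities by label. The helper `group_by_label` and the sample scores are hypothetical and written only for this illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[Dict], min_score: float = 0.0) -> Dict[str, List[str]]:
    """Group entity texts by label, keeping only entities scoring at least min_score."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        # Entities without a score are treated as fully confident (an assumption).
        if entity.get("score", 1.0) >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written stand-ins for model output; the scores are made up.
sample_entities = [
    {"text": "CID-982305", "label": "customer_id", "score": 0.95},
    {"text": "(312) 555-7890", "label": "phone_number", "score": 0.91},
    {"text": "555-876-5432", "label": "phone_number", "score": 0.74},
    {"text": "Springfield", "label": "city", "score": 0.62},
]

grouped = group_by_label(sample_entities, min_score=0.7)
for label, texts in grouped.items():
    print(f"{label}: {texts}")
```

Filtering after prediction lets you run the model once with a permissive `threshold` and then try stricter cutoffs without re-running inference.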
+ ## Use Cases
+
+ Gretel GLiNER is ideal for applications requiring detection and redaction of sensitive information:
+
+ - Healthcare: Automating the extraction and redaction of patient information from medical records.
+ - Finance: Identifying and securing financial data such as account numbers and transaction details.
+ - Cybersecurity: Detecting sensitive information in logs and security reports.
+ - Legal: Processing contracts and legal documents to protect client information.
+ - Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
+
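The redaction step mentioned in the use cases above can be sketched as follows. This is a minimal illustration, not Gretel's implementation: it assumes each detected entity is a dictionary with character offsets `start` and `end` plus a `label`, which is the shape the `gliner` package typically returns. The helpers `redact` and `mock_entity` are hypothetical names introduced for this example, and the sample entities are built by hand so the snippet runs without downloading a model.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each entity span with a [LABEL] placeholder.

    Spans are processed right to left so that earlier offsets stay valid.
    A span that overlaps one already replaced is skipped.
    """
    redacted = text
    cutoff = len(text)  # start offset of the most recently replaced span
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        if entity["end"] > cutoff:
            continue  # overlaps a span that was already replaced
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[:entity["start"]] + placeholder + redacted[entity["end"]:]
        cutoff = entity["start"]
    return redacted


def mock_entity(text: str, value: str, label: str) -> Dict:
    """Build a stand-in for one model prediction by locating value in text."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "text": value, "label": label}


sample = "Customer Name: CID-982305, Phone: (312) 555-7890"
entities = [
    mock_entity(sample, "CID-982305", "customer_id"),
    mock_entity(sample, "(312) 555-7890", "phone_number"),
]

print(redact(sample, entities))
# prints: Customer Name: [CUSTOMER_ID], Phone: [PHONE_NUMBER]
```

In practice, `entities` would come from `model.predict_entities(...)`. Replacing spans with the label name rather than deleting them keeps the redacted text readable for downstream review.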
+ ## Citation and Usage
+
+ If you use this model or the accompanying dataset in your research or applications, please cite it as:
+
+ ```bibtex
+ @dataset{gretel-pii-masking-en-v1,
+   author = {Gretel AI},
+   title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
+   year = {2024},
+   month = {10},
+   publisher = {Gretel},
+   howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
+ }
+ ```
+
+ For questions, issues, or additional information, please visit our [Synthetic Data Discord](https://gretel.ai/discord) community or reach out to [gretel.ai](https://gretel.ai/).