---
license: cc-by-nc-4.0
datasets:
- ai4privacy/pii-masking-400k
language:
- en
- de
- fr
- it
- es
- nl
base_model:
- iiiorg/piiranha-v1-detect-personal-information
tags:
- NeuralWave
- Hackathon
---
## Overview

This model detects personal information in text using a smaller label set than its base model, `iiiorg/piiranha-v1-detect-personal-information`. Reducing the number of labels is intended to improve labeling precision across the supported languages: English, German, French, Italian, Spanish, and Dutch.

---

## Features

- **Improved Precision**: The label set is smaller than the base model's, which aims to make labeling more precise and the identification of sensitive information more reliable.

- **Model Versions**:
  - **Maximum Accuracy Focus**: Aims for the highest overall accuracy, for applications where minimizing total errors is crucial.
  - **Maximum Precision Focus**: Aims for the highest precision, for scenarios where false positives are particularly undesirable.
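
The difference between the two focuses can be illustrated with a small worked example. The counts below are made up for illustration and are not evaluation results for either version; `variant_a` and `variant_b` are hypothetical names.

```python
def precision(tp: int, fp: int) -> float:
    """Share of tokens flagged as personal information that really are."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all tokens that were labeled correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Made-up confusion counts over 1,000 tokens (illustration only).
variant_a = {"tp": 90, "tn": 880, "fp": 20, "fn": 10}
variant_b = {"tp": 70, "tn": 895, "fp": 5, "fn": 30}

for name, c in (("variant_a", variant_a), ("variant_b", variant_b)):
    print(name, "accuracy:", accuracy(**c), "precision:", precision(c["tp"], c["fp"]))
```

With these counts, `variant_a` makes fewer errors overall, while `variant_b` raises fewer false alarms at the cost of missing more true cases. That is the trade-off the two versions are meant to address.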

---

## Installation

To run this model, you will need to install the dependencies:

```bash
pip install torch transformers safetensors
```

---

## Usage


Load and run the model using PyTorch and transformers:

```python
import json

from safetensors.torch import load_file
from transformers import AutoConfig, AutoModelForTokenClassification, BertTokenizerFast

# Load the config from the model folder
config = AutoConfig.from_pretrained("folder_to_model")

# Initialize the model architecture from the config (weights are not loaded yet)
model = AutoModelForTokenClassification.from_config(config)

# Load the safetensors weights; load_file expects the path to a .safetensors file
state_dict = load_file("folder_to_tensors")

# Load the state dict into the model and switch to inference mode
model.load_state_dict(state_dict)
model.eval()

# Load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained("google-bert/bert-base-multilingual-cased")


class LabelMapper:
    """Minimal stand-in (assumption): the original LabelMapper class is not shown here."""


# Load the label mapper if needed
with open("pii_model/label_mapper.json", "r") as f:
    label_mapper_data = json.load(f)

label_mapper = LabelMapper()
label_mapper.label_to_id = label_mapper_data["label_to_id"]
# JSON object keys are strings, so convert the ids back to int
label_mapper.id_to_label = {int(k): v for k, v in label_mapper_data["id_to_label"].items()}
label_mapper.num_labels = label_mapper_data["num_labels"]

# Process outputs for analysis...
```
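
The output-processing step is left open above. One possible approach, sketched here under assumptions, is to mask every run of consecutive tokens that the model labels as personal information. The helper `mask_pii` is hypothetical and not part of this repository. It assumes you already have one label string per token, that the non-sensitive label is `"O"`, and that character offsets come from a fast tokenizer called with `return_offsets_mapping=True`. The `"PII"` label in the usage lines is a placeholder, not the model's actual label set.

```python
from typing import List, Tuple


def mask_pii(
    text: str,
    offsets: List[Tuple[int, int]],
    labels: List[str],
    placeholder: str = "[REDACTED]",
) -> str:
    """Replace each run of consecutive non-"O" tokens with a placeholder."""
    spans: List[List[int]] = []
    prev_sensitive = False
    for (start, end), label in zip(offsets, labels):
        if start == end:
            # Special tokens carry an empty (0, 0) offset; ignore them.
            continue
        sensitive = label != "O"  # assumption: "O" marks non-sensitive tokens
        if sensitive:
            if prev_sensitive and spans:
                spans[-1][1] = end  # extend the current run
            else:
                spans.append([start, end])  # start a new run
        prev_sensitive = sensitive
    # Replace from right to left so earlier offsets stay valid.
    for start, end in reversed(spans):
        text = text[:start] + placeholder + text[end:]
    return text


# Hand-written offsets and labels standing in for real model output.
example_text = "Contact Anna Meier today"
example_offsets = [(0, 0), (0, 7), (8, 12), (13, 18), (19, 24), (0, 0)]
example_labels = ["O", "O", "PII", "PII", "O", "O"]
print(mask_pii(example_text, example_offsets, example_labels))
```

Merging adjacent tokens into one span keeps multi-token names under a single placeholder. If you need per-category placeholders, the run would also have to break whenever the label changes.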

---

## Evaluation

- **Accuracy Model**: Tuned to minimize overall errors, that is, to maximize accuracy.
- **Precision Model**: Tuned to minimize false positives, for precision-driven applications.

---

## Disclaimer
The publisher of this repository is not affiliated with Ai4Privacy or Ai Suisse SA.

## Honorary Mention
This repository was created during the hackathon organized by [NeuralWave](https://neuralwave.ch/#/).