---
license: cc-by-nc-nd-4.0
language:
- az
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- personally
- identifiable
- information
- recognition
- ner
---

# PII NER Azerbaijani

**PII NER Azerbaijani** is a fine-tuned Named Entity Recognition (NER) model based on XLM-RoBERTa. It is trained on Azerbaijani PII data to classify personally identifiable information in text, such as names, dates of birth, cities, addresses, and phone numbers.

## Model Details

- **Base Model:** XLM-RoBERTa
- **Training Metrics:**
  - **Epoch 1:** Training Loss: 0.156, Validation Loss: 0.1309, Precision: 0.7794, Recall: 0.7940, F1: 0.7866, Accuracy: 0.9590
  - **Epoch 2:** Training Loss: 0.1196, Validation Loss: 0.1172, Precision: 0.8042, Recall: 0.8078, F1: 0.8060, Accuracy: 0.9618
  - **Epoch 3:** Training Loss: 0.1069, Validation Loss: 0.1129, Precision: 0.8096, Recall: 0.8213, F1: 0.8154, Accuracy: 0.9639
- **Test Metrics:**
  - Loss: 0.11616, Precision: 0.80187, Recall: 0.80821, F1: 0.80503, Accuracy: 0.96264
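
As a quick consistency check, the reported test F1 can be recomputed from the reported precision and recall. This is a minimal sketch that assumes F1 here is the harmonic mean of those two aggregate values; the card does not state which averaging scheme was used, and `f1_from` is a hypothetical helper name.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set values copied from the metrics listed above.
test_precision = 0.80187
test_recall = 0.80821

test_f1 = f1_from(test_precision, test_recall)
print(f"{test_f1:.5f}")
```

Under that assumption the recomputed value agrees with the reported test F1 of 0.80503 to five decimal places.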

## Entities (id2label)

```python
{
    0: "O",
    1: "VEHICLEVRM",
    2: "HEIGHT",
    3: "USERNAME",
    4: "FIRSTNAME",
    5: "BUILDINGNUMBER",
    6: "SEX",
    7: "PHONENUMBER",
    8: "CURRENCY",
    9: "CREDITCARDISSUER",
    10: "CURRENCYNAME",
    11: "MAC",
    12: "MIDDLENAME",
    13: "TIME",
    14: "EYECOLOR",
    15: "CURRENCYSYMBOL",
    16: "GENDER",
    17: "URL",
    18: "CURRENCYCODE",
    19: "ZIPCODE",
    20: "CREDITCARDCVV",
    21: "JOBTITLE",
    22: "PHONEIMEI",
    23: "COUNTY",
    24: "JOBTYPE",
    25: "LITECOINADDRESS",
    26: "COMPANYNAME",
    27: "ORDINALDIRECTION",
    28: "MASKEDNUMBER",
    29: "USERAGENT",
    30: "LASTNAME",
    31: "SSN",
    32: "STREET",
    33: "SECONDARYADDRESS",
    34: "STATE",
    35: "ETHEREUMADDRESS",
    36: "AMOUNT",
    37: "ACCOUNTNUMBER",
    38: "CITY",
    39: "CREDITCARDNUMBER",
    40: "BIC",
    41: "EMAIL",
    42: "NEARBYGPSCOORDINATE",
    43: "PIN",
    44: "ACCOUNTNAME",
    45: "VEHICLEVIN",
    46: "PREFIX",
    47: "JOBAREA",
    48: "AGE",
    49: "PASSWORD",
    50: "DOB",
    51: "BITCOINADDRESS",
    52: "IBAN",
    53: "IP",
    54: "DATE"
}
```
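
The label set has no `B-`/`I-` prefixes, so each token is tagged directly with an entity type. When post-processing predictions it can be handy to have the reverse mapping as well. The sketch below uses only an excerpt of the table above for brevity; `label2id` and `pii_ids` are names chosen for illustration.

```python
# Excerpt of the id2label table above; the full mapping has 55 entries (0-54).
id2label = {0: "O", 4: "FIRSTNAME", 7: "PHONENUMBER", 30: "LASTNAME", 50: "DOB"}

# Reverse lookup: label name -> class id.
label2id = {label: idx for idx, label in id2label.items()}

# Every id except the "O" (outside) class marks some kind of PII.
pii_ids = sorted(idx for idx, label in id2label.items() if label != "O")

print(label2id["PHONENUMBER"])
print(pii_ids)
```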

## Usage

To use the model for PII entity recognition:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "LocalDoc/private_ner_azerbaijani"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

test_text = (
    "Salam, mənim adım Əli Hüseynovdur. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, Nizami küçəsində, 25/31 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir."
)

# Tokenize and keep character offsets so predictions can be mapped back to the text.
inputs = tokenizer(test_text, return_tensors="pt", return_offsets_mapping=True)

# offset_mapping is not a model input, so remove it before the forward pass.
offset_mapping = inputs.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**inputs)

# Most likely label id for every token.
predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
offset_mapping = offset_mapping[0].tolist()
predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
word_ids = inputs.word_ids(batch_index=0)

# Group sub-word tokens into words; a word takes the label of its first token.
aggregated = []
prev_word_id = None
for idx, word_id in enumerate(word_ids):
    if word_id is None:
        # Special tokens (<s>, </s>) do not belong to any word.
        continue
    if word_id != prev_word_id:
        aggregated.append({
            "word_id": word_id,
            "tokens": [tokens[idx]],
            "offsets": [offset_mapping[idx]],
            "label": predicted_labels[idx]
        })
    else:
        aggregated[-1]["tokens"].append(tokens[idx])
        aggregated[-1]["offsets"].append(offset_mapping[idx])
    prev_word_id = word_id

# Merge consecutive words that share a label into one entity span.
entities = []
current_entity = None
for word in aggregated:
    if word["label"] == "O":
        # A non-entity word closes any open span.
        if current_entity is not None:
            entities.append(current_entity)
            current_entity = None
    elif current_entity is not None and word["label"] == current_entity["type"]:
        # Same label as the open span: extend it to the end of this word.
        current_entity["end"] = word["offsets"][-1][1]
    else:
        # A different label: close the open span (if any) and start a new one.
        if current_entity is not None:
            entities.append(current_entity)
        current_entity = {
            "type": word["label"],
            "start": word["offsets"][0][0],
            "end": word["offsets"][-1][1]
        }
if current_entity is not None:
    entities.append(current_entity)

# Attach the matched text to each span and print the result.
for entity in entities:
    entity["text"] = test_text[entity["start"]:entity["end"]]
    print(entity)
```

Output:

```python
{'type': 'FIRSTNAME', 'start': 18, 'end': 21, 'text': 'Əli'}
{'type': 'LASTNAME', 'start': 22, 'end': 34, 'text': 'Hüseynovdur.'}
{'type': 'DOB', 'start': 49, 'end': 64, 'text': '15.05.1990-dır.'}
{'type': 'STREET', 'start': 81, 'end': 87, 'text': 'Nizami'}
{'type': 'BUILDINGNUMBER', 'start': 99, 'end': 104, 'text': '25/31'}
{'type': 'PHONENUMBER', 'start': 141, 'end': 159, 'text': '+994552345678-dir.'}
```
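
A common next step is to mask the detected spans. The following is a minimal sketch, not part of the model's own tooling: `redact` is a hypothetical helper that assumes entity dictionaries with the `type`, `start`, and `end` keys shown in the output above. Because the spans cover whole whitespace-delimited words, attached suffixes and punctuation (for example the trailing period in `Hüseynovdur.`) are masked too.

```python
def redact(text: str, entities: list) -> str:
    """Replace each entity span in `text` with a [TYPE] placeholder."""
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['type']}]" + text[ent["end"]:]
    return text


# First sentence of the example text, with the two spans reported for it.
sample = "Salam, mənim adım Əli Hüseynovdur."
found = [
    {"type": "FIRSTNAME", "start": 18, "end": 21},
    {"type": "LASTNAME", "start": 22, "end": 34},
]

print(redact(sample, found))
```

Replacing from right to left is a deliberate choice: placeholders rarely have the same length as the text they replace, so editing left to right would shift every later offset.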

## License

This model is licensed under the CC BY-NC-ND 4.0 license. The license sets the following conditions:

- **Attribution:** You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- **Non-Commercial:** You may not use the material for commercial purposes.
- **No Derivatives:** If you remix, transform, or build upon the material, you may not distribute the modified material.

For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by-nc-nd/4.0/">CC BY-NC-ND 4.0 license</a>.

## Contact

For more information, questions, or issues, please contact LocalDoc at [[email protected]].