---
license: cc-by-nc-nd-4.0
language:
- az
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- personally
- identifiable
- information
- recognition
- ner
---

# PII NER Azerbaijani

**PII NER Azerbaijani** is a fine-tuned Named Entity Recognition (NER) model based on XLM-RoBERTa. It is trained on Azerbaijani PII data to classify personally identifiable information, such as names, dates of birth, cities, addresses, and phone numbers, in text.

## Model Details

- **Base Model:** XLM-RoBERTa
- **Training Metrics:**
  - **Epoch 1:** Training Loss: 0.156, Validation Loss: 0.1309, Precision: 0.7794, Recall: 0.7940, F1: 0.7866, Accuracy: 0.9590
  - **Epoch 2:** Training Loss: 0.1196, Validation Loss: 0.1172, Precision: 0.8042, Recall: 0.8078, F1: 0.8060, Accuracy: 0.9618
  - **Epoch 3:** Training Loss: 0.1069, Validation Loss: 0.1129, Precision: 0.8096, Recall: 0.8213, F1: 0.8154, Accuracy: 0.9639
- **Test Metrics:**
  - Loss: 0.11616, Precision: 0.80187, Recall: 0.80821, F1: 0.80503, Accuracy: 0.96264

## Entities (id2label)

```python
{
    0: "O",
    1: "VEHICLEVRM",
    2: "HEIGHT",
    3: "USERNAME",
    4: "FIRSTNAME",
    5: "BUILDINGNUMBER",
    6: "SEX",
    7: "PHONENUMBER",
    8: "CURRENCY",
    9: "CREDITCARDISSUER",
    10: "CURRENCYNAME",
    11: "MAC",
    12: "MIDDLENAME",
    13: "TIME",
    14: "EYECOLOR",
    15: "CURRENCYSYMBOL",
    16: "GENDER",
    17: "URL",
    18: "CURRENCYCODE",
    19: "ZIPCODE",
    20: "CREDITCARDCVV",
    21: "JOBTITLE",
    22: "PHONEIMEI",
    23: "COUNTY",
    24: "JOBTYPE",
    25: "LITECOINADDRESS",
    26: "COMPANYNAME",
    27: "ORDINALDIRECTION",
    28: "MASKEDNUMBER",
    29: "USERAGENT",
    30: "LASTNAME",
    31: "SSN",
    32: "STREET",
    33: "SECONDARYADDRESS",
    34: "STATE",
    35: "ETHEREUMADDRESS",
    36: "AMOUNT",
    37: "ACCOUNTNUMBER",
    38: "CITY",
    39: "CREDITCARDNUMBER",
    40: "BIC",
    41: "EMAIL",
    42: "NEARBYGPSCOORDINATE",
    43: "PIN",
    44: "ACCOUNTNAME",
    45: "VEHICLEVIN",
    46: "PREFIX",
    47: "JOBAREA",
    48: "AGE",
    49: "PASSWORD",
    50: "DOB",
    51: "BITCOINADDRESS",
    52: "IBAN",
    53: "IP",
    54: "DATE"
}
```

## Usage

To use the model for PII entity recognition:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "LocalDoc/private_ner_azerbaijani"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

test_text = (
    "Salam, mənim adım Əli Hüseynovdur. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, Nizami küçəsində, 25/31 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir."
)

# Tokenize and keep character offsets so predictions can be mapped back to the text.
inputs = tokenizer(test_text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
offset_mapping = offset_mapping[0].tolist()
predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
word_ids = inputs.word_ids(batch_index=0)

# Group subword tokens into words; each word takes the label of its first token.
aggregated = []
prev_word_id = None
for idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue  # special tokens such as <s> and </s>
    if word_id != prev_word_id:
        aggregated.append({
            "word_id": word_id,
            "tokens": [tokens[idx]],
            "offsets": [offset_mapping[idx]],
            "label": predicted_labels[idx]
        })
    else:
        aggregated[-1]["tokens"].append(tokens[idx])
        aggregated[-1]["offsets"].append(offset_mapping[idx])
    prev_word_id = word_id

# Merge consecutive words that share the same label into a single entity span.
entities = []
current_entity = None
for word in aggregated:
    if word["label"] == "O":
        if current_entity is not None:
            entities.append(current_entity)
            current_entity = None
    else:
        if current_entity is None:
            current_entity = {
                "type": word["label"],
                "start": word["offsets"][0][0],
                "end": word["offsets"][-1][1]
            }
        else:
            if word["label"] == current_entity["type"]:
                current_entity["end"] = word["offsets"][-1][1]
            else:
                entities.append(current_entity)
                current_entity = {
                    "type": word["label"],
                    "start": word["offsets"][0][0],
                    "end": word["offsets"][-1][1]
                }
if current_entity is not None:
    entities.append(current_entity)

# Attach the matching substring of the original text to each entity.
for entity in entities:
    entity["text"] = test_text[entity["start"]:entity["end"]]

for entity in entities:
    print(entity)
```

```python
{'type': 'FIRSTNAME', 'start': 18, 'end': 21, 'text': 'Əli'}
{'type': 'LASTNAME', 'start': 22, 'end': 34, 'text': 'Hüseynovdur.'}
{'type': 'DOB', 'start': 49, 'end': 64, 'text': '15.05.1990-dır.'}
{'type': 'STREET', 'start': 81, 'end': 87, 'text': 'Nizami'}
{'type': 'BUILDINGNUMBER', 'start': 99, 'end': 104, 'text': '25/31'}
{'type': 'PHONENUMBER', 'start': 141, 'end': 159, 'text': '+994552345678-dir.'}
```

## License

This model is licensed under the CC BY-NC-ND 4.0 license.

What does this license allow?

- **Attribution:** You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- **Non-Commercial:** You may not use the material for commercial purposes.
- **No Derivatives:** If you remix, transform, or build upon the material, you may not distribute the modified material.

For more information, please refer to the CC BY-NC-ND 4.0 license.

## Contact

For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].
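
## Example: Masking Detected Entities

The Usage example produces a list of entity dictionaries with `type`, `start`, and `end` keys. A common next step is to replace those spans with placeholders. The following is a minimal sketch of that idea, not part of the model or the `transformers` API: `mask_entities` and its `template` parameter are hypothetical names introduced here, and the sample entities copy the first two spans from the output shown in the Usage section.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict], template: str = "[{type}]") -> str:
    """Replace each entity span in `text` with a placeholder such as [FIRSTNAME].

    Hypothetical helper: expects dicts with "type", "start", and "end" keys,
    in the same shape as the entities printed by the Usage example.
    """
    masked = text
    # Replace from right to left so earlier character offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = template.format(type=entity["type"])
        masked = masked[: entity["start"]] + placeholder + masked[entity["end"]:]
    return masked


# Offsets copied from the first two entities in the Usage output.
sample_text = "Salam, mənim adım Əli Hüseynovdur."
sample_entities = [
    {"type": "FIRSTNAME", "start": 18, "end": 21},
    {"type": "LASTNAME", "start": 22, "end": 34},
]
print(mask_entities(sample_text, sample_entities))
```

Because the predicted spans cover whole whitespace-delimited words, attached suffixes and trailing punctuation (as in `Hüseynovdur.`) are masked together with the name. This sketch also assumes the spans do not overlap.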