NLLB-200 Based Kabardian POS Tagger
This is a fine-tuned version of panagoa/nllb-200-1.3b-kbd-v0.1 for Part-of-Speech (POS) tagging of the Kabardian (East Circassian) language. The model was trained on the Kabardian Part-of-Speech Tagging Dataset.
Model Description
The model uses NLLB-200's encoder-decoder architecture to perform sequence-to-sequence POS tagging: the input is a Kabardian sentence, and the output is the same sentence with a Universal POS tag attached to each word.
Input Format
Plain Kabardian text, for example:
Абы и Iуэхум сэ нэхъыбэ зэрыхэсщIыкIраи сызыукIыр
Output Format
Text with XML-style POS tags:
Абы<PRON> и<PRON> Iуэхум<NOUN> сэ<PRON> нэхъыбэ<ADV> зэрыхэсщIыкIраи<VERB> сызыукIыр<VERB>
Universal POS Tags Used
The model uses the Universal POS tagset (see the validation sketch after this list):
- ADJ: adjectives
- ADP: adpositions
- ADV: adverbs
- AUX: auxiliaries
- CCONJ: coordinating conjunctions
- DET: determiners
- INTJ: interjections
- NOUN: nouns
- NUM: numerals
- PART: particles
- PRON: pronouns
- PROPN: proper nouns
- PUNCT: punctuation
- SCONJ: subordinating conjunctions
- SYM: symbols
- VERB: verbs
- X: other
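For convenience when validating parsed output, the inventory can be kept as a Python constant. This is an illustrative sketch, not something shipped with the model:

# Convenience constant (not part of the released model): the 17 Universal
# POS tags the tagger is expected to emit.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def is_valid_tag(tag):
    """Return True if tag is one of the Universal POS tags above."""
    return tag in UPOS_TAGS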
Usage
import re

from transformers import AutoTokenizer, M2M100ForConditionalGeneration

# Load model and tokenizer
model_name = "panagoa/nllb-200-based-kbd-pos-tagger"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Function for POS tagging
def pos_tag_sentence(text, model, tokenizer, output_format='mixed', max_length=128):
    """
    Perform POS tagging on a Kabardian sentence.

    Args:
        text (str): Input sentence to tag
        model: The trained model
        tokenizer: The tokenizer
        output_format: One of 'mixed' (word<TAG>), 'words' (words only),
            'tags' (tags only), or 'pairs' (list of (word, tag) pairs)
        max_length (int): Maximum sequence length

    Returns:
        Based on output_format:
        - 'mixed': str with format "word<TAG> word<TAG>"
        - 'words': list of words
        - 'tags': list of tags
        - 'pairs': list of (word, tag) tuples
    """
    inputs = tokenizer(text,
                       return_tensors="pt",
                       max_length=max_length,
                       truncation=True)
    device = model.device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    outputs = model.generate(
        **inputs,
        num_beams=2,
        do_sample=False,
        repetition_penalty=1.2,
        no_repeat_ngram_size=2,
        early_stopping=True,
        max_new_tokens=128,
    )

    # Get the model prediction
    result = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # Parse the result
    tagged_tokens = result.strip().split()
    words = []
    tags = []
    for token in tagged_tokens:
        match = re.match(r'(.+)<([A-Z]+)>', token)
        if match:
            word, tag = match.groups()
            words.append(word)
            tags.append(tag)

    # Return the result in the requested format
    if output_format == 'mixed':
        return result
    elif output_format == 'words':
        return words
    elif output_format == 'tags':
        return tags
    elif output_format == 'pairs':
        return list(zip(words, tags))
    else:
        raise ValueError(f"Unknown output format: {output_format}")
# Example usage
print(pos_tag_sentence('Шкафым фалъэр дэкъутыхьащ.', model, tokenizer, output_format='mixed'))
# </s>kbd_Cyrl Шкафым<NOUN> фалъэр<NOUN> дэкъутыхьащ<VERB>.</s>

print(pos_tag_sentence('Шкафым фалъэр дэкъутыхьащ.', model, tokenizer, output_format='pairs'))
# [('Шкафым', 'NOUN'), ('фалъэр', 'NOUN'), ('дэкъутыхьащ', 'VERB')]
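When tagging many sentences, a single batched generate call is usually faster than calling pos_tag_sentence in a loop. The following is a minimal sketch that reuses the same generation settings; pos_tag_batch is an illustrative helper, not part of the released code:

def pos_tag_batch(sentences, model, tokenizer, max_length=128):
    # Tokenize the whole batch at once; padding aligns sentence lengths.
    inputs = tokenizer(sentences,
                       return_tensors="pt",
                       padding=True,
                       max_length=max_length,
                       truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(
        **inputs,
        num_beams=2,
        do_sample=False,
        repetition_penalty=1.2,
        no_repeat_ngram_size=2,
        early_stopping=True,
        max_new_tokens=128,
    )
    # Decode each sequence; per-sentence parsing can reuse the regex above.
    return tokenizer.batch_decode(outputs, skip_special_tokens=False)

tagged = pos_tag_batch(['Абы и Iуэхум сэ нэхъыбэ зэрыхэсщIыкIраи сызыукIыр',
                        'Шкафым фалъэр дэкъутыхьащ.'], model, tokenizer)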
Training
The model was trained on the Kabardian Part-of-Speech Tagging Dataset, which contains 82,925 annotated sentences. The training used the following configuration (a sketch of a comparable setup follows the list):
- Learning rate: 2e-5
- Batch size: 16
- Training epochs: 3
- Max sequence length: 256
- Optimizer: AdamW with weight decay 0.01
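The training script itself is not published in this card. Below is a minimal sketch of a comparable fine-tuning setup with the listed hyperparameters, using the Hugging Face Seq2SeqTrainer. The dataset ID ("panagoa/kbd-pos-dataset") and the "text"/"tagged" column names are assumptions, not the actual dataset schema:

from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
                          M2M100ForConditionalGeneration,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Hypothetical dataset ID and column names; substitute the actual
# Kabardian Part-of-Speech Tagging Dataset and its schema.
dataset = load_dataset("panagoa/kbd-pos-dataset")

base_model = "panagoa/nllb-200-1.3b-kbd-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = M2M100ForConditionalGeneration.from_pretrained(base_model)

def preprocess(batch):
    # Source: the plain sentence; target: the same sentence with word<TAG> marks.
    model_inputs = tokenizer(batch["text"], max_length=256, truncation=True)
    labels = tokenizer(text_target=batch["tagged"], max_length=256,
                       truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-200-kbd-pos-tagger",
    learning_rate=2e-5,               # as listed above
    per_device_train_batch_size=16,   # batch size 16
    num_train_epochs=3,               # 3 epochs
    weight_decay=0.01,                # AdamW is the Trainer's default optimizer
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()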
Evaluation
The model was evaluated on a held-out test set with the following metrics (a computation sketch follows the list):
- Tag accuracy: How often individual POS tags are correct
- Sentence accuracy: How often entire sentences are correctly tagged
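The exact evaluation code is not included here; both metrics can be computed from aligned gold and predicted tag sequences as in this minimal sketch (tag_accuracy and sentence_accuracy are illustrative helpers, not part of the released code):

def tag_accuracy(gold_tags, pred_tags):
    """Fraction of individual tags that match, over aligned sentences."""
    correct = total = 0
    for gold, pred in zip(gold_tags, pred_tags):
        for g, p in zip(gold, pred):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0

def sentence_accuracy(gold_tags, pred_tags):
    """Fraction of sentences whose full tag sequence matches exactly."""
    matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return matches / len(gold_tags) if gold_tags else 0.0

# Example: two sentences, one tagged perfectly, one with a single error.
gold = [["NOUN", "NOUN", "VERB"], ["PRON", "VERB"]]
pred = [["NOUN", "NOUN", "VERB"], ["PRON", "NOUN"]]
print(tag_accuracy(gold, pred))       # 4/5 = 0.8
print(sentence_accuracy(gold, pred))  # 1/2 = 0.5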
Limitations
- The model's performance may vary for:
  - Very long sentences
  - Dialectal variations
  - Non-standard orthography
  - Specialized or technical vocabulary
- The training data was created using a zero-shot approach, which may introduce some biases
Intended Use
This model is intended for:
- POS tagging of Kabardian text
- Linguistic research on Kabardian
- Development of NLP tools for Kabardian
- Comparative linguistic studies
Citation
If you use this model, please cite both the model and the dataset:
@misc{kbd-pos-tagger,
author = {Panagoa},
title = {NLLB-200 Based Kabardian POS Tagger},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/panagoa/nllb-200-based-kbd-pos-tagger}}
}
License
This model inherits the license of the base NLLB-200 model and the dataset used for fine-tuning.