roberta-base-amharic

This model has the same architecture as xlm-roberta-base and was pretrained from scratch using the Amharic subsets of the oscar, mc4, and amharic-sentences-corpus datasets, totaling 290 million tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 32k.
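
The exact tokenizer-training setup is not documented here, but as a rough sketch, a 32k-vocabulary byte-level BPE tokenizer can be trained on a plain-text corpus with the tokenizers library roughly as follows (the corpus path, special tokens, and minimum frequency are assumptions for illustration, not the actual configuration used for this model):

from tokenizers import ByteLevelBPETokenizer

# Hypothetical file containing the Amharic pretraining text, one document per line
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["amharic_corpus.txt"],
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("amharic-tokenizer")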

The model was trained for 22 hours on an A100 40GB GPU.

It achieves the following results on the evaluation set:

  • Loss: 2.09
  • Perplexity: 8.08
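
Perplexity here is the exponential of the evaluation loss, so the two values above agree up to rounding:

>>> import math
>>> round(math.exp(2.09), 2)
8.08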

This model has 110 million parameters and is currently the best-performing Amharic encoder model, outperforming the 2.5x larger, 279 million-parameter multilingual xlm-roberta-base model on Amharic sentiment classification and named entity recognition tasks.

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/roberta-base-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።")

[{'score': 0.40162667632102966,
  'token': 137,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል።'},
 {'score': 0.24096301198005676,
  'token': 346,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል።'},
 {'score': 0.15971705317497253,
  'token': 217,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል።'},
 {'score': 0.13074122369289398,
  'token': 733,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል።'},
 {'score': 0.03847867250442505,
  'token': 194,
  'token_str': 'ዘመን',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዘመን ተቆጥሯል።'}]
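
If you prefer to work with the tokenizer and model objects directly, for example to inspect the mask predictions yourself or to extract hidden states, here is a minimal sketch using the standard transformers Auto classes (the example sentence is the same one used above):

>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained('rasyosef/roberta-base-amharic')
>>> model = AutoModelForMaskedLM.from_pretrained('rasyosef/roberta-base-amharic')
>>> inputs = tokenizer("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> # position of the <mask> token and the highest-scoring prediction for it
>>> mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
>>> predicted_id = logits[0, mask_index].argmax().item()
>>> tokenizer.decode([predicted_id]).strip()  # top prediction: 'ዓመት', as in the pipeline output above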

Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks: sentiment classification and named entity recognition. A minimal finetuning sketch is given after the results table below.

Finetuned Model Performance

The reported F1 scores are macro averages.

| Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
|---|---|---|---|---|
| roberta-base-amharic | 110M | 8.08 | 0.88 | 0.78 |
| roberta-medium-amharic | 42.2M | 11.59 | 0.84 | 0.75 |
| bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
| bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
| bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
| bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
| xlm-roberta-base | 279M | | 0.83 | 0.73 |
| afro-xlmr-base | 278M | | 0.83 | 0.75 |
| afro-xlmr-large | 560M | | 0.86 | 0.76 |
| am-roberta | 443M | | 0.82 | 0.69 |
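
As a rough illustration of how such a finetuning run can be set up for the sentiment classification task, here is a minimal sketch using the transformers Trainer API together with the macro-averaged F1 metric reported above; the dataset name, label count, splits, and hyperparameters are placeholders, not the exact configuration behind the numbers in the table:

import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Hypothetical dataset with "text" and "label" columns; replace with the actual sentiment dataset.
dataset = load_dataset("your-amharic-sentiment-dataset")

tokenizer = AutoTokenizer.from_pretrained("rasyosef/roberta-base-amharic")
model = AutoModelForSequenceClassification.from_pretrained(
    "rasyosef/roberta-base-amharic", num_labels=3  # e.g. negative / neutral / positive
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Macro-averaged F1, as reported in the table above
    return {"f1": f1_score(labels, predictions, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="roberta-base-amharic-sentiment",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()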