---
datasets:
- xnli
language:
- sw
library_name: transformers
examples: null
widget:
- text: Uhuru Kenyatta ni rais wa [MASK].
  example_title: Sentence_1
- text: Tumefanya mabadiliko muhimu [MASK] sera zetu za faragha na vidakuzi
  example_title: Sentence_2
---

# SW

* Pre-trained model on the Swahili language using a masked language modeling (MLM) objective.

## Model description

This is a transformers model pre-trained on a large corpus of Swahili data in a self-supervised fashion. This means it was pre-trained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was pre-trained with one objective:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence (see the short example below).
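
As a quick illustration of what the MLM objective looks like at inference time, here is a minimal sketch using the `fill-mask` pipeline from transformers. It reuses the first widget sentence from the metadata above ("Uhuru Kenyatta ni rais wa [MASK]." = "Uhuru Kenyatta is the president of [MASK].") and assumes the `eolang/SW-v1` checkpoint shown in the usage section below:

```python
from transformers import pipeline

# Load the checkpoint into a fill-mask pipeline
# (assumes the eolang/SW-v1 checkpoint from the "How to use" section).
unmasker = pipeline("fill-mask", model="eolang/SW-v1")

# The model scores candidate tokens for the [MASK] position;
# each prediction carries the filled-in token and its probability.
for prediction in unmasker("Uhuru Kenyatta ni rais wa [MASK]."):
    print(f"{prediction['token_str']}: {prediction['score']:.3f}")
```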

This way, the model learns an inner representation of the Swahili language that can then be used to extract features useful for downstream tasks (see the sketch after this list), e.g.:
* Named Entity Recognition (Token Classification)
* Text Classification
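
As an illustration of the feature-extraction use case, here is a minimal sketch using the plain `AutoModel` class, which loads the encoder without the MLM head, so the output is hidden states rather than vocabulary logits (again assuming the `eolang/SW-v1` checkpoint from the usage section below):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumes the eolang/SW-v1 checkpoint shown in the "How to use" section.
tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModel.from_pretrained("eolang/SW-v1")

# "Habari ya leo" is a common Swahili greeting (roughly "how is today?").
inputs = tokenizer("Habari ya leo", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per input token: shape (batch, sequence_length, hidden_size).
# These are the features a downstream classifier would be trained on.
features = outputs.last_hidden_state
print(features.shape)
```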

The model is based on the original BERT uncased model, which is described in the [google-research/bert README](https://github.com/google-research/bert/blob/master/README.md).

## Intended uses & limitations

You can use the raw model for masked language modeling, but it's primarily intended to be fine-tuned on a downstream task.
Check out the variant TUS_NER-sw, a fine-tuned version of TUS meant for Named Entity Recognition.
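
Fine-tuning for a token-level task such as NER starts from this same checkpoint with a fresh classification head. A minimal sketch (the label count of 9 is a hypothetical placeholder, not from the original card):

```python
from transformers import AutoModelForTokenClassification

# Load the pre-trained encoder with a new, randomly initialized
# token-classification head; num_labels=9 is a hypothetical label set.
model = AutoModelForTokenClassification.from_pretrained("eolang/SW-v1", num_labels=9)
```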

### How to use

You can use this model directly for masked language modeling:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the model with its masked-language-modeling head.
tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModelForMaskedLM.from_pretrained("eolang/SW-v1")

# "This is the BBC Swahili service website, which brings you news and
# articles from Africa and around the world in the Swahili language."
text = "Hii ni tovuti ya idhaa ya Kiswahili ya BBC ambayo hukuletea habari na makala kutoka Afrika na kote duniani kwa lugha ya Kiswahili."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
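
The `output` above contains the raw logits of the MLM head. To turn them into an actual prediction, here is a minimal sketch that continues from the code above, reusing the first widget sentence from the metadata:

```python
import torch

# Put the tokenizer's mask token into the sentence.
masked_text = f"Uhuru Kenyatta ni rais wa {tokenizer.mask_token}."
inputs = tokenizer(masked_text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and take the highest-scoring token there.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```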

### Limitations and Bias

Even if the training data used for this model could be characterized as reasonably neutral, the model can still make biased predictions. This is something we are still working on improving.