eolang committed 30fd8dd (parent: a57af5f)

Create README.md

Files changed (1): README.md added (+63, -0)

---
datasets:
- xnli
language:
- sw
library_name: transformers
examples: null
widget:
- text: Uhuru Kenyatta ni rais wa [MASK].
  example_title: Sentence_1
- text: Tumefanya mabadiliko muhimu [MASK] sera zetu za faragha na vidakuzi
  example_title: Sentence_2
---

# SW

* Pre-trained model on the Swahili language using a masked language modeling (MLM) objective.

## Model description

This is a transformers model pre-trained on a large corpus of Swahili data in a self-supervised fashion. This means it was pre-trained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was pre-trained with one objective:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence (a rough sketch of the masking step follows below).
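
As a rough illustration of this masking step only (the actual pre-training operates on sub-word tokens and uses additional corruption rules), here is a minimal sketch that masks about 15% of the words in a sentence:

```python
import random

def mask_words(sentence: str, mask_token: str = "[MASK]", ratio: float = 0.15) -> str:
    """Replace roughly `ratio` of the words in `sentence` with the mask token."""
    words = sentence.split()
    n_masked = max(1, int(len(words) * ratio))
    for i in random.sample(range(len(words)), n_masked):
        words[i] = mask_token
    return " ".join(words)

# Example with the Swahili sentence from the usage section below
print(mask_words("Hii ni tovuti ya idhaa ya Kiswahili ya BBC ambayo hukuletea habari na makala kutoka Afrika na kote duniani kwa lugha ya Kiswahili."))
```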

This way, the model learns an inner representation of the Swahili language that can then be used to extract features useful for downstream tasks (see the sketch after this list), e.g.:
* Named Entity Recognition (Token Classification)
* Text Classification
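
A minimal feature-extraction sketch, assuming the checkpoint name from the usage example below (`AutoModel` loads the bare encoder, so its outputs are hidden-state features rather than masked-word predictions):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the bare encoder (no masked-language-modeling head)
tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
encoder = AutoModel.from_pretrained("eolang/SW-v1")

# Encode a Swahili sentence and keep the last hidden states as features
inputs = tokenizer("Hii ni tovuti ya idhaa ya Kiswahili ya BBC.", return_tensors="pt")
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)
```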

The model is based on the original BERT uncased model, which is described in the [google-research/bert readme](https://github.com/google-research/bert/blob/master/README.md).

## Intended uses & limitations

You can use the raw model for masked language modeling, but it's primarily intended to be fine-tuned on a downstream task.
Check out the variant TUS_NER-sw, a fine-tuned version of TUS meant for Named Entity Recognition.

### How to use

You can load the model and tokenizer with the `transformers` Auto classes and run them directly on Swahili text:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the pre-trained Swahili tokenizer and masked-language model
tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModelForMaskedLM.from_pretrained("eolang/SW-v1")

# Tokenize a Swahili sentence and run a forward pass
text = "Hii ni tovuti ya idhaa ya Kiswahili ya BBC ambayo hukuletea habari na makala kutoka Afrika na kote duniani kwa lugha ya Kiswahili."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
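
You can also query the masked-language-modeling head through the `fill-mask` pipeline. A minimal sketch, using one of the widget sentences from the metadata above (it assumes `[MASK]` is the checkpoint's mask token, as in those examples):

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of the same checkpoint
fill_mask = pipeline("fill-mask", model="eolang/SW-v1")

# Rank candidate completions for the masked word
for prediction in fill_mask("Uhuru Kenyatta ni rais wa [MASK]."):
    print(prediction["token_str"], prediction["score"])
```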

### Limitations and Bias

Even if the training data used for this model could be considered reasonably neutral, the model can still make biased predictions. This is something we are still working on improving.