DioulaD committed c8e479a (verified; parent cd1832e): Create README.md. Files changed (1): README.md (+127, -0)
---
library_name: transformers
language:
- wo
- en
license: apache-2.0
pipeline_tag: text2text-generation
---

# Oolel: A High-Performing Open LLM for Wolof

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/62e335bbf15e7fce909fe5d4/liiZ1rAkiIgGpgN_jqwq6.mp4"></video>

Despite numerous open-source innovations in large language models, African languages have remained underrepresented.

**Soynade Research** is transforming this landscape with Oolel, the first open-source language model for Wolof.

Built on the **Qwen 2.5** architecture, Oolel combines state-of-the-art AI technology with deep Wolof linguistic expertise. Trained on carefully curated, high-quality data, Oolel is optimized for the following tasks:

- **RAG** supporting Wolof queries with English, French, or Wolof context
- **Bidirectional translation between English and Wolof**
- **Natural text generation in Wolof**
- **Math in Wolof**
- **And many other standard NLP tasks**:
  - Summarization
  - Text editing
  - etc.

## Usage

**Important: always include a system prompt.**

The following snippet shows how to load the tokenizer and model, build a prompt with `apply_chat_template`, and generate a response:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "soynade-research/Oolel-Small-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("soynade-research/Oolel-Small-v0.1")

def generate_response(messages, max_new_tokens=1024, temperature=0.1):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # Send inputs to the same device the model was placed on by device_map.
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
    )
    # Strip the prompt tokens, keeping only the newly generated part.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
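
For readers curious what `apply_chat_template` produces: Qwen 2.5-family tokenizers use a ChatML-style layout. The sketch below is a rough, hand-rolled approximation for illustration only; the authoritative template ships with the tokenizer, so always use `tokenizer.apply_chat_template` in real code.

```python
def chatml_render(messages, add_generation_prompt=True):
    """Approximate the ChatML-style layout used by Qwen-family chat templates."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # The model continues from here, writing the assistant turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = chatml_render([
    {"role": "system", "content": "You're a Wolof AI assistant."},
    {"role": "user", "content": "Nanga def?"},
])
print(prompt)
```

This is why the system prompt matters: it is rendered as the first turn of the prompt, and the model was trained to condition on it.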

**Some task examples:**

1. **Translation Tasks**

```python
system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Translate to Wolof: Bassirou Diomaye Faye is the new Senegalese president. He is 44 years old"}
]
print(generate_response(messages))
```

2. **Code generation**

```python
system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
messages = [
    {"role": "system", "content": system_prompt},
    # Wolof: "Write a Python class that shows how to use DataFrames in pandas"
    {"role": "user", "content": "Bindal ab klaas Python buy wone ni ñuy jëfandikoo dataframe yi ci Pandas"}
]
print(generate_response(messages))
```

3. **Problem Solving**

```python
system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
messages = [
    {"role": "system", "content": system_prompt},
    # Wolof: "Can you show me how to solve this problem: Fatou bought 3 kg of rice,
    # 2 kg of oil and 5 kg of sugar. Rice costs 500 CFA per kilo, oil 1200 CFA per
    # kilo and sugar 750 CFA per kilo. How much does she have to pay?"
    {"role": "user", "content": "Ndax nga mën ma won ni ñuy resolver problème bii: Fatou dafa jënd 3 kilo ceeb, 2 kilo diw ak 5 kilo sukër. Ceeb gi wenn kilo 500 CFA la, diw gi 1200 CFA kilo bi, sukër gi 750 CFA kilo bi. Ñaata la wara fay?"}
]
print(generate_response(messages))
```
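
For reference when checking the model's answer, the word problem above totals 3 kg of rice at 500 CFA/kg, 2 kg of oil at 1200 CFA/kg, and 5 kg of sugar at 750 CFA/kg:

```python
# Expected answer to the word problem above.
rice = 3 * 500    # 1500 CFA
oil = 2 * 1200    # 2400 CFA
sugar = 5 * 750   # 3750 CFA
total = rice + oil + sugar
print(total)  # 7650 CFA
```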

4. **Text Generation** (e.g. story generation)

```python
system_prompt = "You are a skilled Wolof storyteller (Gewël) with deep knowledge of African folktales and traditions. Write engaging stories in Wolof that reflect African cultural values and wisdom."
messages = [
    {"role": "system", "content": system_prompt},
    # Wolof: "Write a folktale about the lion that ate the cat"
    {"role": "user", "content": "Bindal ab léeb ci gaynde gi lekk muus mi"}
]
print(generate_response(messages, temperature=0.9))
```

5. **Multi-turn conversations**

Oolel is not optimized for multi-turn conversations, but you can try it!

```python
messages = [
    # Wolof: "Tell me, what is ECOWAS (CEDEAO)? What does it work on?"
    {"role": "user", "content": "Wax ma clan mooy CEDEAO ? Ci lan la liggeey?"},
    # Wolof: "ECOWAS is the organization uniting the countries of the region in
    # Africa. It works on the economy, politics, and agreements between countries."
    {"role": "assistant", "content": "CEDEAO mooy 'organisation' gu boole reew yi nekk ci pennc Afrika bi. Mu ngi sukkandiku ci wàll économie, politig, ak déggoo diggante reew yi"},
    # Wolof: "How many countries are members?"
    {"role": "user", "content": "ñaata reew ñoo ci bokk?"}
]
print(generate_response(messages))
```
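
If you do experiment with longer conversations, the usual pattern is to keep appending turns to the `messages` list. A minimal sketch of that loop follows; `chat_turn` and the `echo` stub are hypothetical helpers introduced here so the example runs standalone, and in practice you would pass `generate_response` from above as `generate_fn`.

```python
def chat_turn(history, user_message, generate_fn):
    """Append a user turn, produce a reply, and record it in the history."""
    history = history + [{"role": "user", "content": user_message}]
    reply = generate_fn(history)
    return history + [{"role": "assistant", "content": reply}], reply

# Hypothetical stub so the sketch runs without the model; swap in generate_response.
echo = lambda msgs: f"(reply to: {msgs[-1]['content']})"

history = [{"role": "system", "content": "You're a Wolof AI assistant."}]
history, reply = chat_turn(history, "Wax ma clan mooy CEDEAO?", echo)
history, reply = chat_turn(history, "ñaata reew ñoo ci bokk?", echo)
print(len(history))  # 5 messages: system + two user/assistant pairs
```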

## Authors

- [**Yaya SY**](https://x.com/seygalare): NLP Researcher (Efficient Continued Pretraining)
- [**Dioula DOUCOURE**](https://x.com/DioulaD): Data & NLP Engineer