---
library_name: transformers
language:
- wo
- en
license: apache-2.0
---

# Oolel: Pioneering Open Source Wolof Language Model

## 1. Model Description

**Oolel** is Africa's first open-source language model designed specifically for **Wolof**, one of West Africa's major languages. Created by **Soyno**, it brings cutting-edge language technology to Wolof speakers, making powerful AI tools accessible to everyone.

Built on the **Qwen 2.5** architecture, Oolel combines state-of-the-art AI advances with Wolof linguistic expertise. This pioneering work stands as a testament to Soyno's commitment to developing AI solutions by Africans, for Africa, marking a significant step toward the continent's technological sovereignty.

- **Developed by:** Soynade Research (Soyno)
- **Supported Languages:** Wolof, English, French
- **Status:** This is a static model trained on an offline dataset. Future versions may be released to improve its capabilities.
- **Model Release Date:** Dec 04, 2024
- **License:** Apache 2.0
- **Finetuned from model:** Qwen 2.5

## 2. Key Features and Capabilities

Oolel demonstrates proficiency in:

- Bidirectional translation between English and Wolof
- Natural text generation in Wolof
- Code generation with Wolof instructions
- Standard LLM tasks, including:
  - Question answering
  - Summarization
  - Contextual understanding

## 3. Usage

The snippet below shows how to load the tokenizer and model, build a prompt with `apply_chat_template`, and generate content.

### 3.1 With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda"

model = AutoModelForCausalLM.from_pretrained(
    "Soyno/Oolel-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Soyno/Oolel-v0.1")

def generate_response(messages, max_new_tokens=1024, temperature=0.1):
    # Render the chat messages into a single prompt string
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    # do_sample=True is required for temperature to take effect;
    # without it, generate() falls back to greedy decoding
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
    )
    # Keep only the newly generated tokens, dropping the prompt
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

**Some example tasks:**

1. **Translation Tasks**

```python
messages = [
    {"role": "user", "content": "Translate to Wolof: Bassirou Diomaye Faye is the new Senegalese president. He is 44 years old"}
]
print(generate_response(messages))
```
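
Translation also works in the other direction with the same helper. A minimal sketch (the Wolof sentence is borrowed from the multi-turn example below; the exact instruction wording is an illustrative assumption):

```python
messages = [
    # Wolof sentence reused from the multi-turn conversation example further down
    {"role": "user", "content": "Translate to English: CEDEAO mooy 'organisation' gu boole reew yi nekk ci pennc Afrika bi"}
]
print(generate_response(messages))
```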

2. **Code generation**

```python
messages = [
    # Wolof, roughly: "Write a Python class showing how to use DataFrames in pandas"
    {"role": "user", "content": "Bindal ab klaas Python buy wone ni ñuy jëfandikoo dataframe yi ci Pandas"}
]
print(generate_response(messages))
```

3. **Problem Solving**

```python
from pprint import pprint

system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries"
messages = [
    {"role": "system", "content": system_prompt},
    # Wolof, roughly: "Can you show me how to solve this problem: Fatou bought 3 kg of rice,
    # 2 kg of oil and 5 kg of sugar. Rice costs 500 CFA per kilo, oil 1200 CFA per kilo,
    # sugar 750 CFA per kilo. How much does she have to pay?"
    {"role": "user", "content": "Ndax nga mën ma won ni ñuy resolver problème bii: Fatou dafa jënd 3 kilo ceeb, 2 kilo diw ak 5 kilo sukër. Ceeb gi wenn kilo 500 CFA la, diw gi 1200 CFA kilo bi, sukër gi 750 CFA kilo bi. Ñaata la wara fay?"}
]
pprint(generate_response(messages))
```

4. **Text Generation** (e.g. story generation)

```python
system_prompt = "You are a skilled Wolof storyteller (Gewël) with deep knowledge of African folktales and traditions. Write engaging stories in Wolof that reflect African cultural values and wisdom."

messages = [
    {"role": "system", "content": system_prompt},
    # Wolof, roughly: "Write a folktale about the lion that ate the cat"
    {"role": "user", "content": "Bindal ab léeb ci gaynde gi lekk muus mi"}
]
# A higher temperature encourages more creative output
print(generate_response(messages, temperature=0.9))
```

5. **Multi-turn conversations**

```python
messages = [
    # Wolof, roughly: "Tell me, what is CEDEAO (ECOWAS)? What does it do?"
    {"role": "user", "content": "Wax ma clan mooy CEDEAO ? Ci lan la liggeey?"},
    {"role": "assistant", "content": "CEDEAO mooy 'organisation' gu boole reew yi nekk ci pennc Afrika bi. Mu ngi sukkandiku ci wàll économie, politig, ak déggoo diggante reew yi"},
    # Wolof, roughly: "How many countries are members?"
    {"role": "user", "content": "ñaata reew ñoo ci bokk?"}
]
print(generate_response(messages))
```
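
6. **Summarization**

Summarization is listed among the key capabilities above; here is a minimal sketch using the same `generate_response` helper (the passage and the instruction wording are illustrative assumptions, not from the original card):

```python
# Illustrative passage; any English or Wolof text works here
passage = (
    "Bassirou Diomaye Faye is the new Senegalese president. He is 44 years old. "
    "He took office in 2024 after winning the presidential election."
)
messages = [
    {"role": "user", "content": f"Summarize the following text in Wolof: {passage}"}
]
print(generate_response(messages))
```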

### 3.2 With vLLM

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("Soyno/Oolel-v0.1-Instruct")

# Pass the decoding hyperparameters
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

llm = LLM(model="Soyno/Oolel-v0.1-Instruct")

prompt = "Kan mooy Youssou Ndour ?"  # Wolof, roughly: "Who is Youssou Ndour?"
messages = [
    {"role": "system", "content": "You are Oolel, created by Soyno. You are a helpful assistant. Please always provide detailed and useful answers to the user queries"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

outputs = llm.generate([text], sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}, Generated text: {generated_text}")
```
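
`llm.generate` also accepts a list of prompts and batches them internally, which is the idiomatic way to run bulk inference with vLLM. A minimal sketch reusing `tokenizer`, `llm`, and `sampling_params` from above (both questions are taken verbatim from the examples on this card):

```python
questions = ["Kan mooy Youssou Ndour ?", "Wax ma clan mooy CEDEAO ? Ci lan la liggeey?"]
# Apply the chat template to each question before handing the batch to vLLM
texts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in questions
]
# One output per input prompt, returned in order
for output in llm.generate(texts, sampling_params):
    print(output.outputs[0].text)
```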

## 4. Bias, Risks, and Limitations

While Oolel marks a significant milestone, we acknowledge its current limitations:

- As a first version, the model's performance is still evolving
- Training data diversity can be further expanded
- Domain-specific expertise can be enhanced

Future developments will focus on:

- Enriching training data with comprehensive African historical content
- Deeper integration of cultural contexts and nuances
- Improving performance across various linguistic tasks
- Strengthening the model's ability to handle complex cultural contexts

## 5. Authors

- **Yaya SY**: NLP Researcher (Efficient Continued Pretraining)
- **Dioula DOUCOURE**: Data & NLP Engineer