soynade-research
/

Oolel-v0.1

+---
+library_name: transformers
+language:
+- wo
+- en
+license: apache-2.0
+---
+# Oolel: Pioneering Open Source Wolof Language Model
+## 1. Model Description
+**Oolel** is Africa's first open-source language model designed specifically for **Wolof**, one of West Africa's major languages. Created by **Soyno**, it makes modern language technology accessible to Wolof speakers.
+Built on the **Qwen 2.5** architecture, Oolel combines recent advances in language modeling with Wolof linguistic expertise. It reflects Soyno's commitment to developing AI solutions by Africans, for Africa, and marks a significant step toward the continent's technological sovereignty.
+- **Developed by: Soynade Research (Soyno)**
+- **Supported Languages:** Wolof, English, French
+- **Status:** This is a static model trained on an offline dataset. Future versions may improve the model's capabilities.
+- **Model Release Date:** Dec 04, 2024
+- **License:** Apache License 2.0
+- **Finetuned from model:** Qwen 2.5
+## 2. Key Features and Capabilities
+Oolel demonstrates proficiency in:
+- Bidirectional translation between English and Wolof
+- Natural text generation in Wolof
+- Code generation with Wolof instructions
+- Standard LLM tasks including:
+  - Question answering
+  - Summarization
+  - Contextual understanding
+## 3. Usage
+The following snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate text.
+### 3.1 With Transformers
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+# Load the model in bfloat16; device_map="auto" places the weights on the available device(s).
+model = AutoModelForCausalLM.from_pretrained(
+    "Soyno/Oolel-v0.1",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained("Soyno/Oolel-v0.1")
+
+def generate_response(messages, max_new_tokens=1024, temperature=0.1):
+    # Render the chat messages into one prompt string that ends with the assistant turn marker.
+    text = tokenizer.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True,
+    )
+    # Move the inputs to the device the model was placed on.
+    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+    # Pass input_ids and attention_mask; do_sample=True is needed for temperature to take effect.
+    generated_ids = model.generate(
+        **model_inputs,
+        max_new_tokens=max_new_tokens,
+        do_sample=True,
+        temperature=temperature,
+    )
+    # Keep only the newly generated tokens by dropping the prompt prefix.
+    generated_ids = [
+        output_ids[len(input_ids):]
+        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+    ]
+    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+```
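The slicing step at the end of `generate_response` is easy to misread. For decoder-only models, `generate` returns each prompt's tokens followed by the newly generated ones, so the prompt prefix has to be dropped before decoding. A minimal illustration with plain Python lists; the token ids are made up and stand in for real tensors:

```python
# Toy token ids (hypothetical values) standing in for real tensors.
prompt_ids = [[101, 7, 8, 9]]             # one prompt of 4 tokens
full_output = [[101, 7, 8, 9, 42, 43]]    # generate() output: prompt + 2 new tokens

# Same slicing as in generate_response: drop the first len(prompt) ids.
new_tokens = [
    output[len(prompt):]
    for prompt, output in zip(prompt_ids, full_output)
]
print(new_tokens)  # [[42, 43]]
```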
+**Example tasks:**
+1. **Translation Tasks**
+```python
+messages = [
+    {"role": "user", "content": "Translate to Wolof: Bassirou Diomaye Faye is the new Senegalese president. He is 44 years old"}
+]
+print(generate_response(messages))
+```
+2. **Code generation**
+```python
+messages = [
+    {"role": "user", "content": "Bindal ab klaas Python buy wone ni ñuy jëfandikoo dataframe yi ci Pandas"}
+]
+print(generate_response(messages))
+```
+3. **Problem Solving**
+```python
+system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries"
+messages = [
+    {"role": "system", "content": system_prompt},
+    {"role": "user", "content": "Ndax nga mën ma won ni ñuy resolver problème bii: Fatou dafa jënd 3 kilo ceeb, 2 kilo diw ak 5 kilo sukër. Ceeb gi wenn kilo 500 CFA la, diw gi 1200 CFA kilo bi, sukër gi 750 CFA kilo bi. Ñaata la wara fay?"}
+]
+from pprint import pprint
+pprint(generate_response(messages))
+```
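To sanity-check the model's answer to this prompt, the expected total can be computed directly. The sketch below assumes the prompt reads as 3 kg of rice at 500 CFA per kg, 2 kg of oil at 1200 CFA per kg, and 5 kg of sugar at 750 CFA per kg; the English item names are glosses of the Wolof words and are an assumption.

```python
# (quantity in kg, price per kg in CFA), as read from the prompt above.
basket = {
    "rice (ceeb)": (3, 500),
    "oil (diw)": (2, 1200),
    "sugar (sukër)": (5, 750),
}

# Total cost = sum of quantity * unit price over all items.
total = sum(qty * price for qty, price in basket.values())
print(total)  # 7650
```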
+4. **Text Generation** (e.g. story generation)
+```python
+system_prompt = "You are a skilled Wolof storyteller (Gewël) with deep knowledge of African folktales and traditions. Write engaging stories in Wolof that reflect African cultural values and wisdom."
+messages = [
+    {"role": "system", "content": system_prompt},
+    {"role": "user", "content": "Bindal ab léeb ci gaynde gi lekk muus mi"}
+]
+print(generate_response(messages, temperature=0.9))
+```
+5. **Multi-turn conversations**
+```python
+messages = [
+    {"role": "user", "content": "Wax ma lan mooy CEDEAO ? Ci lan la liggeey?"},
+    {"role": "assistant", "content": "CEDEAO mooy 'organisation' gu boole reew yi nekk ci pennc Afrika bi. Mu ngi sukkandiku ci wàll économie, politig, ak déggoo diggante reew yi"},
+    {"role": "user", "content": "ñaata reew ñoo ci bokk?"}
+]
+print(generate_response(messages))
+```
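To continue a conversation, append the model's reply and the next user message to the same list before generating again. A minimal sketch: `add_turn` is a hypothetical helper, not part of Transformers, and the placeholder strings stand in for a real model reply and a real follow-up question.

```python
def add_turn(messages, role, content):
    """Return a new history with one more turn; the input list is not modified."""
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"unexpected role: {role}")
    return messages + [{"role": role, "content": content}]

history = [{"role": "user", "content": "ñaata reew ñoo ci bokk?"}]
reply = "<model reply>"  # placeholder for generate_response(history)
history = add_turn(history, "assistant", reply)
history = add_turn(history, "user", "<next question>")
print([m["role"] for m in history])  # ['user', 'assistant', 'user']
```

Returning a new list rather than mutating the input keeps earlier snapshots of the conversation reusable, for example to retry a turn with a different temperature.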
+### 3.2 With vLLM
+```python
+from transformers import AutoTokenizer
+from vllm import LLM, SamplingParams
+tokenizer = AutoTokenizer.from_pretrained("Soyno/Oolel-v0.1-Instruct")
+# Pass the decoding hyperparameters
+sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)
+llm = LLM(model="Soyno/Oolel-v0.1-Instruct")
+prompt = "Kan mooy Youssou Ndour ?"
+messages = [
+    {"role": "system", "content": "You are Oolel, created by Soyno. You are a helpful assistant. Please always provide detailed and useful answers to the user queries"},
+    {"role": "user", "content": prompt}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+outputs = llm.generate([text], sampling_params)
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt}, Generated text: {generated_text}")
+```
+## 4. Bias, Risks, and Limitations
+While Oolel marks a significant milestone, we acknowledge its current limitations:
+- This is a first version, and its performance is expected to improve in future releases
+- Training data diversity can be further expanded
+- Specific domain expertise can be enhanced
+Future developments will focus on:
+- Enriching training data with comprehensive African historical content
+- Deeper integration of cultural contexts and nuances
+- Improving performance across various linguistic tasks
+- Strengthening the model's ability to handle complex cultural contexts
+## 5. Authors
+- **Yaya SY**: NLP Researcher (Efficient Continued Pretraining)
+- **Dioula DOUCOURE**: Data & NLP Engineer