---
library_name: transformers
language:
- wo
- en
license: apache-2.0
---

# Oolel: Pioneering Open Source Wolof Language Model

## 1. Model Description

**Oolel** is Africa's first open-source language model designed specifically for **Wolof**, one of West Africa's major languages. Created by **Soyno**, it brings cutting-edge language technology to Wolof speakers, making powerful AI tools accessible to everyone.

Built on the **Qwen 2.5** architecture, Oolel combines state-of-the-art AI advances with Wolof linguistic expertise. This pioneering work stands as a testament to Soyno's commitment to developing AI solutions by Africans, for Africa, and marks a significant step toward the continent's technological sovereignty.

- **Developed by:** Soynade Research (Soyno)
- **Supported languages:** Wolof, English, French
- **Status:** This is a static model trained on an offline dataset. Future versions may be released to improve its capabilities.
- **Model release date:** Dec 04, 2024
- **License:** Apache 2.0
- **Finetuned from model:** Qwen 2.5

## 2. Key Features and Capabilities

Oolel demonstrates proficiency in:

- Bidirectional translation between English and Wolof
- Natural text generation in Wolof
- Code generation from Wolof instructions
- Standard LLM tasks, including:
  - Question answering
  - Summarization
  - Contextual understanding

## 3. Usage

The snippet below shows how to load the tokenizer and model, build the prompt with `apply_chat_template`, and generate text.

### 3.1 With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Soyno/Oolel-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Soyno/Oolel-v0.1")

def generate_response(messages, max_new_tokens=1024, temperature=0.1):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        do_sample=True,  # enable sampling so temperature takes effect
    )
    # Keep only the newly generated tokens, dropping the prompt
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

**Some task examples:**

1. **Translation Tasks**

```python
messages = [
    {"role": "user", "content": "Translate to Wolof: Bassirou Diomaye Faye is the new Senegalese president. He is 44 years old"}
]
print(generate_response(messages))
```

2. **Code generation**

```python
messages = [
    {"role": "user", "content": "Bindal ab klaas Python buy wone ni ñuy jëfandikoo dataframe yi ci Pandas"}
]
print(generate_response(messages))
```

3. **Problem Solving**

```python
from pprint import pprint

system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries"
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Ndax nga mën ma won ni ñuy resolver problème bii: Fatou dafa jënd 3 kilo ceeb, 2 kilo diw ak 5 kilo sukër. Ceeb gi wenn kilo 500 CFA la, diw gi 1200 CFA kilo bi, sukër gi 750 CFA kilo bi. Ñaata la wara fay?"}
]
pprint(generate_response(messages))
```

4. **Text Generation** (e.g. story generation)

```python
system_prompt = (
    "You are a skilled Wolof storyteller (Gewël) with deep knowledge of African folktales and traditions. "
    "Write engaging stories in Wolof that reflect African cultural values and wisdom."
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Bindal ab léeb ci gaynde gi lekk muus mi"}
]
print(generate_response(messages, temperature=0.9))
```

5. **Multi-turn conversations**

```python
messages = [
    {"role": "user", "content": "Wax ma clan mooy CEDEAO ? Ci lan la liggeey?"},
    {"role": "assistant", "content": "CEDEAO mooy 'organisation' gu boole reew yi nekk ci pennc Afrika bi. Mu ngi sukkandiku ci wàll économie, politig, ak déggoo diggante reew yi"},
    {"role": "user", "content": "ñaata reew ñoo ci bokk?"}
]
print(generate_response(messages))
```
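6. **Summarization** (a minimal sketch reusing the `generate_response` helper above; the prompt wording and sample text are illustrative assumptions, not taken from the original card)

```python
# Illustrative summarization prompt (an assumption, not from the model card),
# reusing the generate_response helper defined in section 3.1.
article = (
    "Senegal is a country in West Africa. Its capital is Dakar, and Wolof is "
    "one of the most widely spoken languages alongside the official French."
)
messages = [
    {"role": "user", "content": f"Summarize the following text in Wolof:\n\n{article}"}
]
print(generate_response(messages))
```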
### 3.2 With vLLM

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("Soyno/Oolel-v0.1-Instruct")

# Set the decoding hyperparameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    max_tokens=512,
)

llm = LLM(model="Soyno/Oolel-v0.1-Instruct")

prompt = "Kan mooy Youssou Ndour ?"
messages = [
    {"role": "system", "content": "You are Oolel, created by Soyno. You are a helpful assistant. Please always provide detailed and useful answers to the user queries"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate([text], sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}, Generated text: {generated_text}")
```

## 4. Bias, Risks, and Limitations

While Oolel marks a significant milestone, we acknowledge its current limitations:

- As a first version, the model's performance still has room to improve
- Training data diversity can be further expanded
- Domain-specific expertise can be enhanced

Future developments will focus on:

- Enriching training data with comprehensive African historical content
- Deeper integration of cultural contexts and nuances
- Improving performance across various linguistic tasks
- Strengthening the model's ability to handle complex cultural contexts

## 5. Authors

- **Yaya SY**: NLP Researcher (Efficient Continued Pretraining)
- **Dioula DOUCOURE**: Data & NLP Engineer