---
library_name: transformers
language:
- wo
- en
license: apache-2.0
pipeline_tag: text2text-generation
---

# Oolel: A High-Performing Open LLM for Wolof

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/62e335bbf15e7fce909fe5d4/liiZ1rAkiIgGpgN_jqwq6.mp4"></video>

Despite numerous open-source innovations in large language models, African languages have remained underrepresented.

**Soynade Research** is transforming this landscape with Oolel, the first open-source language model for Wolof.

Built on the **Qwen 2.5** architecture, Oolel combines state-of-the-art AI technology with deep Wolof linguistic expertise. Using carefully curated, high-quality data, we trained and optimized Oolel for the following tasks:

- **RAG** supporting Wolof queries with English, French, or Wolof context
- **Bidirectional translation between English and Wolof**
- **Natural text generation in Wolof**
- **Math in Wolof**
- **And many other standard NLP tasks**:
  - Summarization
  - Text editing
  - etc.
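
As a minimal sketch of the RAG use case listed above, the helper below packs a retrieved passage and a Wolof question into chat messages. The function name `build_rag_messages`, the `Context:` / `Question:` layout, and the sample Wolof question are illustrative assumptions; this card does not prescribe a RAG prompt format. The resulting list can be passed to the `generate_response` helper defined in the Usage section.

```python
# System prompt reused from the task examples in this card.
SYSTEM_PROMPT = (
    "You're a Wolof AI assistant. Please always provide detailed and "
    "useful answers to the user queries."
)


def build_rag_messages(context, question, system_prompt=SYSTEM_PROMPT):
    """Return chat messages that pair a retrieved passage with a question.

    The "Context:" / "Question:" layout is an assumption made for this
    sketch, not a format documented for Oolel.
    """
    user_content = f"Context:\n{context.strip()}\n\nQuestion: {question.strip()}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


# English context (borrowed from the translation example in this card)
# combined with an illustrative Wolof question.
rag_messages = build_rag_messages(
    context="Bassirou Diomaye Faye is the new Senegalese president. He is 44 years old",
    question="Bassirou Diomaye Faye, ñaata at la am?",
)
print(rag_messages[1]["content"])
```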

## Usage

**Important: always include a system prompt.**

The snippet below uses `apply_chat_template` to show how to load the tokenizer and model and how to generate text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "soynade-research/Oolel-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("soynade-research/Oolel-v0.1")


def generate_response(messages, max_new_tokens=1024, temperature=0.1):
    # Render the chat messages into a single prompt string.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # Move the inputs to the device chosen by device_map="auto".
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **model_inputs,  # passes input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sampling must be enabled for temperature to apply
        temperature=temperature,
    )
    # Drop the prompt tokens so only the newly generated text is decoded.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response
```

**Example tasks:**

1. **Translation Tasks**

```python
system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Translate to Wolof: Bassirou Diomaye Faye is the new Senegalese president. He is 44 years old"}
]
print(generate_response(messages))
```
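
Because Oolel is trained for translation in both directions, a small helper can build the prompt for either one. This is only a sketch: `translation_messages` is a hypothetical function, and the `Translate to English:` prefix is an assumption that mirrors the `Translate to Wolof:` instruction shown above. Each returned list can be passed to `generate_response`.

```python
# System prompt reused from the task examples in this card.
SYSTEM_PROMPT = (
    "You're a Wolof AI assistant. Please always provide detailed and "
    "useful answers to the user queries."
)


def translation_messages(text, target_language="Wolof"):
    """Build chat messages asking for a translation into `target_language`."""
    if target_language not in ("Wolof", "English"):
        raise ValueError("target_language must be 'Wolof' or 'English'")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        # The "Translate to <language>:" prefix for English is an assumption.
        {"role": "user", "content": f"Translate to {target_language}: {text}"},
    ]


# English -> Wolof, as in the example above.
to_wolof = translation_messages("He is 44 years old")
# Wolof -> English, using a Wolof prompt that appears elsewhere in this card.
to_english = translation_messages(
    "Bindal ab léeb ci gaynde gi lekk muus mi", target_language="English"
)
print(to_wolof[-1]["content"])
print(to_english[-1]["content"])
```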

2. **Code Generation**

```python
system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Bindal ab klaas Python buy wone ni ñuy jëfandikoo dataframe yi ci Pandas"}
]
print(generate_response(messages))
```

3. **Problem Solving**

```python
from pprint import pprint

system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Ndax nga mën ma won ni ñuy resolver problème bii: Fatou dafa jënd 3 kilo ceeb, 2 kilo diw ak 5 kilo sukër. Ceeb gi wenn kilo 500 CFA la, diw gi 1200 CFA kilo bi, sukër gi 750 CFA kilo bi. Ñaata la wara fay?"}
]
pprint(generate_response(messages))
```

4. **Text Generation** (e.g. story generation)

```python
system_prompt = "You are a skilled Wolof storyteller (Gewël) with deep knowledge of African folktales and traditions. Write engaging stories in Wolof that reflect African cultural values and wisdom."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Bindal ab léeb ci gaynde gi lekk muus mi"}
]
print(generate_response(messages, temperature=0.9))
```

5. **Multi-turn Conversations**

Oolel is not optimized for multi-turn conversations, but you can try it!

```python
messages = [
    {"role": "user", "content": "Wax ma lan mooy CEDEAO ? Ci lan la liggeey?"},
    {"role": "assistant", "content": "CEDEAO mooy 'organisation' gu boole reew yi nekk ci pennc Afrika bi. Mu ngi sukkandiku ci wàll économie, politig, ak déggoo diggante reew yi"},
    {"role": "user", "content": "ñaata reew ñoo ci bokk?"}
]
print(generate_response(messages))
```
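
To keep a conversation going beyond a fixed list of messages, each reply has to be appended to the history before the next user turn is added. The sketch below shows only that bookkeeping. `chat_turn` is a hypothetical helper, and `echo_model` is a stand-in for `generate_response` so the example runs without loading the model; starting the history with a system prompt is an assumption based on the note in the Usage section.

```python
def chat_turn(history, user_message, generate_fn):
    """Append a user turn, generate a reply, and record it in `history`."""
    history.append({"role": "user", "content": user_message})
    reply = generate_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def echo_model(messages):
    # Stand-in for generate_response(messages); swap in the real helper
    # from the Usage section to query Oolel.
    return f"(reply to: {messages[-1]['content']})"


# Start the history with a system prompt, then add turns one at a time.
history = [
    {
        "role": "system",
        "content": "You're a Wolof AI assistant. Please always provide "
        "detailed and useful answers to the user queries.",
    }
]
chat_turn(history, "Wax ma lan mooy CEDEAO ? Ci lan la liggeey?", echo_model)
chat_turn(history, "ñaata reew ñoo ci bokk?", echo_model)
print([m["role"] for m in history])
# prints ['system', 'user', 'assistant', 'user', 'assistant']
```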

## Authors

- [**Yaya SY**](https://x.com/seygalare): NLP Researcher (Efficient Continued Pretraining)
- [**Dioula DOUCOURE**](https://x.com/DioulaD): Data & NLP Engineer