---
library_name: transformers
language:
- wo
- en
license: apache-2.0
pipeline_tag: text2text-generation
---

# Oolel: A High-Performing Open LLM for Wolof

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/62e335bbf15e7fce909fe5d4/liiZ1rAkiIgGpgN_jqwq6.mp4"></video>


Despite numerous open-source innovations in large language models, African languages have remained underrepresented.

**Soynade Research** is transforming this landscape with Oolel, the first open-source language model for Wolof.

Built on the **Qwen 2.5** architecture, Oolel combines state-of-the-art AI technology with deep Wolof linguistic expertise. Using carefully curated, high-quality data, we trained and optimized Oolel for the following tasks:

- **RAG** supporting Wolof queries with English, French, or Wolof context
- **Bidirectional translation between English and Wolof**
- **Natural text generation in Wolof**
- **Math in Wolof**
- **And many other standard NLP tasks**:
    - Summarization
    - Text editing
    - etc.
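
For the RAG use case, the retrieved passages have to be placed in the prompt alongside the Wolof question. The sketch below shows one possible way to do that. The `build_rag_messages` helper, the `Context:` / `Question:` layout, and the sample passage are assumptions made for illustration; the prompt format Oolel was actually trained on for RAG is not documented here.

```python
# Hypothetical helper: the prompt layout is an assumption, not Oolel's official RAG format.
DEFAULT_SYSTEM_PROMPT = (
    "You're a Wolof AI assistant. "
    "Please always provide detailed and useful answers to the user queries."
)


def build_rag_messages(question, passages, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Pair a Wolof question with retrieved passages (English, French, or Wolof)."""
    # Number each passage so the answer can be traced back to its source.
    context = "\n\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )
    user_content = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


passages = [
    "The Economic Community of West African States (ECOWAS, CEDEAO in French) "
    "was founded in 1975.",
]
messages = build_rag_messages("Lan mooy CEDEAO?", passages)
print(messages[1]["content"])
```

The resulting `messages` list can be passed to the `generate_response` function from the Usage section in the same way as the other examples.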

## Usage

**!!! It's important to add your system prompt !!!**

The snippet below uses `apply_chat_template` to show how to load the tokenizer and model and how to generate text.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "soynade-research/Oolel-v0.1"

# device_map="auto" places the weights on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)


def generate_response(messages, max_new_tokens=1024, temperature=0.1):
    # Render the chat messages into a single prompt string.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # Move the inputs to the model's device instead of assuming "cuda".
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **model_inputs,  # passes input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        do_sample=True,  # temperature only takes effect when sampling is enabled
        temperature=temperature,
    )
    # Keep only the newly generated tokens, dropping the prompt.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```


**Example tasks:**

1. **Translation Tasks**

```python
system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Translate to Wolof: Bassirou Diomaye Faye is the new Senegalese president. He is 44 years old"}
]
print(generate_response(messages))
```

2. **Code generation**

```python
system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Bindal ab klaas Python buy wone ni ñuy jëfandikoo dataframe yi ci Pandas"}
]
print(generate_response(messages))
```

3. **Problem Solving**

```python
from pprint import pprint

system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Ndax nga mën ma won ni ñuy resolver problème bii: Fatou dafa jënd 3 kilo ceeb, 2 kilo diw ak 5 kilo sukër. Ceeb gi wenn kilo 500 CFA la, diw gi 1200 CFA kilo bi, sukër gi 750 CFA kilo bi. Ñaata la wara fay?"}
]
pprint(generate_response(messages))
```
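
To check the model's answer to the problem above, the expected total can be computed directly. This is a small sanity check rather than part of the model's usage; the quantities and prices are those given in the prompt (3 kg of rice at 500 CFA, 2 kg of oil at 1200 CFA, 5 kg of sugar at 750 CFA), while the `basket` variable name is only illustrative.

```python
# Quantities (kg) and unit prices (CFA per kg), as stated in the prompt.
basket = {
    "ceeb": (3, 500),    # rice
    "diw": (2, 1200),    # oil
    "sukër": (5, 750),   # sugar
}

# Cost per item, then the amount Fatou has to pay in total.
subtotals = {item: qty * price for item, (qty, price) in basket.items()}
total = sum(subtotals.values())

print(subtotals)  # {'ceeb': 1500, 'diw': 2400, 'sukër': 3750}
print(total)      # 7650
```

A correct response should therefore arrive at 7650 CFA.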


4. **Text Generation** (e.g. story generation)

```python
system_prompt = "You are a skilled Wolof storyteller (Gewël) with deep knowledge of African folktales and traditions. Write engaging stories in Wolof that reflect African cultural values and wisdom."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Bindal ab léeb ci gaynde gi lekk muus mi"}
]
print(generate_response(messages, temperature=0.9))
```

5. **Multi-turn conversations**

Oolel is not optimized for multi-turn conversations, but you can try it!

```python
system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Wax ma clan mooy CEDEAO ? Ci lan la liggeey?"},
    {"role": "assistant", "content": "CEDEAO mooy 'organisation' gu boole reew yi nekk ci pennc Afrika bi. Mu ngi sukkandiku ci wàll économie, politig, ak déggoo diggante reew yi"},
    {"role": "user", "content": "ñaata reew ñoo ci bokk?"}
]
print(generate_response(messages))
```
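
To keep a conversation going beyond a fixed list of messages, each model reply has to be appended to the history before the next user turn. The sketch below shows one way to manage that. The `chat_turn` helper and the `echo_generate` stand-in are hypothetical names introduced here so the example runs without loading the model; in practice you would pass `generate_response` as `generate_fn`.

```python
def chat_turn(history, user_text, generate_fn):
    """Append a user turn, call the generator, record the reply, and return it."""
    history.append({"role": "user", "content": user_text})
    reply = generate_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


# Stand-in generator (an assumption for illustration); replace with generate_response.
def echo_generate(messages):
    return f"(reply to: {messages[-1]['content']})"


history = [
    {
        "role": "system",
        "content": "You're a Wolof AI assistant. Please always provide detailed "
                   "and useful answers to the user queries.",
    }
]
chat_turn(history, "Wax ma clan mooy CEDEAO ? Ci lan la liggeey?", echo_generate)
chat_turn(history, "ñaata reew ñoo ci bokk?", echo_generate)

# Show the roles in order to confirm the turns alternate after the system prompt.
print([message["role"] for message in history])
```

Because the whole history is sent on every turn, long conversations grow the prompt; truncating older turns is left out of this sketch.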

## Authors
- [**Yaya SY**](https://x.com/seygalare): NLP Researcher (Efficient Continued Pretraining)
- [**Dioula DOUCOURE**](https://x.com/DioulaD): Data & NLP Engineer