	---
	library_name: transformers
	language:
	- wo
	- en
	license: apache-2.0
	pipeline_tag: text2text-generation
	---

	# Oolel: A High-Performing Open LLM for Wolof

	<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/62e335bbf15e7fce909fe5d4/liiZ1rAkiIgGpgN_jqwq6.mp4"></video>


	Despite numerous open-source innovations in large language models, African languages have remained underrepresented.

	Soynade Research is transforming this landscape with Oolel, the first open-source language model for Wolof.

	Built on the Qwen 2.5 architecture, Oolel combines state-of-the-art AI technology with deep Wolof linguistic expertise. Using carefully curated, high-quality data, we trained and optimized Oolel for the following tasks:

	- RAG supporting Wolof queries with English, French, or Wolof context
	- Bidirectional translation between English and Wolof
	- Natural text generation in Wolof
	- Math in Wolof
	- Many other standard NLP tasks, such as:
	  - Summarization
	  - Text editing
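
To make the RAG use case listed above concrete, here is a minimal sketch of how retrieved passages might be packed into a chat prompt. The helper name `build_rag_messages`, the prompt wording, and the passage numbering are assumptions made for illustration; this model card does not document a required RAG prompt format.

```python
def build_rag_messages(question, passages, system_prompt):
    """Pack retrieved passages and a Wolof question into chat messages.

    Hypothetical helper: the layout below is an assumption, not a format
    documented for Oolel.
    """
    if not passages:
        raise ValueError("at least one context passage is required")
    # Number the passages so an answer can refer back to them.
    context = "\n\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )
    user_content = f"Context:\n{context}\n\nQuestion: {question.strip()}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_rag_messages(
    # Illustrative Wolof question (roughly: "Who is the president of Senegal?").
    question="Kan mooy njiitu réewum Senegaal?",
    passages=["Bassirou Diomaye Faye is the new Senegalese president. He is 44 years old."],
    system_prompt="You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries.",
)
print(messages[1]["content"])
```

The resulting list can be passed to any chat-template-based generation helper, such as the `generate_response` function in the Usage section.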

	## Usage

	**Important:** always include a system prompt in your messages.

	The following snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate text.
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	# device_map="auto" lets Accelerate place the weights on the available device(s).
	model = AutoModelForCausalLM.from_pretrained(
	    "soynade-research/Oolel-v0.1",
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained("soynade-research/Oolel-v0.1")


	def generate_response(messages, max_new_tokens=1024, temperature=0.1):
	    # Render the chat messages into the prompt format the model expects.
	    text = tokenizer.apply_chat_template(
	        messages,
	        tokenize=False,
	        add_generation_prompt=True,
	    )
	    # Move inputs to wherever the model was placed instead of hard-coding "cuda".
	    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
	    generated_ids = model.generate(
	        **model_inputs,  # passes input_ids and attention_mask
	        max_new_tokens=max_new_tokens,
	        do_sample=True,  # temperature only has an effect when sampling
	        temperature=temperature,
	    )
	    # Drop the prompt tokens so that only the newly generated text is decoded.
	    generated_ids = [
	        output_ids[len(input_ids):]
	        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
	    ]
	    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
	```
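
Since the system prompt matters for output quality, one option is to wrap message construction in a small helper so it is never forgotten. This is only a sketch: `with_system_prompt` and `DEFAULT_SYSTEM_PROMPT` are hypothetical names, and the default text is simply the prompt used in the examples in this README.

```python
DEFAULT_SYSTEM_PROMPT = (
    "You're a Wolof AI assistant. Please always provide detailed "
    "and useful answers to the user queries."
)


def with_system_prompt(messages, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Return a copy of `messages` that is guaranteed to start with a system turn."""
    if messages and messages[0]["role"] == "system":
        return list(messages)  # already has one; leave it untouched
    return [{"role": "system", "content": system_prompt}, *messages]


messages = with_system_prompt(
    [{"role": "user", "content": "Translate to Wolof: Good morning"}]
)
print([m["role"] for m in messages])  # ['system', 'user']
```

The helper returns a new list rather than mutating its argument, so the same user messages can be reused with different system prompts.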


	Some example tasks:

	1. Translation Tasks

	```python
	system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
	messages = [
	    {"role": "system", "content": system_prompt},
	    {"role": "user", "content": "Translate to Wolof: Bassirou Diomaye Faye is the new Senegalese president. He is 44 years old"},
	]
	print(generate_response(messages))
	```

	2. Code generation

	```python
	system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries"
	messages = [
	    {"role": "system", "content": system_prompt},
	    {"role": "user", "content": "Bindal ab klaas Python buy wone ni ñuy jëfandikoo dataframe yi ci Pandas"},
	]
	print(generate_response(messages))
	```

	3. Problem Solving

	```python
	system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
	messages = [
	    {"role": "system", "content": system_prompt},
	    {"role": "user", "content": "Ndax nga mën ma won ni ñuy resolver problème bii: Fatou dafa jënd 3 kilo ceeb, 2 kilo diw ak 5 kilo sukër. Ceeb gi wenn kilo 500 CFA la, diw gi 1200 CFA kilo bi, sukër gi 750 CFA kilo bi. Ñaata la wara fay?"},
	]
	from pprint import pprint
	pprint(generate_response(messages))
	```
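
For reference when judging the model's answer, the expected total for this word problem (3 kg of rice at 500 CFA, 2 kg of oil at 1200 CFA, 5 kg of sugar at 750 CFA) can be checked directly. The variable names below are only illustrative.

```python
# Quantities (kg) and unit prices (CFA per kg) taken from the prompt above.
basket = {
    "ceeb": (3, 500),    # rice
    "diw": (2, 1200),    # oil
    "sukër": (5, 750),   # sugar
}

# Cost per item, then the grand total Fatou has to pay.
line_totals = {item: qty * price for item, (qty, price) in basket.items()}
total = sum(line_totals.values())

print(line_totals)  # {'ceeb': 1500, 'diw': 2400, 'sukër': 3750}
print(total)        # 7650
```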


	4. Text Generation (e.g. story generation)

	```python
	system_prompt = "You are a skilled Wolof storyteller (Gewël) with deep knowledge of African folktales and traditions. Write engaging stories in Wolof that reflect African cultural values and wisdom."
	messages = [
	    {"role": "system", "content": system_prompt},
	    {"role": "user", "content": "Bindal ab léeb ci gaynde gi lekk muus mi"},
	]
	print(generate_response(messages, temperature=0.9))
	```

	5. Multi-turn conversations
	Oolel is not optimized for multi-turn conversations, but you can try it!
	```python
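	system_prompt = "You're a Wolof AI assistant. Please always provide detailed and useful answers to the user queries."
	messages = [
	    {"role": "system", "content": system_prompt},
	    {"role": "user", "content": "Wax ma lan mooy CEDEAO? Ci lan la liggeey?"},
	    {"role": "assistant", "content": "CEDEAO mooy 'organisation' gu boole reew yi nekk ci pennc Afrika bi. Mu ngi sukkandiku ci wàll économie, politig, ak déggoo diggante reew yi"},
	    {"role": "user", "content": "ñaata reew ñoo ci bokk?"},
	]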
	print(generate_response(messages))
	```
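
If you do experiment with multi-turn use, the history has to be carried forward by hand. The sketch below keeps that bookkeeping in one place. `ChatSession` is a hypothetical helper, and `generate_fn` stands for any callable that takes a message list and returns a string, such as `generate_response` from the Usage section; a stub is used here so the snippet runs without loading the model.

```python
class ChatSession:
    """Accumulate chat turns so each request sees the full history (hypothetical helper)."""

    def __init__(self, generate_fn, system_prompt=None):
        self.generate_fn = generate_fn
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text):
        # Add the user turn, generate on the whole history, then store the reply.
        self.messages.append({"role": "user", "content": user_text})
        reply = self.generate_fn(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def echo_stub(messages):
    """Stand-in for the model: reports how many messages it received."""
    return f"(stub reply after {len(messages)} messages)"


session = ChatSession(echo_stub, system_prompt="You're a Wolof AI assistant.")
session.ask("Wax ma lan mooy CEDEAO?")
print(session.ask("ñaata reew ñoo ci bokk?"))  # (stub reply after 4 messages)
```

To use the real model, pass `generate_response` in place of `echo_stub`.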

	## Authors
	- [Yaya SY](https://x.com/seygalare): NLP Researcher (Efficient Continued Pretraining)
	- [Dioula DOUCOURE](https://x.com/DioulaD): Data & NLP Engineer