Darija Chatbot Arena: Making LLMs Compete in the Moroccan Dialect
Link to the space: https://huggingface.co/spaces/atlasia/darija-chatbot-arena
Abstract
We introduce Darija Chatbot Arena, an innovative platform designed to facilitate the comparison of responses from various Large Language Models (LLMs) on a diverse set of prompts in Darija, the Moroccan Arabic dialect. Our platform aims to provide a comprehensive evaluation of LLMs' capabilities in understanding and generating responses in Darija, a language that remains underrepresented in the AI landscape.
Figure 1: Example of an arena battle. We notice that LLMs struggle in some domains, such as idioms.
In this blog post, we present our initial findings, which include an analysis of model performance and the release of an evolving leaderboard, ranked using the Elo rating system. This rating system is broadly recognized for its effectiveness in ranking competitors based on relative performance, and offers an objective measure to compare LLMs based on user feedback. Our goal is to foster collaboration and engagement while advancing the development of AI systems that better serve the Moroccan community and beyond.
We invite the broader Moroccan community, including researchers, language enthusiasts, and native speakers, to actively participate by rating model responses to a curated set of prompts. Your contributions will help refine the rankings and provide valuable insights into the strengths and weaknesses of different models in handling the unique linguistic nuances of Darija.
Figure 2: Battles Leaderboard.
Introduction
Large Language Models (LLMs) have revolutionized natural language processing, enabling impressive capabilities across numerous languages and domains. However, most state-of-the-art LLMs struggle with underrepresented languages and dialects, such as Moroccan Arabic, commonly known as Darija. To address this gap, we introduce Darija ChatBot Arena, a community-driven initiative to evaluate and compare the performance of leading LLMs on Darija.
Why Darija?
Darija is a spoken variety of Arabic unique to Morocco. Although it is characterized by a rich blend of Arabic, Berber, French, and Spanish influences, it lacks significant representation in NLP datasets and benchmarks, making it a challenging language for LLMs to handle effectively.
Objectives
The primary goals of this project are:
- Benchmarking Performance: Evaluate how different LLMs perform in understanding and generating Darija text.
- Fostering Research: Encourage the development of Darija-specific models and datasets.
- Community Engagement: Involve native speakers and NLP enthusiasts in fine-tuning and feedback collection.
How does it work?
Inspired by similar projects, Darija ChatBot Arena provides a platform where users can interact with various LLMs in order to compare their responses to prompts in Darija. Users can:
1- Select Prompts: Roll the dice to draw a random prompt from a set of nearly 300 phrases and receive responses from two competing LLMs.
2- Vote on Responses: Choose the most accurate, fluent, and culturally aligned response.
3- Analyze Results: View aggregated results on the Leaderboard to identify the most effective models.
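The three steps above can be sketched as a minimal battle loop. The prompts, model names, and record layout below are illustrative placeholders, not the platform's actual data or API:

```python
import random

# Illustrative sketch of the arena loop; prompts, model names, and the
# record structure are placeholders, not the platform's actual internals.
prompts = ["Prompt 1 in Darija", "Prompt 2 in Darija"]
models = ["model_a", "model_b", "model_c"]

def new_battle(prompts, models):
    """Step 1: roll the dice -- pick a random prompt and two distinct models."""
    prompt = random.choice(prompts)
    left, right = random.sample(models, 2)
    return {"prompt": prompt, "left": left, "right": right}

def record_vote(battle, winner):
    """Step 2: attach the user's vote; step 3 then aggregates these
    records into the leaderboard."""
    assert winner in (battle["left"], battle["right"])
    battle["winner"] = winner
    return battle

battle = new_battle(prompts, models)
vote = record_vote(battle, battle["left"])
```

Each stored vote is a single pairwise comparison; the leaderboard in Figure 2 is computed from the accumulated records.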
We currently support models from leading organizations including:
- Google: Gemini-1.5
- Meta: Llama-3.3-70B-Instruct / Llama-3.1-Nemotron / Llama-3.1-405B / Llama-3-8B-Instruct
- Anthropic: Claude-3.5-Sonnet
- OpenAI: ChatGPT-4o-Latest / GPT-4o / GPT-4o-Mini
- Alibaba: Qwen-2.5-72B-Instruct / QwQ-32B-Preview
- xAI: Grok-beta
- Cohere: C4AI-Command-R-Plus
- DeepSeek: DeepSeek-V3
Other models will also be added in future versions of the arena.
Prompts
Figure 3: Histogram of the number of prompts per class.
The histogram shows the distribution of prompts across the different classes in the dataset, which is useful for evaluating language models on a diverse set of conversational inputs in Moroccan Arabic. We made "General knowledge" the dominant class in order to assess each model's broad Darija language understanding and to indicate whether it has learned common facts, concepts, and linguistic conventions relevant to everyday Moroccan life.
The "Idioms" class is crucial for evaluating a model's suitability for real-world Moroccan Arabic applications. As we all know, Moroccan Arabic is rich in idiomatic language, and a model's ability to understand and generate common Moroccan idioms is a key indicator of its cultural and linguistic competence. Strong performance on these prompts would suggest the model has internalized the nuanced, context-dependent meanings of Moroccan Arabic idioms. Weak results could signal gaps in the model's grasp of this important aspect of the language. Analyzing a model's handling of Moroccan Arabic idioms provides valuable insights into its overall capability to engage in natural, fluid conversations in this domain.
The "Cultural knowledge", "Religion", and "Geography" classes are also crucial, as they indicate the model's familiarity with Moroccan cultural values, history, and regional knowledge. These prompt types test the model's knowledge of the cultural references and geographic information specific to Morocco.
Additionally, because Morocco is well known for its delicious cuisine and world-class sports teams, it is necessary to evaluate the models on these domains, as they shape the identity of Morocco. This is why we have included samples from the "Gastronomy" and "Sports" classes. Note that our athletes both excel in world championships and enjoy delicious, healthy meals.
Finally, classes like "Long sentences", "Humor", and "Mixed language" test the model's ability to handle complex linguistic phenomena common in Moroccan Arabic. Strong results here would indicate the model has internalized Moroccan-specific syntax, idioms, and code-switching patterns. Analyzing model performance holistically across this diverse set of prompt classes can provide rich insights into the model's suitability for real-world Moroccan Arabic applications.
Analyzing the leaderboard
Elo score rating
We rank the models by their Elo score, a rating system widely used to rank competitors from pairwise outcomes: after each battle, the winner gains rating points and the loser loses them, with the size of the update depending on the rating gap between the two models.
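For illustration, a single battle's rating update might look like the following sketch, assuming the standard logistic Elo formula with a 400-point scale and a K-factor of 32 (the arena's actual parameters may differ):

```python
# Minimal Elo update for one battle, assuming K = 32 and the standard
# 400-point logistic scale; these constants are illustrative choices.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated models: the winner gains exactly k/2 points.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

Because the update is proportional to the surprise of the outcome, beating a much stronger model moves a rating far more than beating a weaker one.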
Figure 4: Performance of each model in the arena computed with Elo score.
This leaderboard ranks our selected AI models based on their performance in understanding the Moroccan Arabic prompts described above and generating accurate responses. Each model's Elo score reflects its ability to consistently outperform others in head-to-head comparisons, with higher scores indicating more frequent victories. ChatGPT-4o-Latest leads the rankings, showcasing exceptional proficiency in understanding and generating Darija, followed closely by Gemini-1.5-Pro and GPT-4o. Claude-3.5-Sonnet and GPT-4o-Mini also perform well, securing high positions. Mid-tier models like Llama-3.1-Nemotron and Grok-beta demonstrate competitive capabilities, while models such as Llama-3-8B-Instruct and QwQ-32B-Preview sit lower on the chart.
Win matrix
Here we show the win-rate matrix for some of the top models in Darija to illustrate how they compete against one another.
Figure 5: Model to model win-rate matrix for 6 selected models (with the most votes).
This win matrix reveals interesting dynamics in model performance, showing how the models fare against each other in head-to-head comparisons. ChatGPT-4o-Latest emerges as the strongest, consistently outperforming the others with a high win rate across almost all matchups. Models like Claude-3.5-Sonnet and Gemini-1.5-Pro are competitive but relatively weaker against the strongest models, indicating room for improvement in certain areas. On the other hand, models such as Grok-beta and Llama-3.1-405B have significantly lower win rates against most competitors, suggesting they are less capable in the scenarios tested. These results provide valuable insights into the relative strengths of the models, helping to identify which models are most suitable for specific applications and which need further refinement.
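A matrix like the one in Figure 5 can be derived directly from the raw battle records. Here is a minimal sketch; the model names and votes are hypothetical, not actual arena data:

```python
from collections import defaultdict

# Hypothetical battle log: (model_a, model_b, winner). Names and
# outcomes are illustrative, not real arena results.
battles = [
    ("gpt-4o", "claude-3.5-sonnet", "gpt-4o"),
    ("gpt-4o", "gemini-1.5-pro", "gemini-1.5-pro"),
    ("claude-3.5-sonnet", "gemini-1.5-pro", "claude-3.5-sonnet"),
    ("gpt-4o", "claude-3.5-sonnet", "gpt-4o"),
]

def win_rate_matrix(battles):
    """Return {(model, opponent): fraction of their battles model won}."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for a, b, winner in battles:
        games[(a, b)] += 1
        games[(b, a)] += 1
        loser = b if winner == a else a
        wins[(winner, loser)] += 1
    return {pair: wins[pair] / games[pair] for pair in games}

matrix = win_rate_matrix(battles)
print(matrix[("gpt-4o", "claude-3.5-sonnet")])  # 1.0
```

Note that cell (A, B) and cell (B, A) always sum to 1 when ties are excluded, which is why Figure 5 is anti-symmetric around the diagonal.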
Conclusion and Future work
To sum up, our Darija Chatbot Arena aims to evaluate the performance of different state-of-the-art models based on human feedback, in the absence of a dedicated benchmark for this purpose. Overall, it shows that only a few LLMs can understand complex Moroccan Arabic sentences and queries, while the rest would need a fine-tuning step beforehand. Our next goal is to add more models to the Arena, such as Atlas-Chat and Fanar, and to include a larger batch of prompts to make the evaluation more diverse and cover as many areas of Moroccan culture as possible.
Acknowledgment
We are grateful for all the project collaborators: Aymane El Firdoussi, Abdeljalil El Majjodi, Ihssane Nedjaoui, Zaid Chiech, Miloud Belarebia, Yousef Khoubrane, Ali El Filali, Badr Barbara, Hafsaa Ouifak, Imane Momayiz, Mounir Afifi, Ouael Ettouileb, Oumnia Ennaji, Nouamane Tazi, Khaoula Alaoui Belghiti, Oumayma Essarhi and Adnan Anouzla.
Join Us
- Website: https://www.atlasia.ma/
- HuggingFace community: https://huggingface.co/atlasia
Citation
@article{atlasia2025darija-chatbot-arena,
title={Darija Chatbot Arena: Making LLMs Compete in the Moroccan Dialect},
author={Aymane El Firdoussi and Abdeljalil El Majjodi and Ihssane Nedjaoui},
year={2025},
url={https://huggingface.co/blog/atlasia/darija-chatbot-arena}
organization={AtlasIA}
}
Appendix
Figure 6: Full win-rate matrix.