from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Init: to update with your specific keys
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard 
    task0 = Task("BBH", "metric_name", "BBH")
    task1 = Task("GPQA", "metric_name", "GPQA")
    task2 = Task("IFEval", "metric_name", "IFEval")
    task3 = Task("MUSR", "metric_name", "MUSR")
    task4 = Task("GSM8K", "metric_name", "GSM8K")
    task5 = Task("MMMLU-fr", "metric_name", "MMMLU-fr")
    

# Your leaderboard name
TITLE = """<h1 align="center" id="space-title"> OpenLLM French leaderboard 🇫🇷</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """

Welcome to the French LLM Leaderboard, a pioneering platform dedicated to evaluating large language models (LLMs) in French. As multilingual LLMs advance, my mission is to highlight the models that excel specifically in French, providing benchmarks that drive progress in French LLMs and generative AI for the French language.
The Leaderboard draws its carefully selected benchmarks from this collection: https://huggingface.co/collections/le-leadboard/openllmfrenchleadboard-jeu-de-donnees-67126437539a23c65554fd88. The evaluations are generated and verified by both GPT-4 and human annotation, making this Leaderboard the most valuable and accurate tool for evaluating LLMs in French.

🚀 Submit Your Model 🚀

Do you have a French LLM? Submit it for evaluation using the EleutherAI Language Model Evaluation Harness for an in-depth performance analysis. Evaluation is currently manual for lack of resources; I hope to automate the process with the community's support. Learn more and contribute to advances in French AI on the "About" page.

Join the forefront of French language technology. Submit your model and let's advance French LLMs together!

"""

# Which evaluations are you running? How can people reproduce what you have?
# Plain string: the text has no placeholders, so no f-prefix is needed.
LLM_BENCHMARKS_TEXT = """
## How it works

## Reproducibility

I use LM-Evaluation-Harness-Turkish, a version of the LM Evaluation Harness adapted for Turkish datasets, to ensure the leaderboard results are both reliable and reproducible. See https://github.com/malhajar17/lm-evaluation-harness_turkish for more information.

## How to Reproduce Results:

1) Set up the repo: clone `lm-evaluation-harness_turkish` from https://github.com/malhajar17/lm-evaluation-harness_turkish and follow the installation instructions.
2) Run evaluations: to obtain the same results as on the leaderboard (some tests might show small variations), use the following command, adjusting it for your model. For example, with the Orbina/Orbita-v0.1 model:
```bash
lm_eval --model vllm --model_args pretrained=Orbina/Orbita-v0.1 --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2 --output /workspace/Orbina/Orbita-v0.1
```
3) Report results: the generated results file is then uploaded to the OpenLLM Turkish Leaderboard.

## Notes:

- I currently use the `vllm` backend, so results might differ slightly from those of the default LM Evaluation Harness backend.
- All tests use exactly the same configuration as the original Open LLM Leaderboard.

The tasks and few-shot parameters are:
- ARC: 25-shot, *arc-challenge* (`acc_norm`)
- HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
- TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
- MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
- Winogrande: 5-shot, *winogrande* (`acc`)
- GSM8k: 5-shot, *gsm8k* (`acc`)
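
As a rough illustration of the MMLU entry above, the sketch below takes the unweighted mean of per-subtask `acc` values. The sample scores and the `mmlu_average` helper are hypothetical and only show the arithmetic; real values come from the harness results file, and the harness may aggregate differently.

```python
from statistics import mean

# Hypothetical per-subtask accuracies, used only to illustrate the averaging.
subtask_acc = [
    ("hendrycksTest-abstract_algebra", 0.25),
    ("hendrycksTest-anatomy", 0.50),
    ("hendrycksTest-astronomy", 0.75),
]


def mmlu_average(results):
    # Unweighted mean of the `acc` values over all subtasks.
    return mean(acc for _, acc in results)


mmlu_score = mmlu_average(subtask_acc)
print(mmlu_score)  # 0.5 for the sample values above
```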

"""

EVALUATION_QUEUE_TEXT = """
## Some good practices before submitting a model

### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Branch, tag, or commit hash of the model repo to load; "main" is the default branch.
revision = "main"
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it. Stay tuned!
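
The check in step 1 can also be wrapped in a small helper that reports every failing loader instead of stopping at the first error. This is a minimal sketch: `check_loadable` and its `loaders` parameter are hypothetical names introduced here for illustration, not part of the leaderboard code.

```python
def check_loadable(model_id, revision="main", loaders=None):
    # Returns a list of (loader name, error message) pairs.
    # An empty list means every load succeeded.
    if loaders is None:
        # Imported lazily so the helper can be defined without transformers installed.
        from transformers import AutoConfig, AutoModel, AutoTokenizer

        loaders = [
            ("config", AutoConfig.from_pretrained),
            ("model", AutoModel.from_pretrained),
            ("tokenizer", AutoTokenizer.from_pretrained),
        ]

    failures = []
    for name, load in loaders:
        try:
            load(model_id, revision=revision)
        except Exception as exc:
            # Record the failure and keep going so all problems show up in one run.
            failures.append((name, str(exc)))
    return failures
```

Calling `check_loadable("your model name")` returns an empty list when the config, model, and tokenizer all load; otherwise each entry names the loader that failed and the error it raised.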

### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
Safetensors is a format for storing weights that is safer and faster to load and use. It will also allow us to add your model's parameter count to the `Extended Viewer`!

### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

### 4) Fill in your model card
When we add extra information about models to the leaderboard, it will be taken automatically from the model card.

## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
First make sure you have followed the steps above.
If you have, check that you can run the EleutherAI LM Evaluation Harness on your model locally, using the command given in the reproducibility instructions without modifications (you can add `--limit` to limit the number of examples per task).
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{openllm-French-leaderboard,
  author = {Mohamad Alhajar},
  title = {Open LLM French Leaderboard v0.2},
  year = {2024},
  publisher = {Mohamad Alhajar},
  howpublished = "\url{https://huggingface.co/spaces/le-leadboard/OpenLLMFrenchLeaderboard}"
}
"""