OpenLLMFrenchLeaderboard

malhajar committed on Oct 21, 2024

Commit

be8be25

1 Parent(s): 685b0e2

Update src/display/about.py

Files changed (1)

src/display/about.py +3 -3

@@ -44,10 +44,10 @@ LLM_BENCHMARKS_TEXT = f"""
 ## Reproductibilité
 Nous utilisons une version adaptée de LM Evaluation Harness [github](https://github.com/EleutherAI/lm-evaluation-harness) pour garantir que les résultats de notre classement sont à la fois fiables et reproductibles.
 ## Comment reproduire les résultats :
-1) Configurer le dépôt : Clonez le "XXXX" depuis https://github.com/xxx et suivez les instructions d'installation.
 2) Effectuer les évaluations : Pour obtenir les mêmes résultats que ceux du classement (certains tests peuvent montrer de petites variations), utilisez la commande suivante, en l'ajustant à votre modèle. Par exemple, avec le modèle Trendyol :
 ```python
-lm_eval --model vllm --model_args pretrained=Orbina/Orbita-v0.1 --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2  --output /workspace/Orbina/Orbita-v0.1
 ```
 ## Remarques :
 - J'utilise actuellement "vllm", qui pourrait différer légèrement par rapport à LM Evaluation Harness.
@@ -56,7 +56,7 @@ Les tâches et les paramètres de few-shot sont :
 - BBH : 3-shot, *Big-Bench-Hard* (`acc_norm`)
 - IFEval : 0-shot, *Instruction Following Evaluation* (inst_level_strict_acc,none et prompt_level_strict_acc,none)
 - GPQA : 0-shot, *Generalized Purpose Question Answering* (`acc_norm`)
-- MMLU : 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
 - MuSR : 5-shot, *MuSR* (`acc_norm`)
 - GSM8k : 5-shot, *gsm8k* (`acc`)
 """
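
Review note: the removed MMLU line enumerates the `hendrycksTest-*` subtasks and describes the score as the "average of all the results `acc`". A minimal sketch of that aggregation follows, assuming an unweighted mean over per-subtask accuracies. The result-dictionary shape, the `average_subtask_metric` helper, and the toy scores are hypothetical and are not taken from the leaderboard code; the real harness may key metrics differently (for example `acc,none`) or weight subtasks by size.

```python
from statistics import mean


def average_subtask_metric(results, prefix="hendrycksTest-", metric="acc"):
    """Unweighted mean of `metric` over every task whose name starts with `prefix`.

    `results` is assumed to map task name -> {metric name: value}.
    """
    scores = [
        metrics[metric]
        for task, metrics in results.items()
        if task.startswith(prefix) and metric in metrics
    ]
    if not scores:
        raise ValueError(f"no tasks matching {prefix!r} report {metric!r}")
    return mean(scores)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    toy_results = {
        "hendrycksTest-anatomy": {"acc": 0.5},
        "hendrycksTest-astronomy": {"acc": 0.25},
        "gsm8k": {"acc": 1.0},  # ignored: name does not match the prefix
    }
    print(average_subtask_metric(toy_results))
```

Passing the prefix and metric name as parameters keeps the sketch usable if the adapted harness names its French MMLU subtasks or metric keys differently.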

@@ -44,10 +44,10 @@ LLM_BENCHMARKS_TEXT = f"""
 ## Reproductibilité
 Nous utilisons une version adaptée de LM Evaluation Harness [github](https://github.com/EleutherAI/lm-evaluation-harness) pour garantir que les résultats de notre classement sont à la fois fiables et reproductibles.
 ## Comment reproduire les résultats :
+1) Configurer le dépôt : Clonez le "lm-evaluation-harness-multilingual" depuis [lm-evaluation-harness-multilingual](https://github.com/mohamedalhajjar/lm-evaluation-harness-multilingual) et suivez les instructions d'installation.
 2) Effectuer les évaluations : Pour obtenir les mêmes résultats que ceux du classement (certains tests peuvent montrer de petites variations), utilisez la commande suivante, en l'ajustant à votre modèle. Par exemple, avec le modèle Trendyol :
 ```python
+lm_eval --model vllm --model_args="pretrained=OpenLLM-France/Claire-7B-FR-Instruct-0.1,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=4" --tasks=leaderboard-fr --batch_size=auto
 ```
 ## Remarques :
 - J'utilise actuellement "vllm", qui pourrait différer légèrement par rapport à LM Evaluation Harness.
@@ -56,7 +56,7 @@ Les tâches et les paramètres de few-shot sont :
 - BBH : 3-shot, *Big-Bench-Hard* (`acc_norm`)
 - IFEval : 0-shot, *Instruction Following Evaluation* (inst_level_strict_acc,none et prompt_level_strict_acc,none)
 - GPQA : 0-shot, *Generalized Purpose Question Answering* (`acc_norm`)
+- MMLU : 5-shot, (average of all the results `acc`)
 - MuSR : 5-shot, *MuSR* (`acc_norm`)
 - GSM8k : 5-shot, *gsm8k* (`acc`)
 """
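
Review note: the updated command packs several vLLM settings into a single comma-separated `--model_args` string, which is easy to mistype when swapping in another model. The sketch below assembles the same argument list programmatically. The flag names and default values are copied from the added line in this diff; the `build_lm_eval_command` helper is hypothetical, and the `leaderboard-fr` task group is assumed to be provided by the lm-evaluation-harness-multilingual fork referenced in step 1.

```python
import shlex


def build_lm_eval_command(
    model_id,
    tasks="leaderboard-fr",
    tensor_parallel_size=1,
    dtype="auto",
    gpu_memory_utilization=0.8,
    data_parallel_size=4,
    batch_size="auto",
):
    """Return the lm_eval invocation from the diff as an argument list."""
    # Same key order as the command added in this commit.
    model_args = ",".join(
        [
            f"pretrained={model_id}",
            f"tensor_parallel_size={tensor_parallel_size}",
            f"dtype={dtype}",
            f"gpu_memory_utilization={gpu_memory_utilization}",
            f"data_parallel_size={data_parallel_size}",
        ]
    )
    return [
        "lm_eval",
        "--model",
        "vllm",
        f"--model_args={model_args}",
        f"--tasks={tasks}",
        f"--batch_size={batch_size}",
    ]


if __name__ == "__main__":
    cmd = build_lm_eval_command("OpenLLM-France/Claire-7B-FR-Instruct-0.1")
    # Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string means the command can be handed to `subprocess.run` without shell quoting concerns; adjust `data_parallel_size` to the number of GPUs actually available.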