from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Init: to update with your specific keys
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("agree_cs", "accuracy", "agree_cs")
    task1 = Task("anli_cs", "accuracy", "anli_cs")
    task2 = Task("arc_challenge_cs", "accuracy", "arc_challenge_cs")
    task3 = Task("arc_easy_cs", "accuracy", "arc_easy_cs")
    task4 = Task("belebele_cs", "accuracy", "belebele_cs")
    task5 = Task("ctkfacts_cs", "accuracy", "ctkfacts_cs")
    task6 = Task("czechnews_cs", "accuracy", "czechnews_cs")
    task7 = Task("fb_comments_cs", "accuracy", "fb_comments_cs")
    task8 = Task("gsm8k_cs", "accuracy", "gsm8k_cs")
    task9 = Task("klokanek_cs", "accuracy", "klokanek_cs")
    task10 = Task("mall_reviews_cs", "accuracy", "mall_reviews_cs")
    task11 = Task("mmlu_cs", "accuracy", "mmlu_cs")
    task12 = Task("sqad_cs", "accuracy", "sqad_cs")
    task13 = Task("subjectivity_cs", "accuracy", "subjectivity_cs")
    task14 = Task("truthfulqa_cs", "accuracy", "truthfulqa_cs")
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">🇨🇿 CzechBench Leaderboard</h1>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
The goal of the CzechBench project is to provide a comprehensive and practical benchmark for evaluating Czech language models.
Our [evaluation suite](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench#readme)
currently consists of 15 individual tasks, leveraging pre-existing Czech datasets together with new machine translations of popular LLM benchmarks,
including ARC, GSM8K, MMLU, and TruthfulQA.

Key Features and Benefits:

- **Tailored for the Czech Language:** The benchmark includes both original Czech datasets and adapted versions of international datasets, ensuring relevant evaluation of model performance in the Czech context.
- **Wide Range of Tasks:** It contains 15 different tasks covering various aspects of language understanding and text generation, enabling a comprehensive assessment of a model's capabilities.
- **Universal Model Support:** The universal text-to-text evaluation approach adopted in CzechBench allows for direct comparison of models with varying levels of internal access, including commercial APIs.
- **Ease of Use:** The benchmark is designed to be easily integrated into your development process, saving time and resources during model testing and improvement.
- **Up-to-Date and Relevant:** We regularly update our datasets to reflect the latest findings and trends in language model development.

By using CzechBench, you will gain deep insights into the strengths and weaknesses of your models, allowing you to focus on the key areas for optimization.
This will not only improve the performance of your models but also better prepare them for real-world deployment in various Czech-language contexts.

Below you can find the up-to-date leaderboard of models evaluated on CzechBench.
For more information on the included benchmarks and instructions on evaluating your own models, please visit the "About" section below.
"""
# Czech-Bench is developed by <a href="https://huggingface.co/CIIRC-NLP">CIIRC-NLP</a>.

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f""" | |
## Basic Information | |
The CzechBench evaluation suite is hosted on [GitHub](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench#readme). | |
It is implemented on top of the popular [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) framework, which provides extensive model compatibility and optimal evaluation efficiency. | |
All currently supported benchmarks are listed in the table below: | |
| Dataset | Language | Task type | Metrics | Samples | Task ID | | |
| ------------------------------------------------------------ | ----------------------------- | -------------------------- | -------------- | ------: | --------------- | | |
| [AGREE](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/agree_cs) | CS (Original) | Subject-verb agreement | Acc | 627 | agree_cs |
| [ANLI](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/anli_cs) | CS (Translated) | Natural Language Inference | Acc, Macro F1 | 1200 | anli_cs |
| [ARC Challenge](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/arc_cs) | CS (Translated) | Knowledge-Based QA | Acc | 1172 | arc_challenge_cs |
| [ARC Easy](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/arc_cs) | CS (Translated) | Knowledge-Based QA | Acc | 2376 | arc_easy_cs |
| [Belebele](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/belebele_cs) | CS (Professional translation) | Reading Comprehension / QA | Acc | 895 | belebele_cs |
| [CTKFacts](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/ctkfacts_cs) | CS (Original) | Natural Language Inference | Acc, Macro F1 | 558 | ctkfacts_cs |
| [Czech News](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/czechnews_cs) | CS (Original) | News Topic Classification | Acc, Macro F1 | 1000 | czechnews_cs |
| [Facebook Comments](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/fb_comments_cs) | CS (Original) | Sentiment Analysis | Acc, Macro F1 | 1000 | fb_comments_cs |
| [GSM8K](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/gsm8k_cs) | CS (Translated) | Mathematical inference | EM Acc | 1319 | gsm8k_cs |
| [Klokánek](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/klokanek_cs) | CS (Original) | Math/Logical Inference | Acc | 808 | klokanek_cs |
| [Mall Reviews](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/mall_reviews_cs) | CS (Original) | Sentiment Analysis | Acc, Macro F1 | 3000 | mall_reviews_cs |
| [MMLU](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/mmlu_cs) | CS (Translated) | Knowledge-Based QA | Acc | 12408 | mmlu_cs |
| [SQAD](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/sqad_cs) | CS (Original) | Reading Comprehension / QA | EM Acc, BoW F1 | 843 | sqad_cs |
| [Subjectivity](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/subjectivity_cs) | CS (Original) | Subjectivity Analysis | Acc, Macro F1 | 2000 | subjectivity_cs |
| [TruthfulQA](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/truthfulqa_cs) | CS (Translated) | Knowledge-Based QA | Acc | 813 | truthfulqa_cs |
## Evaluation Process

### 1. Install CzechBench

```
git clone https://github.com/jirkoada/czechbench_eval_harness.git
cd czechbench_eval_harness
pip install -e ".[api]"
```
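
To verify that the editable install succeeded before launching any evaluations, a minimal check (illustrative only; it simply confirms that the `lm_eval` package installed above can be imported):

```
# Sanity check: confirm the lm_eval package resolves to the local checkout.
import lm_eval
print("lm_eval loaded from:", lm_eval.__file__)
```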

### 2. Run evaluation

* `export MODEL=your_model_name`, where `your_model_name` is the Hugging Face path of a public model, for example: `export MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct`
* `export OUTPUT_PATH=my_output_path`, where `my_output_path` is the directory for evaluation reports

Then run the following command (you can adjust parameters such as `batch_size` or `device`):

```
lm_eval --model hf \\
  --model_args pretrained=$MODEL \\
  --tasks czechbench_tasks \\
  --device cuda:0 \\
  --batch_size 1 \\
  --write_out \\
  --log_samples \\
  --output_path $OUTPUT_PATH \\
  --apply_chat_template
```

For advanced usage instructions, please inspect the [CzechBench README on GitHub](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench#readme)
or the official [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) documentation.
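
As one example of such advanced usage, the harness also exposes a programmatic entry point. The sketch below is an assumption based on the upstream `lm_eval.simple_evaluate` API (argument names can differ between harness versions) and simply mirrors the CLI call above:

```
# Hedged sketch of a programmatic equivalent of the CLI call above; simple_evaluate
# and its arguments follow the upstream LM Evaluation Harness documentation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct",
    tasks=["czechbench_tasks"],
    device="cuda:0",
    batch_size=1,
    apply_chat_template=True,
)
print(results["results"])
```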

### 3. Upload results to Leaderboard

Inside the `$OUTPUT_PATH` directory, you can find the file `results.json`.
To submit your evaluation results to our leaderboard, please visit the "Submit here!" section above and upload your `results.json` file.
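
If you want to sanity-check the report before uploading, a minimal sketch (assuming the per-task metrics are stored under the top-level "results" key of `results.json`, matching the task and metric keys used by the leaderboard configuration):

```
# Illustrative check of a single task's accuracy before submission.
import json

with open("results.json") as f:
    report = json.load(f)

print("agree_cs accuracy:", report["results"]["agree_cs"]["accuracy"])
```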
""" | |
EVALUATION_QUEUE_TEXT = """ | |
""" | |
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" | |
CITATION_BUTTON_TEXT = r""" | |
""" | |