|
from dataclasses import dataclass |
|
from enum import Enum |
|
|
|
@dataclass |
|
class Task: |
|
benchmark: str |
|
metric: str |
|
col_name: str |
|
|
|
|
|
|
|
|
|
class Tasks(Enum): |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
task0 = Task('agieval-acc', 'accuracy', 'AGIEval Mean (Min, Max)') |
|
task1 = Task('agieval-cr', 'consistency', 'AGIEval CR') |
|
task2 = Task('mmlu_pro-acc', 'accuracy', 'MMLU-Pro Mean (Min, Max)') |
|
task3 = Task('mmlu_pro-cr', 'consistency', 'MMLU-Pro CR') |
|
task4 = Task('math-acc', 'accuracy', 'Math Mean (Min, Max)') |
|
task5 = Task('math-cr', 'consistency', 'Math CR') |
|
|
|
|
|
NUM_FEWSHOT = 0 |
|
|
|
|
|
|
|
|
|
TITLE = """<h1 align="center" id="space-title">SCORE Leaderboard</h1>""" |
|
|
|
|
|
INTRODUCTION_TEXT = """ |
|
We introduce <b>SCORE</b> - an open and holistic evaluation framework for LLMs centered on robustness i.e. the ability to produce consistent responses when the input is rephrased |
|
or presented in a slightly different way. Prediction consistency is particularly crucial for factual questions where an objective answer exists. Note that it is expected |
|
that the predictions are equivalent and not necessarily correct. Models are evaluated multiple times in equivalent setups and accuracy range along with prediction |
|
consistency rate is reported. Contrary to a single accuracy metrics (often derived from an optimized setup) reported during model releases, this better simulates human |
|
interaction setups and provides better estimate of real world performance. Furthermore, models are evaluated using the same setup which makes model comparison possible. |
|
|
|
<h1 align="center" id="space-title">Tasks</h1> |
|
<b>Prompt Robustness</b> - Models are evaluated on ten different prompts. For multiple choice question (MCQ) datasets, prompts ask the model to choose the right option |
|
letter. For MATH, prompts ask the model to solve the problem. The prompt set is diverse enough to cover various content and formatting styles that the model may encounter |
|
in real life, they are not adversarial or tuned in any way. Prompts are semantically close, vary by instruction and level of response details. Prompts end with final |
|
answer formatting instructions. We include both CoT and non-CoT prompts and vary the placement of the question in the prompt to be either in the beginning, in the middle, |
|
or at the end of the prompt. |
|
|
|
<b>Non Greedy Inference</b> - We study the effect of random seed during non-greedy inference. For factual questions the model's underlying distribution should be sharp enough |
|
to be independent of the random seed for the next token sampling. There is an inherent randomness in the answer generation process, which may affect the "path" model takes to arrive at an answer. |
|
|
|
<b>Choice Order Robustness</b> - We test models against changes in the order of choices for MCQ datasets. We swap the order of choices and ensure the correct answer |
|
is always the same option (all correct answers are A or B, etc). Changing the order of choices does not change the input's semantics, and it is expected that the models |
|
will be robust against such minimal change. |
|
|
|
<h1 align="center" id="space-title">Datasets</h1> |
|
<b>MMLU Pro</b> - Massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. <br> |
|
<b>AGIEval</b> - Dataset specifically designed to assess foundation model in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. <br> |
|
<b>MATH</b> - Challenging competition mathematics problems <br> |
|
|
|
<h1 align="center" id="space-title">Metrics</h1> |
|
<b>Accuracy</b> - We report macro accuracy for MMLU Pro and micro accuracy for AGIEval and MATH. |
|
For all datasets, average (minimum, maximum) accuracy across all experiments is reported<br> |
|
<b>Consistency Rate</b> - We use the consistency rate (CR) to measure the stability of model predictions. |
|
CR calculates the proportion of consistent prediction pairs for each data point. |
|
""" |
|
|
|
|
|
LLM_BENCHMARKS_TEXT = f""" |
|
## How to Evaluate on SCORE? |
|
|
|
To evaluate your model on the SCORE benchmark, you can use [LM-EVALUATION-HARNESS](https://github.com/EleutherAI/lm-evaluation-harness). |
|
The tasks are available under the following groups: |
|
* score_robustness_mmlu_pro |
|
* score_robustness_agieval |
|
* score_robustness_math |
|
|
|
Th numbers in the leaderboard are the average for across tasks for each dataset. |
|
|
|
More details could be found in the [README of the Score task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/score) as well as the [official repository](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) |
|
""" |
|
|
|
EVALUATION_QUEUE_TEXT = """ |
|
## Some good practices before submitting a model |
|
|
|
### 1) Make sure you can load your model and tokenizer using AutoClasses: |
|
```python |
|
from transformers import AutoConfig, AutoModel, AutoTokenizer |
|
config = AutoConfig.from_pretrained("your model name", revision=revision) |
|
model = AutoModel.from_pretrained("your model name", revision=revision) |
|
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision) |
|
``` |
|
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded. |
|
|
|
Note: make sure your model is public! |
|
Note: if your model needs `use_remote_code=True`, we do not support this option yet but we are working on adding it, stay posted! |
|
|
|
### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index) |
|
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`! |
|
|
|
### 3) Make sure your model has an open license! |
|
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗 |
|
|
|
### 4) Fill up your model card |
|
When we add extra information about models to the leaderboard, it will be automatically taken from the model card |
|
|
|
## In case of model failure |
|
If your model is displayed in the `FAILED` category, its execution stopped. |
|
Make sure you have followed the above steps first. |
|
If everything is done, check you can launch the EleutherAIHarness on your model locally, using the above command without modifications (you can add `--limit` to limit the number of examples per task). |
|
""" |
|
|
|
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" |
|
CITATION_BUTTON_TEXT = r""" |
|
""" |
|
|