Spaces:

nvidia
/

llm-robustness-leaderboard

Running

App Files Files Community

llm-robustness-leaderboard / src /about.py

ekmb

txt change

ca00816 about 2 months ago

raw

history blame contribute delete

7.31 kB

	from dataclasses import dataclass
	from enum import Enum

	@dataclass
	class Task:
	benchmark: str
	metric: str
	col_name: str


	# Select your tasks here
	# ---------------------------------------------------
	class Tasks(Enum):
	# task_key in the json file, metric_key in the json file, name to display in the leaderboard
	# task0 = Task("anli_r1", "acc", "ANLI")
	# task1 = Task("logiqa", "acc_norm", "LogiQA")

	# task0 = Task('agieval', 'accuracy', 'AGIEval_acc')
	# task1 = Task('agieval', 'consistency', 'AGIEval_CR')
	# task2 = Task('mmlu_pro', 'accuracy', 'MMLU-Pro_acc')
	# task3 = Task('mmlu_pro', 'consistency', 'MMLU-Pro_CR')
	# task4 = Task('math', 'accuracy', 'Math_acc')
	# task5 = Task('math', 'consistency', 'Math_CR')

	task0 = Task('agieval-acc', 'accuracy', 'AGIEval Mean (Min, Max)')
	task1 = Task('agieval-cr', 'consistency', 'AGIEval CR')
	task2 = Task('mmlu_pro-acc', 'accuracy', 'MMLU-Pro Mean (Min, Max)')
	task3 = Task('mmlu_pro-cr', 'consistency', 'MMLU-Pro CR')
	task4 = Task('math-acc', 'accuracy', 'Math Mean (Min, Max)')
	task5 = Task('math-cr', 'consistency', 'Math CR')


	NUM_FEWSHOT = 0 # Change with your few shot
	# ---------------------------------------------------


	# Your leaderboard name
	TITLE = """<h1 align="center" id="space-title">SCORE Leaderboard</h1>"""

	# What does your leaderboard evaluate?
	INTRODUCTION_TEXT = """
	We introduce <b>SCORE</b> - an open and holistic evaluation framework for LLMs centered on robustness i.e. the ability to produce consistent responses when the input is rephrased
	or presented in a slightly different way. Prediction consistency is particularly crucial for factual questions where an objective answer exists. Note that it is expected
	that the predictions are equivalent and not necessarily correct. Models are evaluated multiple times in equivalent setups and accuracy range along with prediction
	consistency rate is reported. Contrary to a single accuracy metrics (often derived from an optimized setup) reported during model releases, this better simulates human
	interaction setups and provides better estimate of real world performance. Furthermore, models are evaluated using the same setup which makes model comparison possible.

	<h1 align="center" id="space-title">Tasks</h1>
	<b>Prompt Robustness</b> - Models are evaluated on ten different prompts. For multiple choice question (MCQ) datasets, prompts ask the model to choose the right option
	letter. For MATH, prompts ask the model to solve the problem. The prompt set is diverse enough to cover various content and formatting styles that the model may encounter
	in real life, they are not adversarial or tuned in any way. Prompts are semantically close, vary by instruction and level of response details. Prompts end with final
	answer formatting instructions. We include both CoT and non-CoT prompts and vary the placement of the question in the prompt to be either in the beginning, in the middle,
	or at the end of the prompt.

	<b>Non Greedy Inference</b> - We study the effect of random seed during non-greedy inference. For factual questions the model's underlying distribution should be sharp enough
	to be independent of the random seed for the next token sampling. There is an inherent randomness in the answer generation process, which may affect the "path" model takes to arrive at an answer.

	<b>Choice Order Robustness</b> - We test models against changes in the order of choices for MCQ datasets. We swap the order of choices and ensure the correct answer
	is always the same option (all correct answers are A or B, etc). Changing the order of choices does not change the input's semantics, and it is expected that the models
	will be robust against such minimal change.

	<h1 align="center" id="space-title">Datasets</h1>
	<b>MMLU Pro</b> - Massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. <br>
	<b>AGIEval</b> - Dataset specifically designed to assess foundation model in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. <br>
	<b>MATH</b> - Challenging competition mathematics problems <br>

	<h1 align="center" id="space-title">Metrics</h1>
	<b>Accuracy</b> - We report macro accuracy for MMLU Pro and micro accuracy for AGIEval and MATH.
	For all datasets, average (minimum, maximum) accuracy across all experiments is reported<br>
	<b>Consistency Rate</b> - We use the consistency rate (CR) to measure the stability of model predictions.
	CR calculates the proportion of consistent prediction pairs for each data point.
	"""

	# Which evaluations are you running? how can people reproduce what you have?
	LLM_BENCHMARKS_TEXT = f"""
	## How to Evaluate on SCORE?

	To evaluate your model on the SCORE benchmark, you can use [LM-EVALUATION-HARNESS](https://github.com/EleutherAI/lm-evaluation-harness).
	The tasks are available under the following groups:
	* score_robustness_mmlu_pro
	* score_robustness_agieval
	* score_robustness_math

	Th numbers in the leaderboard are the average for across tasks for each dataset.

	More details could be found in the [README of the Score task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/score) as well as the [official repository](https://github.com/EleutherAI/lm-evaluation-harness/tree/main)
	"""

	EVALUATION_QUEUE_TEXT = """
	## Some good practices before submitting a model

	### 1) Make sure you can load your model and tokenizer using AutoClasses:
	```python
	from transformers import AutoConfig, AutoModel, AutoTokenizer
	config = AutoConfig.from_pretrained("your model name", revision=revision)
	model = AutoModel.from_pretrained("your model name", revision=revision)
	tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
	```
	If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

	Note: make sure your model is public!
	Note: if your model needs `use_remote_code=True`, we do not support this option yet but we are working on adding it, stay posted!

	### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
	It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!

	### 3) Make sure your model has an open license!
	This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

	### 4) Fill up your model card
	When we add extra information about models to the leaderboard, it will be automatically taken from the model card

	## In case of model failure
	If your model is displayed in the `FAILED` category, its execution stopped.
	Make sure you have followed the above steps first.
	If everything is done, check you can launch the EleutherAIHarness on your model locally, using the above command without modifications (you can add `--limit` to limit the number of examples per task).
	"""

	CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
	CITATION_BUTTON_TEXT = r"""
	"""