MMLU-Pro / utils.py

MrLight

Update utils.py

8d679bb verified about 1 year ago


	import pandas as pd
	import gradio as gr
	import csv
	import json
	import os
	import shutil
	from huggingface_hub import Repository

	HF_TOKEN = os.environ.get("HF_TOKEN")

	SUBJECTS = ["Biology", "Business", "Chemistry", "Computer Science", "Economics", "Engineering",
	"Health", "History", "Law", "Math", "Philosophy", "Physics", "Psychology", "Other"]

	MODEL_INFO = [
	"Models",
	"Overall",
	"Biology", "Business", "Chemistry", "Computer Science", "Economics", "Engineering",
	"Health", "History", "Law", "Math", "Philosophy", "Physics", "Psychology", "Other"]

	DATA_TITLE_TYPE = ['markdown', 'number', 'number', 'number', 'number', 'number', 'number',
	'number', 'number', 'number', 'number', 'number', 'number', 'number',
	'number', 'number']

	SUBMISSION_NAME = "mmlu_pro_leaderboard_submission"
	# Build the URL by concatenation; os.path.join is meant for filesystem paths, not URLs.
	SUBMISSION_URL = "https://huggingface.co/datasets/TIGER-Lab/" + SUBMISSION_NAME
	CSV_DIR = "./mmlu_pro_leaderboard_submission/results.csv"

	COLUMN_NAMES = MODEL_INFO

	LEADERBOARD_INTRODUCTION = """# MMLU-Pro Leaderboard

	Welcome to the MMLU-Pro leaderboard, showcasing the performance of various advanced language models on the MMLU-Pro dataset. The MMLU-Pro dataset is an enhanced version of the original MMLU, specifically engineered to offer a more rigorous and realistic evaluation environment.

	The MMLU-Pro dataset consists of approximately 12,000 intricate questions that challenge the comprehension and reasoning abilities of LLMs. Below you can find the accuracies of different models tested on this dataset.

	## 1. What's new about MMLU-Pro

	Compared to the original MMLU, there are three major differences:

	- The original MMLU dataset contains only 4 options per question; MMLU-Pro increases this to 10. More options make the evaluation more realistic and challenging, and random guessing yields a much lower score.
	- The original MMLU dataset contains mostly knowledge-driven questions without requiring much reasoning. Therefore, PPL results are normally better than CoT. In our dataset, we increase the problem difficulty and integrate more reasoning-focused problems. In MMLU-Pro, CoT can be 20% higher than PPL.
	- With more options, we found that model performance becomes more robust. For example, the performance of Llama-2-7B on MMLU-Pro varies by less than 1% across several different prompts. In contrast, its performance on the original MMLU can vary by as much as 4-5%.

	## 2. Dataset Summary

	- Questions and Options: Each question within the dataset typically has ten multiple-choice options, except for some that were reduced during the manual review process to remove unreasonable choices. This increase from the original four options per question is designed to enhance complexity and robustness, necessitating deeper reasoning to discern the correct answer among a larger pool of potential distractors.

	- Sources: The dataset consolidates questions from several sources:
	  - Original MMLU Questions: Part of the dataset comes from the original MMLU dataset. We removed the trivial and ambiguous questions.
	  - STEM Website: Hand-picked, high-quality STEM problems from the Internet.
	  - TheoremQA: High-quality human-annotated questions requiring theorems to solve.
	  - Scibench: Science questions from college exams.

	For detailed information about the dataset, visit our page on Hugging Face: MMLU-Pro at Hugging Face. If you are interested in replicating these results or wish to evaluate your models using our dataset, access our evaluation scripts available on GitHub: TIGER-AI-Lab/MMLU-Pro.
	"""

	TABLE_INTRODUCTION = """
	"""

	LEADERBOARD_INFO = """
	We list the information of the used datasets as follows:<br>

	"""

	CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
	CITATION_BUTTON_TEXT = r""""""

	SUBMIT_INTRODUCTION = """# Submit on MMLU-Pro Leaderboard Introduction

	## ⚠ Please note that you need to submit a JSON file in the following format:

	```json
	{
	"Model": "[MODEL_NAME]",
	"Overall": 0.5678,
	"Biology": 0.1234,
	"Business": 0.4567,
	...,
	"Other": 0.3456
	}
	```
	After submitting, you can click the "Refresh" button to see the updated leaderboard (it may take a few seconds).

	"""


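	def get_df():
	    # Clone (or reuse) the submission dataset repo and pull the latest results.
	    repo = Repository(local_dir=SUBMISSION_NAME, clone_from=SUBMISSION_URL, use_auth_token=HF_TOKEN)
	    repo.git_pull()
	    df = pd.read_csv(CSV_DIR)
	    # Rank models by overall accuracy, best first.
	    df = df.sort_values(by=['Overall'], ascending=False)
	    return df[COLUMN_NAMES]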


	def add_new_eval(
	    input_file,
	):
	    if input_file is None:
	        return "Error! Empty file!"

	    # Parse the uploaded JSON and build a CSV row: model name, overall score, per-subject scores.
	    upload_data = json.loads(input_file)
	    print("upload_data:\n", upload_data)
	    data_row = [f'{upload_data["Model"]}', upload_data['Overall']]
	    for subject in SUBJECTS:
	        data_row += [upload_data[subject]]
	    print("data_row:\n", data_row)

	    # Sync the local clone of the submission repo before reading or writing the CSV.
	    submission_repo = Repository(local_dir=SUBMISSION_NAME, clone_from=SUBMISSION_URL,
	                                 use_auth_token=HF_TOKEN, repo_type="dataset")
	    submission_repo.git_pull()

	    # Collect the first column (model names) of every existing row to detect duplicates.
	    already_submitted = []
	    with open(CSV_DIR, mode='r') as file:
	        reader = csv.reader(file, delimiter=',')
	        for row in reader:
	            already_submitted.append(row[0])

	    if data_row[0] not in already_submitted:
	        with open(CSV_DIR, mode='a', newline='') as file:
	            writer = csv.writer(file)
	            writer.writerow(data_row)

	        submission_repo.push_to_hub()
	        print('Submission Successful')
	    else:
	        print('The entry already exists')


	def refresh_data():
	    return get_df()
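
Note on input validation: `add_new_eval` indexes the uploaded JSON directly, so a submission with a missing subject key raises a `KeyError`. The snippet below is a minimal sketch of one way to check a submission before building the CSV row. The helper name `validate_submission` is hypothetical and not part of this file, and the rule that every score lies between 0 and 1 is an assumption inferred from the example values in `SUBMIT_INTRODUCTION`.

```python
import json

# Subject keys copied from SUBJECTS in utils.py.
SUBJECTS = ["Biology", "Business", "Chemistry", "Computer Science", "Economics", "Engineering",
            "Health", "History", "Law", "Math", "Philosophy", "Physics", "Psychology", "Other"]


def validate_submission(raw):
    """Hypothetical helper: return (data_row, None) on success or (None, error_message)."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError) as exc:  # json.JSONDecodeError subclasses ValueError
        return None, f"Invalid JSON: {exc}"
    if not isinstance(data, dict):
        return None, "Submission must be a JSON object"

    required = ["Model", "Overall"] + SUBJECTS
    missing = [key for key in required if key not in data]
    if missing:
        return None, "Missing keys: " + ", ".join(missing)

    # Assumption: scores are fractions in [0, 1], as in the documented example.
    for key in required[1:]:
        value = data[key]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 1:
            return None, f"{key} must be a number between 0 and 1"

    # Same column order that add_new_eval writes: model, overall, then each subject.
    return [str(data["Model"])] + [data[key] for key in required[1:]], None
```

A caller could run this on the uploaded content first and return the error message to the user instead of letting the exception propagate.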