from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Init: to update with your specific keys
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("SAS", "weighted_accuracy", "SAS")
    task1 = Task("SPP", "weighted_accuracy", "SPP")
    task2 = Task("EDP", "weighted_accuracy", "EDP")
    task3 = Task("TSP_D", "weighted_accuracy", "TSP_D")
    task4 = Task("GCP_D", "weighted_accuracy", "GCP_D")
    task5 = Task("KSP", "weighted_accuracy", "KSP")
    task6 = Task("TSP", "weighted_accuracy", "TSP")
    task7 = Task("GCP", "weighted_accuracy", "GCP")
    task8 = Task("MSP", "weighted_accuracy", "MSP")
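

# Illustrative sketch (not part of the original file): one way the Tasks enum
# could be used to pull per-task scores out of a result JSON. The assumed layout,
# results[benchmark][metric], follows the comment above but is an assumption, not
# necessarily the leaderboard's actual schema.
def extract_task_scores(result_json: dict) -> dict:
    """Map each task's display column name to its metric value, when present."""
    scores = {}
    for task in Tasks:
        task_result = result_json.get("results", {}).get(task.value.benchmark, {})
        if task.value.metric in task_result:
            scores[task.value.col_name] = task_result[task.value.metric]
    return scores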

# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">NPHardEval leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
<div align="center">
  <img
    src="https://raw.githubusercontent.com/casmlab/NPHardEval/main/figure/NPHardEval_text_right.png"
    style="width: 80%;"
    alt="Selected problems and the Euler diagram of computational complexity classes"
  >
</div>

NPHardEval serves as a comprehensive benchmark for assessing the reasoning abilities of large language models (LLMs) through the lens of computational complexity classes.
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
The paramount importance of complex reasoning in Large Language Models (LLMs) is well-recognized,
especially in their application to intricate decision-making tasks. This underscores the necessity
of thoroughly investigating LLMs' reasoning capabilities. To this end, various benchmarks have been
developed to evaluate these capabilities. However, existing benchmarks fall short in providing a
comprehensive assessment of LLMs' potential in reasoning. Additionally, there is a risk of overfitting,
as these benchmarks are static and publicly accessible, allowing models to tailor responses to specific
metrics, thus artificially boosting their performance.
In response, our research introduces 'NPHardEval,' a novel benchmark meticulously designed to
comprehensively evaluate LLMs' reasoning abilities. It comprises a diverse array of 900 algorithmic
questions, spanning the spectrum up to NP-Hard complexity. These questions are strategically selected
to cover a vast range of complexities, ensuring a thorough evaluation of LLMs' reasoning power. The
benchmark not only offers insights into the current state of reasoning in LLMs but also provides
a standard for comparing their performance across various complexity classes.
[Our repository](https://github.com/casmlab/NPHardEval) contains datasets, data generation scripts, and experimental procedures designed to evaluate LLMs in various reasoning tasks.
In particular, we use three complexity classes to define task complexity in the benchmark: P (polynomial time), NP-complete (nondeterministic polynomial-time complete),
and NP-hard, which increase in both intrinsic difficulty and the resources needed to solve them. The nine selected problems are:
1) P problems: Sorted Array Search (SAS), Edit Distance Problem (EDP), Shortest Path Problem (SPP);
2) NP-complete problems: Traveling Salesman Problem Decision Version (TSP-D), Graph Coloring Problem Decision Version (GCP-D), and Knapsack Problem (KSP);
3) NP-hard problems: Traveling Salesman Problem Optimization Version (TSP), Graph Coloring Problem Optimization Version (GCP), and Meeting Scheduling Problem (MSP).

The following figure shows their relation regarding computational complexity in an Euler diagram.
<div align="center">
  <img
    src="https://raw.githubusercontent.com/casmlab/NPHardEval/main/figure/NP-hard.jpg"
    style="width: 50%;"
    alt="Selected problems and the Euler diagram of computational complexity classes"
  >
</div>
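
For quick reference, the same grouping can be written down in code. The sketch below is illustrative only (it is not code from the NPHardEval repository) and assumes a model's per-task weighted accuracies are available as simple (task name, score) pairs:

```python
# Illustrative grouping of the nine tasks by complexity class, mirroring the list above.
P_TASKS = ["SAS", "EDP", "SPP"]
NP_COMPLETE_TASKS = ["TSP_D", "GCP_D", "KSP"]
NP_HARD_TASKS = ["TSP", "GCP", "MSP"]

def class_average(per_task_scores, task_names):
    # per_task_scores: list of (task_name, weighted_accuracy) pairs
    values = [score for name, score in per_task_scores if name in task_names]
    return sum(values) / len(values)
```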
Our benchmark offers several advantages compared with current benchmarks:
- Data construction grounded in the established computational complexity hierarchy
- Automatic checking mechanisms (see the checking sketch below)
- Automatic generation of datapoints
- Complete focus on reasoning, excluding numerical computation
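
To give a concrete flavor of the automatic checking mechanisms, the sketch below verifies a proposed graph coloring answer. It is a minimal illustration written for this description, not the checker implemented in the repository:

```python
# A coloring is a valid answer if it uses at most k colors and no edge
# connects two vertices of the same color.
def check_gcp_answer(edges, coloring, k):
    # edges: list of (u, v) pairs; coloring: list where coloring[v] is the color of vertex v
    uses_at_most_k_colors = len(set(coloring)) <= k
    no_conflicting_edge = all(coloring[u] != coloring[v] for u, v in edges)
    return uses_at_most_k_colors and no_conflicting_edge
```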

Our study marks a significant contribution to understanding LLMs' current reasoning capabilities
and paves the way for future enhancements. Furthermore, NPHardEval features a dynamic update mechanism,
refreshing data points monthly. This approach is crucial in reducing the risk of model overfitting,
leading to a more accurate and dependable evaluation of LLMs' reasoning skills. The benchmark dataset
and the associated code are accessible at the [NPHardEval GitHub Repository](https://github.com/casmlab/NPHardEval).

## Quick Start

### Environment setup

```bash
conda create --name llm_reason python=3.10
conda activate llm_reason
git clone https://github.com/casmlab/NPHardEval.git
cd NPHardEval  # enter the cloned repository before installing dependencies
pip install -r requirements.txt
```

### Set-up API keys

Please set up your API keys in `secrets.txt`. **Please don't directly upload your keys to any public repository.**

### Example Commands

Let's use the GPT-4 Turbo model (gpt-4-1106-preview) and the GCP task as an example.

For the zeroshot experiment, you can use:
```bash
cd run
cd run_close_zeroshot
python run_hard_GCP.py gpt-4-1106-preview
```

For the fewshot experiment, you can use:
```bash
cd run
cd run_close_fewshot
python run_hard_GCP.py gpt-4-1106-preview self
```
""" | |
EVALUATION_QUEUE_TEXT = """ | |
Currently, we don't support the submission of new evaluations. | |
""" | |
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results."
CITATION_BUTTON_TEXT = r"""
@misc{fan2023nphardeval,
      title={NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes},
      author={Lizhou Fan and Wenyue Hua and Lingyao Li and Haoyang Ling and Yongfeng Zhang and Libby Hemphill},
      year={2023},
      eprint={2312.14890},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
"""
COMMENT_BUTTON_TEXT = """
We refer to the following sources for model parameter counts. Contact us if you find any issues.
- https://grabon.com/blog/claude-users-statistics/
- https://medium.com/@seanbetts/peering-inside-gpt-4-understanding-its-mixture-of-experts-moe-architecture-2a42eb8bdcb3
- https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html
"""