from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str  # Task name as registered in Inspect Evals
    metric: str     # Metric reported on the leaderboard
    col_name: str   # Column header shown on the leaderboard
    type: str       # "base" (single-turn) or "agentic"
    source: str     # Link to the benchmark implementation


class Tasks(Enum):
    # Base (single-turn) benchmarks
    task0 = Task("arc_easy", "accuracy", "ARC-Easy", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/arc")
    task1 = Task("arc_challenge", "accuracy", "ARC-Challenge", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/arc")
    task2 = Task("drop", "mean", "DROP", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/drop")
    task3 = Task("winogrande", "accuracy", "WinoGrande", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/winogrande")
    task4 = Task("gsm8k", "accuracy", "GSM8K", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/gsm8k")
    task5 = Task("hellaswag", "accuracy", "HellaSwag", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/hellaswag")
    task6 = Task("humaneval", "mean", "HumanEval", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/humaneval")
    task7 = Task("ifeval", "final_acc", "IFEval", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/ifeval")
    task8 = Task("math", "accuracy", "MATH", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/mathematics")
    task9 = Task("mmlu", "accuracy", "MMLU", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/mmlu")
    task10 = Task("mmlu_pro", "accuracy", "MMLU-Pro", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/mmlu_pro")
    task11 = Task("gpqa_diamond", "accuracy", "GPQA-Diamond", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/gpqa")
    task12 = Task("mmmu_multiple_choice", "accuracy", "MMMU-Multiple-Choice", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/mmmu")
    task13 = Task("mmmu_open", "accuracy", "MMMU-Open-Ended", "base", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/mmmu")

    # Agentic benchmarks
    task14 = Task("gaia", "accuracy", "GAIA", "agentic", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/gaia")
    task15 = Task("gdm_intercode_ctf", "accuracy", "GDM-InterCode-CTF", "agentic", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/gdm_capabilities/intercode_ctf")
    task16 = Task("gdm_in_house_ctf", "accuracy", "GDM-In-House-CTF", "agentic", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/gdm_capabilities/in_house_ctf")
    task17 = Task("agentharm", "avg_score", "AgentHarm", "agentic", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/agentharm")
    task18 = Task("agentharm_benign", "avg_score", "AgentHarm-Benign", "agentic", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/agentharm")
    task19 = Task("swe_bench", "mean", "SWE-Bench", "agentic", "https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/swe_bench")


NUM_FEWSHOT = 0  # Number of few-shot examples used in evaluation (0 = zero-shot)


TITLE = """<h1 align="center" id="space-title">State of Evaluation Leaderboard</h1>"""

# Markdown links to each benchmark, grouped by task type
SINGLE_TURN_TASK_NAMES = ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "base"])
AGENTIC_TASK_NAMES = ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "agentic"])


INTRODUCTION_TEXT = f"""
Powered by **Inspect** and **Inspect Evals**, the **Vector State of Evaluation Leaderboard** presents an objective evaluation of leading frontier models across a comprehensive suite of benchmarks. Go beyond the summary metrics: click through to interactive reporting for each model and benchmark to explore sample-level performance and detailed traces."""


ABOUT_TEXT = f"""
## Vector Institute

The **Vector Institute** is dedicated to advancing the fields of artificial intelligence and machine learning through cutting-edge research, collaborative projects, and open-source contributions. Our mission is to drive excellence and innovation in AI, fostering a vibrant community of researchers, developers, and industry partners.

## 🎯 Benchmarks

This leaderboard showcases performance across a comprehensive suite of benchmarks designed to rigorously evaluate different aspects of AI model capabilities. The benchmarks we use are described below.

### Inspect Evals

This leaderboard leverages [Inspect Evals](https://ukgovernmentbeis.github.io/inspect_evals/) to power evaluation. Inspect Evals is an open-source repository built on the Inspect AI framework and developed in collaboration between the Vector Institute, Arcadia Impact, and the UK AI Safety Institute. It provides a comprehensive suite of high-quality benchmarks spanning diverse domains such as coding, mathematics, cybersecurity, reasoning, and general knowledge.
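
As a rough sketch of how such an evaluation is launched (assuming the Inspect AI and Inspect Evals packages are installed; the model identifier below is only a placeholder), a single benchmark can be run locally with the Inspect AI Python API:

```python
# Illustrative sketch, not the exact leaderboard pipeline:
# run the GSM8K task from Inspect Evals against a placeholder model.
from inspect_ai import eval
from inspect_evals.gsm8k import gsm8k

logs = eval(gsm8k(), model="openai/gpt-4o")  # substitute any provider/model you can access
```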

#### Transparent and Detailed Insights

All evaluations presented on this leaderboard are run using Inspect Evals. To facilitate in-depth analysis and promote transparency, we provide [Inspect Logs](https://inspect.ai-safety-institute.org.uk/log-viewer.html) for every benchmark run. These logs offer sample- and trace-level reporting, allowing the community to explore the granular details of model performance.
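
For instance, a downloaded log file can also be explored programmatically with the Inspect AI log API (a minimal sketch; the file path is a placeholder, and the exact fields may vary by Inspect version):

```python
# Minimal sketch: load an Inspect log and print its headline results.
from inspect_ai.log import read_eval_log

log = read_eval_log("path/to/benchmark-run.eval")  # placeholder path
print(log.eval.task)
print(log.results)
```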

### ⚙️ Base Benchmarks

These benchmarks assess the fundamental reasoning and knowledge capabilities of models.

<div class="benchmark-table-container">

| Benchmark | Description | Domain |
|-----------|-------------|--------|
| **ARC-Easy** / **ARC-Challenge** | Multiple-choice science questions measuring scientific & commonsense reasoning. | Science QA |
| **DROP** | Reading comprehension benchmark emphasizing discrete reasoning steps. | Reading comprehension |
| **WinoGrande** | Commonsense reasoning challenge focused on co-reference resolution. | Commonsense reasoning |
| **GSM8K** | Grade-school math word problems testing arithmetic & multi-step reasoning. | Mathematics |
| **HellaSwag** | Commonsense inference task centered on action completion. | Commonsense reasoning |
| **HumanEval** | Evaluates code generation and reasoning in a programming context. | Coding |
| **IFEval** | Tests how reliably models follow verifiable natural-language instructions. | Instruction following |
| **MATH** | High-school-level math questions requiring detailed solutions. | Mathematics |
| **MMLU** / **MMLU-Pro** | Multi-subject multiple-choice tests of advanced knowledge. | General knowledge |
| **GPQA-Diamond** | Graduate-level question answering assessing deeper reasoning & knowledge linking. | Science QA |
| **MMMU** (Multiple-Choice / Open-Ended) | Multi-discipline, multimodal tasks testing structured & open responses. | Multimodal understanding |

</div>

### 🚀 Agentic Benchmarks

These benchmarks go beyond basic reasoning and evaluate more advanced, autonomous, or "agentic" capabilities of models, such as planning, tool use, and multi-turn interaction.

<div class="benchmark-table-container">

| Benchmark | Description | Key Skills |
|-----------|-------------|------------|
| **GAIA** | Evaluates autonomous reasoning, planning, problem-solving, & multi-turn interactions. | Planning & tool use |
| **GDM-InterCode-CTF** | Capture-the-flag challenge focused on code interpretation & debugging. | Cybersecurity & code execution |
| **GDM-In-House-CTF** | Capture-the-flag challenge testing web application security skills. | Web security |
| **AgentHarm** / **AgentHarm-Benign** | Measures harmfulness of LLM agents (with a benign-behavior baseline). | Safety & refusal behavior |
| **SWE-Bench** | Tests an AI agent's ability to solve software engineering tasks. | Software engineering |

</div>
"""


REPRODUCIBILITY_TEXT = """
## Reproduce and Extend the Leaderboard

### 1) Make sure you can load your model and tokenizer using AutoClasses:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```

If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

Note: make sure your model is public!

Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it. Stay posted!

### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)

Safetensors is a newer format for storing weights that is safer and faster to load and use. Converting will also allow us to add the number of parameters of your model to the `Extended Viewer`!
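
For example, one way to convert an existing checkpoint is to reload it with `transformers` and save it again with safetensors serialization enabled (a minimal sketch; the model name and output directory below are placeholders):

```python
# Minimal sketch: re-save an existing checkpoint in safetensors format.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("your-model-name")          # placeholder model id
tokenizer = AutoTokenizer.from_pretrained("your-model-name")

model.save_pretrained("converted-model", safe_serialization=True)  # writes .safetensors weight files
tokenizer.save_pretrained("converted-model")
```

The converted folder can then be uploaded to the Hub (for example with `push_to_hub`) in place of the original pickle-based weights.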

### 3) Make sure your model has an open license!

This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

### 4) Fill out your model card

When we add extra information about models to the leaderboard, it will be taken automatically from the model card.

## In case of model failure

If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the steps above first.
If everything is done, check that you can launch the EleutherAI LM Evaluation Harness on your model locally (you can add `--limit` to limit the number of examples per task).
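
As a rough sketch of such a local check (assuming the EleutherAI harness is installed as the `lm_eval` package; the model name and task below are placeholders, and the exact API may differ between harness versions):

```python
# Rough sketch: run a small local sanity check with the EleutherAI harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-model-name",  # placeholder model id
    tasks=["arc_easy"],
    limit=10,  # only evaluate 10 examples per task
)
print(results["results"])
```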
"""