# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Evaluation Leaderboard</h1>"""

# SINGLE_TURN_TASK_NAMES = ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "base"])
# AGENTIC_TASK_NAMES = ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "agentic"])

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = f"""
Powered by **Inspect** and **Inspect Evals**, the **Vector Evaluation Leaderboard** presents an evaluation of leading frontier models across a comprehensive suite of benchmarks. Go beyond the summary metrics: click through to interactive reporting for each model and benchmark to explore sample-level performance and detailed traces."""

# Which evaluations are you running? How can people reproduce your results?
ABOUT_TEXT = f"""

## Vector Institute
The **Vector Institute** is dedicated to advancing the fields of artificial intelligence and machine learning through cutting-edge research and open-source contributions. Our mission is to drive excellence and innovation in AI, fostering a vibrant community of researchers, developers, and industry partners.

## 🎯 Benchmarks

This leaderboard showcases performance across a comprehensive suite of benchmarks, designed to rigorously evaluate different aspects of AI model capabilities. Let's explore the benchmarks we use:

### Inspect Evals

This leaderboard leverages [Inspect Evals](https://ukgovernmentbeis.github.io/inspect_evals/) to power evaluation. Inspect Evals is an open-source repository built upon the Inspect AI framework. Developed in collaboration between the Vector Institute, Arcadia Impact and the UK AI Safety Institute, Inspect Evals provides a comprehensive suite of high-quality benchmarks spanning diverse domains like coding, mathematics, cybersecurity, reasoning, and general knowledge.

#### Transparent and Detailed Insights

All evaluations presented on this leaderboard are run using Inspect Evals. To facilitate in-depth analysis and promote transparency, we provide [Inspect Logs](https://inspect.ai-safety-institute.org.uk/log-viewer.html) for every benchmark run. These logs offer sample-level and trace-level reporting, allowing the community to explore the granular details of model performance.

### ⚙️ Base Benchmarks

These benchmarks assess fundamental reasoning and knowledge capabilities of models.

<div class="benchmark-table-container">

| Benchmark           | Description                                                                      |
|--------------------|----------------------------------------------------------------------------------|
| **ARC-Easy** / **ARC-Challenge** | Multiple-choice science questions measuring scientific & commonsense reasoning. |
| **DROP**             | Reading comprehension benchmark emphasizing discrete reasoning steps.             |
| **WinoGrande**        | Commonsense reasoning challenge focused on co-reference resolution.             |
| **GSM8K**             | Grade-school math word problems testing arithmetic & multi-step reasoning.         |
| **HellaSwag**        | Commonsense inference task centered on action completion.                        |
| **HumanEval**         | Evaluates code generation and reasoning in a programming context.                |
| **IFEval**            | Instruction-following benchmark built on automatically verifiable instructions.  |
| **MATH**              | Competition-level math problems requiring multi-step solutions.                 |
| **MMLU** / **MMLU-Pro**| Multi-subject multiple-choice tests of advanced knowledge.                     |
| **GPQA-Diamond**      | Graduate-level multiple-choice science questions written by domain experts.      |
| **MMMU** (Multi-Choice / Open-Ended) | Multimodal, multi-discipline tasks testing multiple-choice & open-ended responses.   |
</div>

### 🚀 Agentic Benchmarks

These benchmarks go beyond basic reasoning and evaluate more advanced, autonomous, or "agentic" capabilities of models, such as planning and interaction.

<div class="benchmark-table-container">

| Benchmark              | Description                                                                 |
|-----------------------|----------------------------------------------------------------------------|
| **GAIA**                | Evaluates autonomous reasoning, planning, problem-solving, & multi-turn interactions. |
| **InterCode-CTF**       | Capture-the-flag challenge focused on code interpretation & debugging.       |
| [**GDM-In-House-CTF**](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/in_house_ctf/)    | Capture-the-flag challenge testing web application security skills.         |
| **AgentHarm** / **AgentHarm-Benign** | Measures harmfulness of LLM agents (and benign behavior baseline).   |
| **SWE-Bench**           | Tests AI agent ability to solve software engineering tasks.                 |
</div>
"""

REPRODUCIBILITY_TEXT = """
## 🛠️ Reproducibility
The [Vector State of Evaluation Leaderboard Repository](https://github.com/VectorInstitute/evaluation) contains the evaluation script needed to reproduce the results presented on the leaderboard.

### Install dependencies

1. Create a Python virtual environment with ```python>=3.10``` and activate it
```bash
python -m venv env
source env/bin/activate
```

2. Install ```inspect_ai```, ```inspect_evals``` and the other dependencies listed in ```requirements.txt```
```bash
python -m pip install -r requirements.txt
```

3. Install any packages required for models you'd like to evaluate and use as grader models
```bash
python -m pip install <model_package>
```
Note: the ```openai``` package is already included in ```requirements.txt```
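
Before running anything, it can help to confirm that the environment matches the steps above. The snippet below is a minimal sketch, not part of the repository: the helper names ```missing_packages``` and ```python_is_supported``` are hypothetical, and the package list assumes the import names ```inspect_ai```, ```inspect_evals``` and ```openai```.

```python
import importlib.util
import sys

# Packages this guide expects after installation (assumed import names).
REQUIRED_PACKAGES = ["inspect_ai", "inspect_evals", "openai"]


def missing_packages(names):
    # find_spec returns None when a top-level package cannot be found.
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info):
    # The setup steps above ask for Python 3.10 or newer.
    return tuple(version_info[:2]) >= (3, 10)


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("Missing packages:", missing_packages(REQUIRED_PACKAGES))
```

An empty "Missing packages" list suggests the install steps completed; any name that is listed can be installed with ```python -m pip install <model_package>``` as in step 3.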

### Run Inspect evaluation
1. Update the ```src/evals_cfg/run_cfg.yaml``` file to select the evals (base/agentic) and list all models to be evaluated
2. Run the evaluation:
```bash
python src/run_evals.py
```
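
To spot-check a single benchmark outside the script, the Inspect command line can also be called directly. The sketch below only builds and prints the commands; it does not run them and is not necessarily how ```src/run_evals.py``` works. The helper name ```build_inspect_commands``` is hypothetical, and the task and model names are illustrative assumptions.

```python
import shlex


def build_inspect_commands(tasks, models, limit=None):
    # One "inspect eval" invocation per (model, task) pair.
    commands = []
    for model in models:
        for task in tasks:
            cmd = ["inspect", "eval", f"inspect_evals/{task}", "--model", model]
            if limit is not None:
                # --limit restricts the run to the first N samples.
                cmd += ["--limit", str(limit)]
            commands.append(cmd)
    return commands


# Illustrative values only; substitute the evals and models you configured.
for command in build_inspect_commands(["gsm8k", "hellaswag"], ["openai/gpt-4o"], limit=10):
    print(shlex.join(command))
```

Each printed line can be pasted into a shell with the virtual environment active and the relevant API key set. Keeping ```limit``` small makes for a quick smoke test before a full run.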
"""