jwilles committed on
Commit 4425c4b · 1 Parent(s): 1661f8d
Files changed (1)
  1. src/about.py +13 -14
src/about.py CHANGED
@@ -34,18 +34,17 @@ These benchmarks assess fundamental reasoning and knowledge capabilities of mode
 
  | Benchmark | Description |
  |--------------------|----------------------------------------------------------------------------------|
- | **ARC-Easy** / **ARC-Challenge** | Multiple-choice science questions measuring scientific & commonsense reasoning. |
- | **DROP** | Reading comprehension benchmark emphasizing discrete reasoning steps. |
- | **WinoGrande** | Commonsense reasoning challenge focused on co-reference resolution. |
- | **GSM8K** | Grade-school math word problems testing arithmetic & multi-step reasoning. |
- | **HellaSwag** | Commonsense inference task centered on action completion. |
+ | **ARC-Easy** / **ARC-Challenge** | Multiple-choice science questions. |
+ | **DROP** | Comprehension benchmark evaluating advanced reasoning capability. |
+ | **WinoGrande** | Commonsense reasoning challenge. |
+ | **GSM8K** | Grade-school math word problems testing math capability & multi-step reasoning. |
+ | **HellaSwag** | Commonsense reasoning task. |
  | **HumanEval** | Evaluates code generation and reasoning in a programming context. |
- | **IFEval** | Specialized benchmark for incremental formal reasoning. |
- | **IFEval** | Specialized benchmark for incremental formal reasoning. |
- | **MATH** | High school-level math questions requiring detailed solutions. |
+ | **IFEval** | Specialized benchmark for instruction following. |
+ | **MATH** | Challenging questions sourced from math competitions. |
  | **MMLU** / **MMLU-Pro**| Multi-subject multiple-choice tests of advanced knowledge. |
- | **GPQA-Diamond** | Question-answering benchmark assessing deeper reasoning & knowledge linking. |
- | **MMMU** (Multi-Choice / Open-Ended) | Multilingual & multi-domain tasks testing structured & open responses. |
+ | **GPQA-Diamond** | Question-answering benchmark assessing deeper reasoning. |
+ | **MMMU** (Multi-Choice / Open-Ended) | Multi-modal tasks testing structured & open responses. |
  </div>
 
  ### 🚀 Agentic Benchmarks
@@ -56,11 +55,11 @@ These benchmarks go beyond basic reasoning and evaluate more advanced, autonomou
 
  | Benchmark | Description |
  |-----------------------|----------------------------------------------------------------------------|
- | **GAIA** | Evaluates autonomous reasoning, planning, problem-solving, & multi-turn interactions. |
- | [**InterCode-CTF**](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/in_house_ctf/) | Capture-the-flag challenge focused on code interpretation & debugging. |
- | **GDM-In-House-CTF** | Capture-the-flag challenge testing web application security skills. |
+ | **GAIA** | Evaluates autonomous reasoning, planning, problem-solving for question answering. |
+ | [**InterCode-CTF**] | Capture-the-flag challenge testing security skills. |
+ | **In-House-CTF** | Capture-the-flag challenge testing security skills. |
  | **AgentHarm** / **AgentHarm-Benign** | Measures harmfulness of LLM agents (and benign behavior baseline). |
- | **SWE-Bench** | Tests AI agent ability to solve software engineering tasks. |
+ | **SWE-Bench-Verified** | Tests AI agent ability to solve software engineering tasks. |
  </div>
  """
 