src/about.py · CHANGED · +13 −14
@@ -34,18 +34,17 @@ These benchmarks assess fundamental reasoning and knowledge capabilities of models
 
 | Benchmark | Description |
 |--------------------|----------------------------------------------------------------------------------|
-| **ARC-Easy** / **ARC-Challenge** | Multiple-choice science questions
-| **DROP** |
-| **WinoGrande** | Commonsense reasoning challenge
-| **GSM8K** | Grade-school math word problems testing
-| **HellaSwag** | Commonsense
+| **ARC-Easy** / **ARC-Challenge** | Multiple-choice science questions. |
+| **DROP** | Comprehension benchmark evaluating advanced reasoning capability. |
+| **WinoGrande** | Commonsense reasoning challenge. |
+| **GSM8K** | Grade-school math word problems testing math capability & multi-step reasoning. |
+| **HellaSwag** | Commonsense reasoning task. |
 | **HumanEval** | Evaluates code generation and reasoning in a programming context. |
-| **IFEval** | Specialized benchmark for
-| **
-| **MATH** | High school-level math questions requiring detailed solutions. |
+| **IFEval** | Specialized benchmark for instruction following. |
+| **MATH** | Challenging questions sourced from math competitions. |
 | **MMLU** / **MMLU-Pro**| Multi-subject multiple-choice tests of advanced knowledge. |
-| **GPQA-Diamond** | Question-answering benchmark assessing deeper reasoning
-| **MMMU** (Multi-Choice / Open-Ended) |
+| **GPQA-Diamond** | Question-answering benchmark assessing deeper reasoning. |
+| **MMMU** (Multi-Choice / Open-Ended) | Multi-modal tasks testing structured & open responses. |
 </div>
 
 ### 🚀 Agentic Benchmarks
@@ -56,11 +55,11 @@ These benchmarks go beyond basic reasoning and evaluate more advanced, autonomous
 
 | Benchmark | Description |
 |-----------------------|----------------------------------------------------------------------------|
-| **GAIA** | Evaluates autonomous reasoning, planning, problem-solving
-| [**InterCode-CTF**]
-| **
+| **GAIA** | Evaluates autonomous reasoning, planning, problem-solving for question answering. |
+| [**InterCode-CTF**] | Capture-the-flag challenge testing security skills. |
+| **In-House-CTF** | Capture-the-flag challenge testing security skills. |
 | **AgentHarm** / **AgentHarm-Benign** | Measures harmfulness of LLM agents (and benign behavior baseline). |
-| **SWE-Bench** | Tests AI agent ability to solve software engineering tasks. |
+| **SWE-Bench-Verified** | Tests AI agent ability to solve software engineering tasks. |
 </div>
 """
 
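For context, the markdown edited above lives inside a triple-quoted Python string in `src/about.py`. Below is a minimal sketch, under stated assumptions, of how such a string is commonly rendered in a Gradio leaderboard Space: the constant name `LLM_BENCHMARKS_TEXT`, the "About" tab, and the launch wiring are illustrative assumptions, not taken from this pull request.

```python
# Hypothetical sketch: rendering the about-text string from src/about.py.
# LLM_BENCHMARKS_TEXT, the "About" tab, and the wiring are assumptions for
# illustration; only a fragment of the edited table is reproduced here.
import gradio as gr

LLM_BENCHMARKS_TEXT = """
| Benchmark | Description |
|--------------------|----------------------------------------------------------------------------------|
| **ARC-Easy** / **ARC-Challenge** | Multiple-choice science questions. |
| **GSM8K** | Grade-school math word problems testing math capability & multi-step reasoning. |
"""

with gr.Blocks() as demo:
    with gr.Tab("About"):
        # gr.Markdown renders the string as markdown, including the tables above.
        gr.Markdown(LLM_BENCHMARKS_TEXT)

if __name__ == "__main__":
    demo.launch()
```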