eval-leaderboard

jwilles committed on Apr 10

Commit

ec7dfaf

1 Parent(s): fe65f69

Files changed (1)

src/about.py CHANGED

@@ -56,8 +56,8 @@ These benchmarks go beyond basic reasoning and evaluate more advanced, autonomou
 | Benchmark              | Description                                                                 |
 |-----------------------|----------------------------------------------------------------------------|
 | **GAIA**                | Evaluates autonomous reasoning, planning, problem-solving for question answering. |
-| [**InterCode-CTF**]  | Capture-the-flag challenge testing security skills.        |
-| **In-House-CTF**    | Capture-the-flag challenge testing security skills.         |
+| **InterCode-CTF**  | Capture-the-flag challenge testing cyber-security skills.        |
+| **In-House-CTF**    | Capture-the-flag challenge testing cyber-security skills.         |
 | **AgentHarm** / **AgentHarm-Benign** | Measures harmfulness of LLM agents (and benign behavior baseline).   |
 | **SWE-Bench-Verified**           | Tests AI agent ability to solve software engineering tasks.                 |
 </div>