jwilles committed on
Commit ec7dfaf · 1 Parent(s): fe65f69
Files changed (1)
  1. src/about.py +2 -2
src/about.py CHANGED
@@ -56,8 +56,8 @@ These benchmarks go beyond basic reasoning and evaluate more advanced, autonomou
  | Benchmark | Description |
  |-----------------------|----------------------------------------------------------------------------|
  | **GAIA** | Evaluates autonomous reasoning, planning, problem-solving for question answering. |
- | [**InterCode-CTF**] | Capture-the-flag challenge testing security skills. |
- | **In-House-CTF** | Capture-the-flag challenge testing security skills. |
+ | **InterCode-CTF** | Capture-the-flag challenge testing cyber-security skills. |
+ | **In-House-CTF** | Capture-the-flag challenge testing cyber-security skills. |
  | **AgentHarm** / **AgentHarm-Benign** | Measures harmfulness of LLM agents (and benign behavior baseline). |
  | **SWE-Bench-Verified** | Tests AI agent ability to solve software engineering tasks. |
  </div>