src/about.py CHANGED (+2 -2)
@@ -56,8 +56,8 @@ These benchmarks go beyond basic reasoning and evaluate more advanced, autonomou
 | Benchmark | Description |
 |-----------------------|----------------------------------------------------------------------------|
 | **GAIA** | Evaluates autonomous reasoning, planning, problem-solving for question answering. |
-
-| **In-House-CTF** | Capture-the-flag challenge testing security skills. |
+| **InterCode-CTF** | Capture-the-flag challenge testing cyber-security skills. |
+| **In-House-CTF** | Capture-the-flag challenge testing cyber-security skills. |
 | **AgentHarm** / **AgentHarm-Benign** | Measures harmfulness of LLM agents (and benign behavior baseline). |
 | **SWE-Bench-Verified** | Tests AI agent ability to solve software engineering tasks. |
 </div>