---
title: TRAIL Leaderboard
emoji: 🏆
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: mit
short_description: Trace Reasoning and Agentic Issue Localization Leaderboard
sdk_version: 5.19.0
tags:
- leaderboard
---
# Model Performance Leaderboard

This Hugging Face Space hosts a leaderboard that compares model performance across the metrics of the TRAIL dataset.

## Features

- **Submit Your Answers**: Run your model on the TRAIL dataset and submit your results.
- **Leaderboard**: View how your submissions rank.

## Instructions

1. Please refer to our GitHub repository at https://github.com/patronus-ai/trail-benchmark for step‑by‑step instructions on how to run your model with the TRAIL dataset. 
2. Compress the resulting JSON outputs into a ZIP archive whose filename begins with `SWE_` or `GAIA_`, and submit it.
3. Once the evaluation is complete, we’ll upload the scores (this process will soon be automated).
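
The packaging in step 2 can be sketched in Python as follows. This is a minimal illustration, not part of the official tooling: the function name `make_submission_zip`, the `results.zip` suffix, and the flat archive layout are assumptions. The only requirement taken from the instructions above is that the filename begins with `SWE_` or `GAIA_`; check the GitHub repository for the exact layout the evaluation expects.

```python
import zipfile
from pathlib import Path

# The two filename prefixes named in the instructions above.
ALLOWED_PREFIXES = ("SWE_", "GAIA_")


def make_submission_zip(results_dir, prefix, out_dir="."):
    """Zip every *.json file in results_dir into <prefix>results.zip.

    Hypothetical helper: the archive suffix and the flat layout
    (files stored at the archive root) are assumptions.
    """
    if prefix not in ALLOWED_PREFIXES:
        raise ValueError(f"prefix must be one of {ALLOWED_PREFIXES}, got {prefix!r}")

    json_files = sorted(Path(results_dir).glob("*.json"))
    if not json_files:
        raise FileNotFoundError(f"no JSON outputs found in {results_dir}")

    archive_path = Path(out_dir) / f"{prefix}results.zip"
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for json_file in json_files:
            # Store only the file name so the archive has no directory prefix.
            zf.write(json_file, arcname=json_file.name)
    return archive_path
```

For example, `make_submission_zip("outputs/gaia", "GAIA_")` would write `GAIA_results.zip` to the current directory, where `outputs/gaia` is a placeholder for wherever your model's JSON outputs were saved.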

## Benchmarking on TRAIL

TRAIL (Trace Reasoning and Agentic Issue Localization) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Created from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs: the best model achieves only 11% accuracy, which highlights how difficult trace debugging is for complex agent workflows.

## License

This project is open source and available under the MIT license.