Update app.py
app.py
CHANGED
@@ -156,20 +156,6 @@ with gr.Blocks() as app:
 outputs=[table1, table2]
 )
 
-gr.Markdown("## The Stable Evaluation System")
-gr.Markdown("**Solvable Tasks Filtration**. Since the solvablility of tasks in original ToolBench induces siginificant instability, we filter out the unsolvable tasks in advance. This process is executed using GPT-4, Gemini Pro, and Claude 2. Each task from the dataset is evaluated by these models to determine its solvability through majority voting. A task is classified as solvable if it provides all the necessary and valid information required for completion and can be resolved with the available tools. Human evaluation shows that these models can effectively filter out unsolvable tasks, ensuring the stability of the benchmark.")
-gr.Markdown("**Metrics (SoPR and SoWR)**. Due to the limitation of gpt-3.5-turbo-16k in tool learning, we uniformly adopt gpt-4-turbo-preview as the automatic evaluator. SoPR is in essence PR with all tasks solvable and only assesses the answers using the same prompt in ToolBench. The evaluator assigns outcomes of answers categorised as Solved, Unsolved, or Unsure, which respectively contribute scores of 1, 0.5, and 0 to the overall SoPR calculation. As for SoWR, when one is solved and the other is unsolved, the solved one wins. Under other circumstances, gpt-4-turbo-preview will be used to make a win-lose decision.")
-
-
-headers_ex = ["", "I1 Instruction", "I1 Category", "I1 Tool", "I2 Instruction", "I2 Category", "I3 Instruction",
-"Total"]
-data_ex = [
-["Full", 200, 200, 200, 200, 200, 100, 1100],
-["Solvable", 163, 153, 158, 106, 124, 61, 765]
-]
-gr.Markdown("#### Table: Summary of Task Statistics before and after filtration")
-gr.Dataframe(headers=headers_ex, value=data_ex, interactive=False)
-
 gr.Markdown("## Upload Your Own Results")
 gr.Markdown("""
 If you would like to contribute to the leaderboard, please follow the JSON structure below for your method's scores.
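For readers of this change, the filtration and scoring rules described in the removed Markdown text can be sketched roughly as follows. This is an illustrative sketch, not code from app.py or from ToolBench: the function names `is_solvable`, `sopr`, and `sowr_shortcut` are hypothetical, and SoPR is returned as a fraction rather than a percentage. The removed text pairs the labels "Solved, Unsolved, or Unsure" with scores 1, 0.5, and 0; the sketch assumes the intended mapping is Solved = 1, Unsure = 0.5, Unsolved = 0, which may not match the actual evaluator.

```python
from typing import Optional, Sequence

# ASSUMPTION: label-to-score mapping inferred from the removed text, which
# lists the labels in a different order than the scores appear to intend.
SCORES = {"Solved": 1.0, "Unsure": 0.5, "Unsolved": 0.0}


def is_solvable(votes: Sequence[bool]) -> bool:
    """Majority vote over per-model solvability judgments (hypothetical helper)."""
    return sum(votes) > len(votes) / 2


def sopr(labels: Sequence[str]) -> float:
    """Mean score over evaluator labels, as a fraction in [0, 1]."""
    if not labels:
        raise ValueError("no labels to score")
    return sum(SCORES[label] for label in labels) / len(labels)


def sowr_shortcut(candidate: str, reference: str) -> Optional[bool]:
    """Apply the Solved-vs-Unsolved rule to one pair of answers.

    Returns True if the candidate wins, False if it loses, and None when the
    rule does not decide the pair and the evaluator model would have to judge.
    """
    if candidate == "Solved" and reference == "Unsolved":
        return True
    if candidate == "Unsolved" and reference == "Solved":
        return False
    return None
```

With three model votes, `is_solvable([True, True, False])` keeps the task, and `sowr_shortcut` returning `None` marks the pairs that would still need a model-based win-lose decision.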