JarvisKi committed
Commit 4134025 · verified · 1 Parent(s): 8189f72

Update app.py

Files changed (1)
  1. app.py +0 -14
app.py CHANGED
@@ -156,20 +156,6 @@ with gr.Blocks() as app:
  outputs=[table1, table2]
  )
 
- gr.Markdown("## The Stable Evaluation System")
- gr.Markdown("**Solvable Tasks Filtration**. Since the solvablility of tasks in original ToolBench induces siginificant instability, we filter out the unsolvable tasks in advance. This process is executed using GPT-4, Gemini Pro, and Claude 2. Each task from the dataset is evaluated by these models to determine its solvability through majority voting. A task is classified as solvable if it provides all the necessary and valid information required for completion and can be resolved with the available tools. Human evaluation shows that these models can effectively filter out unsolvable tasks, ensuring the stability of the benchmark.")
- gr.Markdown("**Metrics (SoPR and SoWR)**. Due to the limitation of gpt-3.5-turbo-16k in tool learning, we uniformly adopt gpt-4-turbo-preview as the automatic evaluator. SoPR is in essence PR with all tasks solvable and only assesses the answers using the same prompt in ToolBench. The evaluator assigns outcomes of answers categorised as Solved, Unsolved, or Unsure, which respectively contribute scores of 1, 0.5, and 0 to the overall SoPR calculation. As for SoWR, when one is solved and the other is unsolved, the solved one wins. Under other circumstances, gpt-4-turbo-preview will be used to make a win-lose decision.")
-
-
- headers_ex = ["", "I1 Instruction", "I1 Category", "I1 Tool", "I2 Instruction", "I2 Category", "I3 Instruction",
- "Total"]
- data_ex = [
- ["Full", 200, 200, 200, 200, 200, 100, 1100],
- ["Solvable", 163, 153, 158, 106, 124, 61, 765]
- ]
- gr.Markdown("#### Table: Summary of Task Statistics before and after filtration")
- gr.Dataframe(headers=headers_ex, value=data_ex, interactive=False)
-
  gr.Markdown("## Upload Your Own Results")
  gr.Markdown("""
  If you would like to contribute to the leaderboard, please follow the JSON structure below for your method's scores.
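
Reviewer note: the Markdown removed in this commit describes three rules (majority-vote solvability filtering, SoPR scoring, and the SoWR short-circuit) only in prose. The block below is a minimal sketch of those rules, not the evaluator's actual code. All names (`Outcome`, `SCORE`, `is_solvable`, `sopr`, `sowr_winner`, and the `judge` callback standing in for the gpt-4-turbo-preview call) are hypothetical. It also assumes that Unsure is the half-credit outcome; read literally, the removed sentence lists "Solved, Unsolved, or Unsure" against "1, 0.5, and 0", which would give Unsolved 0.5, so that ordering may be worth checking against the original ToolBench prompt.

```python
from enum import Enum
from typing import Callable, Iterable


class Outcome(Enum):
    """Evaluator verdict for a single answer (names taken from the removed text)."""
    SOLVED = "Solved"
    UNSURE = "Unsure"
    UNSOLVED = "Unsolved"


# Assumption: Unsure earns half credit and Unsolved earns none.
SCORE = {Outcome.SOLVED: 1.0, Outcome.UNSURE: 0.5, Outcome.UNSOLVED: 0.0}


def is_solvable(votes: Iterable[bool]) -> bool:
    """Majority vote over per-model solvability judgments (e.g. three models)."""
    votes = list(votes)
    return sum(votes) * 2 > len(votes)


def sopr(outcomes: Iterable[Outcome]) -> float:
    """Mean score over solvable tasks, as a fraction in [0, 1]."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("SoPR is undefined for an empty task list")
    return sum(SCORE[o] for o in outcomes) / len(outcomes)


def sowr_winner(a: Outcome, b: Outcome, judge: Callable[[], str]) -> str:
    """Return "a" or "b" for one pairwise comparison.

    Solved beats Unsolved outright; every other pairing is deferred to
    `judge`, a hypothetical stand-in for the LLM win/lose decision.
    """
    if a is Outcome.SOLVED and b is Outcome.UNSOLVED:
        return "a"
    if b is Outcome.SOLVED and a is Outcome.UNSOLVED:
        return "b"
    return judge()
```

Under these assumptions, `sopr` over one Solved, one Unsure and one Unsolved answer is 0.5, and a Solved versus Unsure pair is still sent to the judge, since the removed text short-circuits only the Solved versus Unsolved case.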