Huanzhi Mao committed
Commit 383da93 · 1 Parent(s): c49d028

update description

app.py CHANGED
@@ -1051,8 +1051,6 @@ with gr.Blocks() as demo:
 **FC = native support for function/tool calling.**
 
 **Cost is calculated as an estimate of the cost per 1000 function calls, in USD. Latency is measured in seconds.**
-
-**AST Summary is the unweighted average of the four test categories under AST Evaluation. Exec Summary is the unweighted average of the four test categories under Exec Evaluation.**
 
 **Click on column header to sort. If you would like to add your model or contribute test-cases, please contact us via [discord](https://discord.gg/SwTyuTAxX3).**
 """
@@ -1062,28 +1060,28 @@ with gr.Blocks() as demo:
 with gr.TabItem("Evaluation Categories"):
     gr.Markdown(
 """
-
-
-**Simple Function** evaluation contains the simplest but most commonly seen format, where the user supplies a single JSON function document, with one and only one function call will be invoked.
+### What are the different columns representing in the leaderboard?
 
-
+We provide a short summary here. For more details, please refer to our release [blog](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html):
 
-
+**AST** means evaluation through Abstract Syntax Tree, and **Exec** means evaluation through execution.
+
+**Cost** is calculated as an estimate of the cost per 1000 function calls, in USD.
+
+**Latency** is measured in seconds.
+
+**Simple Function** evaluation contains the simplest but most commonly seen format, where the user supplies a single JSON function document, with one and only one function call will be invoked.
 
-
-"""
+**Multiple Function** contains a user question that only invokes one function call out of 2 to 4 JSON function documentations. The model needs to be capable of selecting the best function to invoke according to user provided context. For example, if the prompt is `what is 2 + 3?` and the options are `add()` and `mult()`, the model should select `add()`.
 
-
-
-
-
-
+**Parallel Function** is defined as invoking multiple function calls in parallel with one user query. The model needs to digest how many function calls need to be made and the question to model can be a single sentence or multiple sentence. For example, if the prompt is `What's the weather in San Francisco and New York` and the function provided is `get_weather()`, the model should return both `get_weather('San Francisco')` and `get_weather('New York')`.
+
+**Parallel Multiple Function** is the combination of parallel function and multiple function. In another word, the model is provided with multiple function documentations, each of the corresponding function calls will be invoked zero or more times.
+
 In **relevance detection**, we design scenarios where none of the provided functions are relevant and supposed to be invoked. We expect the model's output to be no function call. This scenario provides insight to whether a model will hallucinate on its function and parameter to generate function code despite lacking the function information or instructions from the users to do so.
-
-
-
-In **Java** and **Javascript**, the goal is to understand how well the function calling model can be extended to not just Python type but all the language specific typings such as the HashMap in Java. We included 100 examples for Java AST evaluation and 70 examples for Javascript AST evaluation.
-""")
+"""
+)
+
 with gr.TabItem("Try It Out"):
     with gr.Row():
         with gr.Column(scale=1):
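
The category descriptions added in this commit can be made concrete with a small sketch. The snippet below is only an illustration, not the leaderboard's test data or checker: the `EXAMPLES` table, the `calls_match` helper, the argument values for `add`, and the relevance-detection prompt are all hypothetical. Only the prompts and function names for the multiple and parallel cases come from the description text.

```python
from collections import Counter

# Hypothetical examples of three categories from the description.
# "expected" lists the (function name, positional args) calls a model
# should produce; an empty list means "no function call at all".
EXAMPLES = {
    "multiple": {
        "prompt": "what is 2 + 3?",
        "functions": ["add", "mult"],
        "expected": [("add", (2, 3))],  # argument values are an assumption
    },
    "parallel": {
        "prompt": "What's the weather in San Francisco and New York",
        "functions": ["get_weather"],
        "expected": [
            ("get_weather", ("San Francisco",)),
            ("get_weather", ("New York",)),
        ],
    },
    "relevance": {
        "prompt": "Tell me a joke",  # hypothetical prompt, unrelated to the functions
        "functions": ["add", "mult"],
        "expected": [],
    },
}


def calls_match(predicted, expected):
    """Compare two lists of (name, args) calls, ignoring order but not multiplicity."""
    return Counter(predicted) == Counter(expected)
```

Under these assumptions, a parallel answer passes regardless of the order in which the two `get_weather` calls are returned, and a relevance-detection answer passes only when the model returns no calls.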