Spaces: gorilla-llm/berkeley-function-calling-leaderboard

Huanzhi Mao committed on Apr 2, 2024

Commit 383da93 (1 parent: c49d028)

update description

Files changed (1)

app.py +17 -19


@@ -1051,8 +1051,6 @@ with gr.Blocks() as demo:
                 **FC = native support for function/tool calling.**
                 **Cost is calculated as an estimate of the cost per 1000 function calls, in USD. Latency is measured in seconds.**
-                **AST Summary is the unweighted average of the four test categories under AST Evaluation. Exec Summary is the unweighted average of the four test categories under Exec Evaluation.**
                 **Click on column header to sort. If you would like to add your model or contribute test-cases, please contact us via [discord](https://discord.gg/SwTyuTAxX3).**
                 """
@@ -1062,28 +1060,28 @@ with gr.Blocks() as demo:
         with gr.TabItem("Evaluation Categories"):
             gr.Markdown(
                 """
-                    # Python  Evaluation
-                    **Simple Function** evaluation contains the simplest but most commonly seen format, where the user supplies a single JSON function document, with one and only one function call will be invoked.
-                    **Multiple Function** contains a user question that only invokes one function call out of 2 to 4 JSON function documentations. The model needs to be capable of selecting the best function to invoke according to user provided context.
-                    **Parallel Function** is defined as invoking multiple function calls in parallel with one user query. The model needs to digest how many function calls need to be made and the question to model can be a single sentence or multiple sentence.
-                    **Parallel Multiple Function** is the combination of parallel function and multiple function. In another word, the model is provided with multiple function documentations, each of the corresponding function calls will be invoked zero or more times.
-                    """
-            )
-            gr.Markdown(
-                """
-                # non-Python Evaluation
                 In **relevance detection**, we design scenarios where none of the provided functions are relevant and supposed to be invoked. We expect the model's output to be no function call. This scenario provides insight to whether a model will hallucinate on its function and parameter to generate function code despite lacking the function information or instructions from the users to do so.
-                In **REST**, we include real world GET requests to test the model's capabilities to generate executable REST API calls through complex function documentations, using requests.get() along with the API's hardcoded URL and description of the purpose of the function and its parameters. Our evaluation includes two variations. The first type requires passing the parameters inside the URL, called path parameters. The second type requires the model to put parameters as key/value pairs into the params and/or headers of requests.get(.).
-                In **Java** and **Javascript**, the goal is to understand how well the function calling model can be extended to not just Python type but all the language specific typings such as the HashMap in Java. We included 100 examples for Java AST evaluation and 70 examples for Javascript AST evaluation.
-                """)
         with gr.TabItem("Try It Out"):
             with gr.Row():
                 with gr.Column(scale=1):

                 **FC = native support for function/tool calling.**
                 **Cost is calculated as an estimate of the cost per 1000 function calls, in USD. Latency is measured in seconds.**
                 **Click on column header to sort. If you would like to add your model or contribute test-cases, please contact us via [discord](https://discord.gg/SwTyuTAxX3).**
                 """
         with gr.TabItem("Evaluation Categories"):
             gr.Markdown(
                 """
+                ### What do the different columns in the leaderboard represent?
+                We provide a short summary here. For more details, please refer to our release [blog](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html):
+                **AST** means evaluation through Abstract Syntax Tree, and **Exec** means evaluation through execution.
+                **Cost** is calculated as an estimate of the cost per 1000 function calls, in USD.
+                **Latency** is measured in seconds.
+                **Simple Function** evaluation contains the simplest but most commonly seen format, where the user supplies a single JSON function document and exactly one function call is invoked.
+                **Multiple Function** contains a user question that invokes exactly one function call, chosen from 2 to 4 JSON function documents. The model needs to be capable of selecting the best function to invoke according to user-provided context. For example, if the prompt is `what is 2 + 3?` and the options are `add()` and `mult()`, the model should select `add()`.
+                **Parallel Function** is defined as invoking multiple function calls in parallel with one user query. The model needs to determine how many function calls are required, and the question can be a single sentence or multiple sentences. For example, if the prompt is `What's the weather in San Francisco and New York` and the function provided is `get_weather()`, the model should return both `get_weather('San Francisco')` and `get_weather('New York')`.
+                **Parallel Multiple Function** is the combination of parallel function and multiple function. In other words, the model is provided with multiple function documents, and each of the corresponding functions may be invoked zero or more times.
                 In **relevance detection**, we design scenarios where none of the provided functions are relevant and supposed to be invoked. We expect the model's output to be no function call. This scenario provides insight to whether a model will hallucinate on its function and parameter to generate function code despite lacking the function information or instructions from the users to do so.
+                """
+            )
         with gr.TabItem("Try It Out"):
             with gr.Row():
                 with gr.Column(scale=1):
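
The category descriptions in this commit (`add()` vs `mult()`, the two `get_weather()` calls, and the "no function call" outcome for relevance detection) can be illustrated with a small AST-based comparison. This is only a rough sketch, not the leaderboard's actual evaluator: it assumes the model output is written in Python call syntax with literal arguments, and the helpers `parse_calls` and `matches` are hypothetical names introduced here for illustration.

```python
import ast  # ast.unparse requires Python 3.9+


def parse_calls(model_output: str) -> list:
    """Parse "f(1)" or "[f(1), g(x=2)]" into (name, args, kwargs) triples.

    Assumption: the model answers in Python call syntax with literal arguments.
    An empty string or "[]" yields an empty list, meaning "no function call".
    """
    text = model_output.strip()
    if not text:
        return []
    body = ast.parse(text, mode="eval").body
    nodes = body.elts if isinstance(body, (ast.List, ast.Tuple)) else [body]
    calls = []
    for node in nodes:
        if not isinstance(node, ast.Call):
            raise ValueError("expected a function call")
        name = ast.unparse(node.func)
        args = tuple(ast.literal_eval(arg) for arg in node.args)
        kwargs = []
        for kw in node.keywords:
            if kw.arg is None:
                raise ValueError("**kwargs unpacking is not supported")
            kwargs.append((kw.arg, ast.literal_eval(kw.value)))
        # Sort keyword arguments by name so their written order does not matter.
        calls.append((name, args, tuple(sorted(kwargs, key=lambda item: item[0]))))
    return calls


def matches(model_output: str, expected: list) -> bool:
    """Compare parsed calls with the expected calls, ignoring call order."""
    try:
        got = parse_calls(model_output)
    except (SyntaxError, ValueError):
        return False  # unparsable output or non-literal arguments
    return sorted(got, key=repr) == sorted(expected, key=repr)


# Multiple Function: the model must pick add() rather than mult().
print(matches("add(2, 3)", [("add", (2, 3), ())]))

# Parallel Function: both calls are required, in any order.
weather = [("get_weather", ("San Francisco",), ()), ("get_weather", ("New York",), ())]
print(matches("[get_weather('New York'), get_weather('San Francisco')]", weather))

# Relevance detection: the expected result is no call at all.
print(matches("", []))
```

Comparing parsed calls rather than raw strings means differences in whitespace, quoting style, or call order do not count as failures, which is the general motivation for a syntax-tree comparison. How the real benchmark handles optional parameters or multiple acceptable values is described in the release blog linked in the diff and is not modelled here.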