gorilla-llm/berkeley-function-calling-leaderboard

Huanzhi Mao committed on Apr 1, 2024

Commit 027abe2, 1 parent: 8a12377

add description

Files changed (1)

app.py +25 -0

@@ -1059,6 +1059,31 @@ with gr.Blocks() as demo:
             )
             leaderboard_data = gr.Dataframe(value=get_leaderboard(), wrap=True)
+        with gr.TabItem("Evaluation Categories"):
+            gr.Markdown(
+                """
+                    # Python Evaluation
+                    **Simple Function** evaluation covers the simplest and most commonly seen format: the user supplies a single JSON function document, and exactly one function call is invoked.
+                    **Multiple Function** evaluation contains a user question that invokes only one function call, chosen from 2 to 4 JSON function documents. The model needs to select the best function to invoke according to the user-provided context.
+                    **Parallel Function** evaluation invokes multiple function calls in parallel from a single user query. The model needs to work out how many function calls are required, and the question can be a single sentence or multiple sentences.
+                    **Parallel Multiple Function** evaluation combines parallel function and multiple function. In other words, the model is given multiple function documents, and each corresponding function is invoked zero or more times.
+                    """
+            )
+            gr.Markdown(
+                """
+                # Non-Python Evaluation
+                In **relevance detection**, we design scenarios where none of the provided functions is relevant, so none should be invoked. We expect the model to output no function call. This scenario shows whether a model will hallucinate a function and its parameters, generating function code even though it lacks the function information or any user instruction to do so.
+                In **REST**, we include real-world GET requests to test the model's ability to generate executable REST API calls from complex function documentation, using requests.get() along with the API's hardcoded URL and a description of the function's purpose and parameters. Our evaluation includes two variations. The first requires passing the parameters inside the URL, as path parameters. The second requires the model to put parameters as key/value pairs into the params and/or headers of requests.get().
+                In **Java** and **JavaScript**, the goal is to understand how well the function-calling model extends beyond Python types to language-specific types, such as HashMap in Java. We include 100 examples for Java AST evaluation and 70 examples for JavaScript AST evaluation.
+                """)
         with gr.TabItem("Try It Out"):
             with gr.Row():
                 with gr.Column(scale=1):
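
Note: the category descriptions added in this commit can be read as a rule over two counts, the number of function documents offered and the number of calls expected. The sketch below is only an illustration of that reading and is not part of app.py or the leaderboard's evaluation code. The `Example` class, the `categorize` helper, and the `get_weather` / `get_time` documents are hypothetical, and treating "Parallel Function" as the single-document case is an assumption that the description does not state outright.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    """Hypothetical test case: function docs offered and calls expected."""
    function_docs: List[Dict]   # JSON-style function documents shown to the model
    expected_calls: List[Dict]  # calls the model should emit; empty means "no call"


def categorize(example: Example) -> str:
    """Assumed mapping from (docs, calls) counts to the category names above."""
    n_docs = len(example.function_docs)
    n_calls = len(example.expected_calls)
    if n_calls == 0:
        # No provided function is relevant, so the expected output is no call.
        return "relevance detection"
    if n_docs == 0:
        raise ValueError("a call is expected but no function document was given")
    if n_docs == 1:
        return "simple" if n_calls == 1 else "parallel"
    return "multiple" if n_calls == 1 else "parallel multiple"


# Hypothetical function documents, for illustration only.
weather_doc = {"name": "get_weather", "parameters": {"city": "string"}}
time_doc = {"name": "get_time", "parameters": {"city": "string"}}


def call(name: str, city: str) -> Dict:
    """Build a hypothetical expected call."""
    return {"name": name, "arguments": {"city": city}}


simple = Example([weather_doc], [call("get_weather", "Paris")])
multiple = Example([weather_doc, time_doc], [call("get_time", "Paris")])
parallel = Example(
    [weather_doc],
    [call("get_weather", "Paris"), call("get_weather", "Rome")],
)
parallel_multiple = Example(
    [weather_doc, time_doc],
    [call("get_weather", "Paris"), call("get_time", "Rome")],
)
irrelevant = Example([weather_doc], [])

for ex in (simple, multiple, parallel, parallel_multiple, irrelevant):
    print(categorize(ex))
```

Under this reading, relevance detection is the zero-call case: a function document is still supplied, but the correct output is to call nothing.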