Commit 0e30fae
Parent(s): e4f522a
minor tweaks to the text

app.py CHANGED
@@ -110,8 +110,10 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
     gr.Markdown("# AutoBench LLM Leaderboard")
     gr.Markdown(
         "Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. "
-        "Includes performance, cost, and latency metrics
-        "
+        "Includes performance, cost, and latency metrics."
+        "Data updated on April 25, 2025."
+        "\n\nMore info for this benchmark run: [AutoBench Run 2 Results](https://huggingface.co/blog/PeterKruger/autobench-2nd-run)"
+        " If you want to know more about AutoBench: [AutoBench Release](https://huggingface.co/blog/PeterKruger/autobench)"
     )
 
     # --- Tab 1: Overall Ranking ---
@@ -139,7 +141,7 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
     # --- NEW Tab 1.5: Benchmark Comparison ---
     with gr.Tab("Benchmark Comparison"):
         gr.Markdown("## Benchmark Comparison")
-        gr.Markdown("Comparison of AutoBench scores with other popular benchmarks
+        gr.Markdown("Comparison of AutoBench scores with other popular benchmarks. AutoBench features 82.51% correlation with Chatbot Arena, 83.74% with Artificial Analysis Intelligence Index, and 71.51% with MMLU. Models sorted by AutoBench score.")
         if not df_benchmark_display.empty:
             gr.DataFrame(
                 df_benchmark_display,
@@ -315,10 +317,10 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
 AutoBench is an LLM benchmark where Large Language Models (LLMs) evaluate and rank the responses generated by other LLMs. The questions themselves are also generated by LLMs across a diverse set of domains and ranked for quality.
 
 ### Methodology
-1. **Question Generation:** High-quality questions across various domains (Coding, History, Science, etc.) are generated by
+1. **Question Generation:** High-quality questions across various domains (Coding, History, Science, etc.) are generated by selected LLMs.
 2. **Response Generation:** The models being benchmarked generate answers to these questions.
-3. **Ranking:**
-4. **Aggregation:** Scores are averaged across multiple questions and domains to produce the final AutoBench rank.
+3. **Ranking:** Ranking LLMs rank the responses from different models for each question, on a 1-5 scale.
+4. **Aggregation:** Scores are averaged across multiple questions and domains to produce the final AutoBench rank.
 
 ### Metrics
 * **AutoBench Score (AB):** The average rank received by a model's responses across all questions/domains (higher is better).
@@ -331,8 +333,9 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
 This leaderboard reflects a run completed on April 23, 2025. Models included recently released models such as o4-mini, Gpt-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thikning, etc..
 
 ### Links
+* [AutoBench Run 2 Results](https://huggingface.co/blog/PeterKruger/autobench-2nd-run)
 * [AutoBench Blog Post](https://huggingface.co/blog/PeterKruger/autobench)
-* [
+* [Autobench Repositories](https://huggingface.co/AutoBench)
 
 **Disclaimer:** Benchmark results provide one perspective on model capabilities. Performance can vary based on specific tasks, prompts, and API conditions. Costs are estimates and subject to change by providers. Latency depends on server load and geographic location.
 """)
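For readers skimming the fragments above, here is a minimal, self-contained sketch of the Gradio pattern this commit touches in app.py: a gr.Blocks app with a Markdown header and a Benchmark Comparison tab that renders a DataFrame. Only the gr.Blocks / gr.Tab / gr.Markdown / gr.DataFrame structure is taken from the diff; the placeholder data, the interactive=False argument, and the launch() call are illustrative assumptions.

```python
# Minimal sketch of the Gradio layout touched by this commit (illustrative only;
# the real app.py loads the actual leaderboard data and defines more tabs).
import gradio as gr
import pandas as pd

# Placeholder benchmark-comparison table; the real df_benchmark_display is built
# from the AutoBench run results.
df_benchmark_display = pd.DataFrame({
    "Model": ["o4-mini", "Gemini 2.5 Pro Preview"],
    "AutoBench": [4.6, 4.5],        # illustrative values, not real results
    "Chatbot Arena": [1350, 1380],  # illustrative values, not real results
})

with gr.Blocks(theme=gr.themes.Soft()) as app:
    gr.Markdown("# AutoBench LLM Leaderboard")
    gr.Markdown(
        "Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. "
        "Includes performance, cost, and latency metrics."
    )

    with gr.Tab("Benchmark Comparison"):
        gr.Markdown("## Benchmark Comparison")
        if not df_benchmark_display.empty:
            gr.DataFrame(df_benchmark_display, interactive=False)

if __name__ == "__main__":
    app.launch()
```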
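The Methodology text quoted in the later hunks describes ranks on a 1-5 scale being averaged across questions and domains into the final AutoBench score. As a rough illustration of that aggregation step, here is a small pandas sketch; the column names, the domain-then-model averaging order, and the numbers are assumptions for illustration, not the actual AutoBench implementation.

```python
# Illustrative aggregation sketch for step 4 of the methodology (not the real pipeline).
# Each row is one rank (1-5 scale, higher is better) that a ranking LLM assigned to
# one model's answer to one question in one domain.
import pandas as pd

ranks = pd.DataFrame({
    "model":    ["model-a", "model-a", "model-a", "model-b", "model-b", "model-b"],
    "domain":   ["Coding", "Coding", "History", "Coding", "Coding", "History"],
    "question": ["q1", "q2", "q3", "q1", "q2", "q3"],
    "rank":     [4.5, 4.0, 4.2, 3.8, 4.1, 4.6],
})

# Average within each domain first, then across domains, so that domains with many
# questions do not dominate the final score (one plausible weighting choice).
per_domain = ranks.groupby(["model", "domain"])["rank"].mean()
autobench_score = per_domain.groupby(level="model").mean().sort_values(ascending=False)
print(autobench_score)
```

The real run aggregates over many more questions, domains, and ranking models; the exact procedure is described in the AutoBench blog posts linked in the About tab.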