PeterKruger committed on
Commit 0e30fae · 1 Parent(s): e4f522a

minor tweaks to the text

Files changed (1)
  1. app.py +10 -7
app.py CHANGED
@@ -110,8 +110,10 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
  gr.Markdown("# AutoBench LLM Leaderboard")
  gr.Markdown(
      "Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. "
-     "Includes performance, cost, and latency metrics.\n"
-     "More info: [AutoBench Blog Post](https://huggingface.co/blog/PeterKruger/autobench)"
+     "Includes performance, cost, and latency metrics."
+     "Data updated on April 25, 2025."
+     "\n\nMore info for this benchmark run: [AutoBench Run 2 Results](https://huggingface.co/blog/PeterKruger/autobench-2nd-run)"
+     " If you want to know more about AutoBench: [AutoBench Release](https://huggingface.co/blog/PeterKruger/autobench)"
  )

  # --- Tab 1: Overall Ranking ---
@@ -139,7 +141,7 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
  # --- NEW Tab 1.5: Benchmark Comparison ---
  with gr.Tab("Benchmark Comparison"):
      gr.Markdown("## Benchmark Comparison")
-     gr.Markdown("Comparison of AutoBench scores with other popular benchmarks (Chatbot Arena, Artificial Analysis Index, MMLU Index). Models sorted by AutoBench score.")
+     gr.Markdown("Comparison of AutoBench scores with other popular benchmarks. AutoBench features 82.51% correlation with Chatbot Arena, 83.74% with Artificial Analysis Intelligence Index, and 71.51% with MMLU. Models sorted by AutoBench score.")
      if not df_benchmark_display.empty:
          gr.DataFrame(
              df_benchmark_display,
@@ -315,10 +317,10 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
  AutoBench is an LLM benchmark where Large Language Models (LLMs) evaluate and rank the responses generated by other LLMs. The questions themselves are also generated by LLMs across a diverse set of domains and ranked for quality.

  ### Methodology
- 1. **Question Generation:** High-quality questions across various domains (Coding, History, Science, etc.) are generated by capable LLMs.
+ 1. **Question Generation:** High-quality questions across various domains (Coding, History, Science, etc.) are generated by selected LLMs.
  2. **Response Generation:** The models being benchmarked generate answers to these questions.
- 3. **Ranking:** A high-capability LLM (e.g., GPT-4, Claude 3) ranks the responses from different models for each question, typically on a scale (e.g., 1-5).
- 4. **Aggregation:** Scores are averaged across multiple questions and domains to produce the final AutoBench rank.
+ 3. **Ranking:** Ranking LLMs rank the responses from different models for each question, on a 1-5 scale.
+ 4. **Aggregation:** Scores are averaged across multiple questions and domains to produce the final AutoBench rank.

  ### Metrics
  * **AutoBench Score (AB):** The average rank received by a model's responses across all questions/domains (higher is better).
@@ -331,8 +333,9 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
  This leaderboard reflects a run completed on April 23, 2025. The run included recently released models such as o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, and Claude 3.7 Sonnet:thinking.

  ### Links
+ * [AutoBench Run 2 Results](https://huggingface.co/blog/PeterKruger/autobench-2nd-run)
  * [AutoBench Blog Post](https://huggingface.co/blog/PeterKruger/autobench)
- * [Leaderboard Source Code](https://huggingface.co/spaces/<your-username>/<your-space-name>/tree/main)
+ * [AutoBench Repositories](https://huggingface.co/AutoBench)

  **Disclaimer:** Benchmark results provide one perspective on model capabilities. Performance can vary based on specific tasks, prompts, and API conditions. Costs are estimates and subject to change by providers. Latency depends on server load and geographic location.
  """)