Commit 0e30fae
Parent(s): e4f522a
minor tweaks to the text

app.py CHANGED
@@ -110,8 +110,10 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
     gr.Markdown("# AutoBench LLM Leaderboard")
     gr.Markdown(
         "Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. "
-        "Includes performance, cost, and latency metrics
-        "
+        "Includes performance, cost, and latency metrics."
+        "Data updated on April 25, 2025."
+        "\n\nMore info for this benchmark run: [AutoBench Run 2 Results](https://huggingface.co/blog/PeterKruger/autobench-2nd-run)"
+        " If you want to know more about AutoBench: [AutoBench Release](https://huggingface.co/blog/PeterKruger/autobench)"
     )
 
     # --- Tab 1: Overall Ranking ---
@@ -139,7 +141,7 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
     # --- NEW Tab 1.5: Benchmark Comparison ---
     with gr.Tab("Benchmark Comparison"):
         gr.Markdown("## Benchmark Comparison")
-        gr.Markdown("Comparison of AutoBench scores with other popular benchmarks
+        gr.Markdown("Comparison of AutoBench scores with other popular benchmarks. AutoBench features 82.51% correlation with Chatbot Arena, 83.74% with Artificial Analysis Intelligence Index, and 71.51% with MMLU. Models sorted by AutoBench score.")
         if not df_benchmark_display.empty:
             gr.DataFrame(
                 df_benchmark_display,
@@ -315,10 +317,10 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
 AutoBench is an LLM benchmark where Large Language Models (LLMs) evaluate and rank the responses generated by other LLMs. The questions themselves are also generated by LLMs across a diverse set of domains and ranked for quality.
 
 ### Methodology
-1. **Question Generation:** High-quality questions across various domains (Coding, History, Science, etc.) are generated by
+1. **Question Generation:** High-quality questions across various domains (Coding, History, Science, etc.) are generated by selected LLMs.
 2. **Response Generation:** The models being benchmarked generate answers to these questions.
-3. **Ranking:**
-4. **Aggregation:** Scores are averaged across multiple questions and domains to produce the final AutoBench rank.
+3. **Ranking:** Ranking LLMs rank the responses from different models for each question, on a 1-5 scale.
+4. **Aggregation:** Scores are averaged across multiple questions and domains to produce the final AutoBench rank.
 
 ### Metrics
 * **AutoBench Score (AB):** The average rank received by a model's responses across all questions/domains (higher is better).
@@ -331,8 +333,9 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
 This leaderboard reflects a run completed on April 23, 2025. Models included recently released models such as o4-mini, Gpt-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thikning, etc..
 
 ### Links
+* [AutoBench Run 2 Results](https://huggingface.co/blog/PeterKruger/autobench-2nd-run)
 * [AutoBench Blog Post](https://huggingface.co/blog/PeterKruger/autobench)
-* [
+* [Autobench Repositories](https://huggingface.co/AutoBench)
 
 **Disclaimer:** Benchmark results provide one perspective on model capabilities. Performance can vary based on specific tasks, prompts, and API conditions. Costs are estimates and subject to change by providers. Latency depends on server load and geographic location.
 """)
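For readers skimming the fragments above, here is a minimal, self-contained sketch of the Gradio pattern this commit touches in app.py: a gr.Blocks app with a Markdown header and a Benchmark Comparison tab that renders a DataFrame. Only the gr.Blocks / gr.Tab / gr.Markdown / gr.DataFrame structure is taken from the diff; the placeholder data, the interactive=False argument, and the launch() call are illustrative assumptions.

```python
# Minimal sketch of the Gradio layout touched by this commit (illustrative only;
# the real app.py loads the actual leaderboard data and defines more tabs).
import gradio as gr
import pandas as pd

# Placeholder benchmark-comparison table; the real df_benchmark_display is built
# from the AutoBench run results.
df_benchmark_display = pd.DataFrame({
    "Model": ["o4-mini", "Gemini 2.5 Pro Preview"],
    "AutoBench": [4.6, 4.5],        # illustrative values, not real results
    "Chatbot Arena": [1350, 1380],  # illustrative values, not real results
})

with gr.Blocks(theme=gr.themes.Soft()) as app:
    gr.Markdown("# AutoBench LLM Leaderboard")
    gr.Markdown(
        "Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. "
        "Includes performance, cost, and latency metrics."
    )

    with gr.Tab("Benchmark Comparison"):
        gr.Markdown("## Benchmark Comparison")
        if not df_benchmark_display.empty:
            gr.DataFrame(df_benchmark_display, interactive=False)

if __name__ == "__main__":
    app.launch()
```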
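The Methodology text quoted in the later hunks describes ranks on a 1-5 scale being averaged across questions and domains into the final AutoBench score. As a rough illustration of that aggregation step, here is a small pandas sketch; the column names, the domain-then-model averaging order, and the numbers are assumptions for illustration, not the actual AutoBench implementation.

```python
# Illustrative aggregation sketch for step 4 of the methodology (not the real pipeline).
# Each row is one rank (1-5 scale, higher is better) that a ranking LLM assigned to
# one model's answer to one question in one domain.
import pandas as pd

ranks = pd.DataFrame({
    "model":    ["model-a", "model-a", "model-a", "model-b", "model-b", "model-b"],
    "domain":   ["Coding", "Coding", "History", "Coding", "Coding", "History"],
    "question": ["q1", "q2", "q3", "q1", "q2", "q3"],
    "rank":     [4.5, 4.0, 4.2, 3.8, 4.1, 4.6],
})

# Average within each domain first, then across domains, so that domains with many
# questions do not dominate the final score (one plausible weighting choice).
per_domain = ranks.groupby(["model", "domain"])["rank"].mean()
autobench_score = per_domain.groupby(level="model").mean().sort_values(ascending=False)
print(autobench_score)
```

The real run aggregates over many more questions, domains, and ranking models; the exact procedure is described in the AutoBench blog posts linked in the About tab.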