Model,AutoBench,Chatbot Ar.,AAI Index,MMLU Index,Costs (USD),Avg Answer Duration (sec),P99 Answer Duration (sec)
claude-3.5-haiku-20241022,3.99,1237,34740,0.634,0.00182703,10.8,17.98
claude-3.7-sonnet,4.2,1293,48150,0.803,0.01133934,15.53,32.86
claude-3.7-sonnet:thinking,4.39,1303,57390,0.837,0.0431979,45.8,82.6
deepSeek-R1,4.26,1358,60220,0.844,0.00515901,84.77,223.47
deepSeek-V3,4.09,1318,45580,0.752,0.00094273,34.57,106.53
deepSeek-V3-0324,4.16,1372,53240,0.819,0.00102168,42.28,140.54
gemini-2.0-flash-001,4.16,1356,48090,0.779,0.0003545,5.76,8.82
gemini-2.5-pro-preview-03-25,4.46,1439,67840,0.858,0.01225322,36.57,64.18
gemma-3-27b-it,4.2,1342,37620,0.669,0.00025149,30.03,79.12
gpt-4.1-mini,4.34,,52860,0.781,0.00145459,15.38,29.19
gpt-4o-mini,4,1272,35680,0.648,0.00038653,12.17,21.75
grok-2-1212,4.1,1288,39230,0.709,0.00847157,11.74,23.32
grok-3-beta,4.34,1402,50630,0.799,0.01694996,33.94,69.79
llama-3.1-Nemotron-70B-Instruct-HF,4.18,1269,37280,,0.00038647,25.04,48.74
llama-3.3-70B-Instruct,4.02,1257,41110,0.713,0.00035565,31.03,73.7
llama-3_1-Nemotron-Ultra-253B-v1,4.26,,,0.69,0.0031635,43.84,94.45
llama-4-Maverick-17B-128E-Instruct-FP8,4,1271,50530,0.809,0.00067195,9.76,23.11
llama-4-Scout-17B-16E-Instruct,4,,42990,0.752,0.000477,8.49,13.82
mistral-large-2411,4.05,1249,38270,0.697,0.0052478,29.18,96.77
mistral-small-24b-instruct-2501,3.88,1217,35280,0.652,0.00012061,13.99,29.62
nova-lite-v1,3.89,1217,32530,0.59,0.00015889,5.22,12.47
nova-pro-v1,3.83,1245,37080,0.691,0.0013758,5.65,9.93
o3-mini-2025-01-31,4.26,1305,62860,0.791,0.00612595,10.69,23.67
o4-mini-2025-04-16,4.57,,69830,0.832,0.00792985,19.1,52.3
qwen-plus,4.17,1310,,,0.00094732,34.73,66.7