edbeeching (HF staff) committed
Commit 8e9aa4b · verified · 1 parent: 143076b

Upload eval_results/HuggingFaceH4/qwen-1.5-1.8b-dpo/v0.16/mmlu/results_2024-03-25T21-38-10.916331.json with huggingface_hub
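A note for readers checking the numbers in this file: the reported `acc_stderr` values appear consistent with the standard error of a mean of 0/1 scores computed with the sample (n - 1) variance, where n is the task's `effective_num_docs`. The sketch below is only a sanity check under that assumption, not a statement of how lighteval computes the metric. The helper name `binomial_stderr` is hypothetical, and the three numbers per task are copied from the results and task configuration in this commit.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores, using the sample (n - 1) variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# (acc, effective_num_docs, reported acc_stderr), copied from this commit's JSON.
reported = {
    "mmlu:abstract_algebra": (0.24, 100, 0.04292346959909283),
    "mmlu:anatomy": (0.4, 135, 0.04232073695151589),
}

for task, (acc, n, stderr) in reported.items():
    recomputed = binomial_stderr(acc, n)
    print(f"{task}: reported={stderr:.6f} recomputed={recomputed:.6f}")
```

If the assumption holds, the two columns agree to the printed precision. A mismatch on another task would suggest a different estimator or a different document count.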

eval_results/HuggingFaceH4/qwen-1.5-1.8b-dpo/v0.16/mmlu/results_2024-03-25T21-38-10.916331.json ADDED
@@ -0,0 +1,3006 @@
+ {
+   "config_general": {
+     "lighteval_sha": "?",
+     "num_fewshot_seeds": 1,
+     "override_batch_size": 1,
+     "max_samples": null,
+     "job_id": "",
+     "start_time": 591079.494793358,
+     "end_time": 591911.834508147,
+     "total_evaluation_time_secondes": "832.3397147889482",
+     "model_name": "HuggingFaceH4/qwen-1.5-1.8b-dpo",
+     "model_sha": "cece5dc1dbb821c6336f7d7163496c54cde603ea",
+     "model_dtype": "torch.bfloat16",
+     "model_size": "3.79 GB",
+     "config": null
+   },
+   "results": {
+     "leaderboard|mmlu:abstract_algebra|5": {
+       "acc": 0.24,
+       "acc_stderr": 0.04292346959909283
+     },
+     "leaderboard|mmlu:anatomy|5": {
+       "acc": 0.4,
+       "acc_stderr": 0.04232073695151589
+     },
+     "leaderboard|mmlu:astronomy|5": {
+       "acc": 0.5328947368421053,
+       "acc_stderr": 0.040601270352363966
+     },
+     "leaderboard|mmlu:business_ethics|5": {
+       "acc": 0.45,
+       "acc_stderr": 0.049999999999999996
+     },
+     "leaderboard|mmlu:clinical_knowledge|5": {
+       "acc": 0.5056603773584906,
+       "acc_stderr": 0.03077090076385131
+     },
+     "leaderboard|mmlu:college_biology|5": {
+       "acc": 0.4652777777777778,
+       "acc_stderr": 0.04171115858181618
+     },
+     "leaderboard|mmlu:college_chemistry|5": {
+       "acc": 0.39,
+       "acc_stderr": 0.04902071300001975
+     },
+     "leaderboard|mmlu:college_computer_science|5": {
+       "acc": 0.35,
+       "acc_stderr": 0.047937248544110196
+     },
+     "leaderboard|mmlu:college_mathematics|5": {
+       "acc": 0.27,
+       "acc_stderr": 0.0446196043338474
+     },
+     "leaderboard|mmlu:college_medicine|5": {
+       "acc": 0.45664739884393063,
+       "acc_stderr": 0.03798106566014498
+     },
+     "leaderboard|mmlu:college_physics|5": {
+       "acc": 0.29411764705882354,
+       "acc_stderr": 0.04533838195929775
+     },
+     "leaderboard|mmlu:computer_security|5": {
+       "acc": 0.58,
+       "acc_stderr": 0.049604496374885836
+     },
+     "leaderboard|mmlu:conceptual_physics|5": {
+       "acc": 0.4085106382978723,
+       "acc_stderr": 0.032134180267015755
+     },
+     "leaderboard|mmlu:econometrics|5": {
+       "acc": 0.2719298245614035,
+       "acc_stderr": 0.04185774424022056
+     },
+     "leaderboard|mmlu:electrical_engineering|5": {
+       "acc": 0.4896551724137931,
+       "acc_stderr": 0.041657747757287644
+     },
+     "leaderboard|mmlu:elementary_mathematics|5": {
+       "acc": 0.36243386243386244,
+       "acc_stderr": 0.024757473902752045
+     },
+     "leaderboard|mmlu:formal_logic|5": {
+       "acc": 0.3333333333333333,
+       "acc_stderr": 0.042163702135578345
+     },
+     "leaderboard|mmlu:global_facts|5": {
+       "acc": 0.38,
+       "acc_stderr": 0.04878317312145634
+     },
+     "leaderboard|mmlu:high_school_biology|5": {
+       "acc": 0.5451612903225806,
+       "acc_stderr": 0.028327743091561074
+     },
+     "leaderboard|mmlu:high_school_chemistry|5": {
+       "acc": 0.39901477832512317,
+       "acc_stderr": 0.03445487686264715
+     },
+     "leaderboard|mmlu:high_school_computer_science|5": {
+       "acc": 0.44,
+       "acc_stderr": 0.049888765156985884
+     },
+     "leaderboard|mmlu:high_school_european_history|5": {
+       "acc": 0.5696969696969697,
+       "acc_stderr": 0.03866225962879077
+     },
+     "leaderboard|mmlu:high_school_geography|5": {
+       "acc": 0.4797979797979798,
+       "acc_stderr": 0.03559443565563918
+     },
+     "leaderboard|mmlu:high_school_government_and_politics|5": {
+       "acc": 0.6217616580310881,
+       "acc_stderr": 0.03499807276193338
+     },
+     "leaderboard|mmlu:high_school_macroeconomics|5": {
+       "acc": 0.4256410256410256,
+       "acc_stderr": 0.025069094387296535
+     },
+     "leaderboard|mmlu:high_school_mathematics|5": {
+       "acc": 0.34444444444444444,
+       "acc_stderr": 0.028972648884844267
+     },
+     "leaderboard|mmlu:high_school_microeconomics|5": {
+       "acc": 0.4369747899159664,
+       "acc_stderr": 0.03221943636566196
+     },
+     "leaderboard|mmlu:high_school_physics|5": {
+       "acc": 0.2582781456953642,
+       "acc_stderr": 0.035737053147634576
+     },
+     "leaderboard|mmlu:high_school_psychology|5": {
+       "acc": 0.5614678899082569,
+       "acc_stderr": 0.021274713073954562
+     },
+     "leaderboard|mmlu:high_school_statistics|5": {
+       "acc": 0.3333333333333333,
+       "acc_stderr": 0.03214952147802749
+     },
+     "leaderboard|mmlu:high_school_us_history|5": {
+       "acc": 0.5245098039215687,
+       "acc_stderr": 0.03505093194348798
+     },
+     "leaderboard|mmlu:high_school_world_history|5": {
+       "acc": 0.6371308016877637,
+       "acc_stderr": 0.031299208255302136
+     },
+     "leaderboard|mmlu:human_aging|5": {
+       "acc": 0.47085201793721976,
+       "acc_stderr": 0.03350073248773404
+     },
+     "leaderboard|mmlu:human_sexuality|5": {
+       "acc": 0.5572519083969466,
+       "acc_stderr": 0.0435644720266507
+     },
+     "leaderboard|mmlu:international_law|5": {
+       "acc": 0.5950413223140496,
+       "acc_stderr": 0.04481137755942469
+     },
+     "leaderboard|mmlu:jurisprudence|5": {
+       "acc": 0.5462962962962963,
+       "acc_stderr": 0.04812917324536823
+     },
+     "leaderboard|mmlu:logical_fallacies|5": {
+       "acc": 0.4294478527607362,
+       "acc_stderr": 0.03889066619112722
+     },
+     "leaderboard|mmlu:machine_learning|5": {
+       "acc": 0.30357142857142855,
+       "acc_stderr": 0.04364226155841044
+     },
+     "leaderboard|mmlu:management|5": {
+       "acc": 0.6796116504854369,
+       "acc_stderr": 0.04620284082280041
+     },
+     "leaderboard|mmlu:marketing|5": {
+       "acc": 0.7264957264957265,
+       "acc_stderr": 0.02920254015343118
+     },
+     "leaderboard|mmlu:medical_genetics|5": {
+       "acc": 0.61,
+       "acc_stderr": 0.04902071300001974
+     },
+     "leaderboard|mmlu:miscellaneous|5": {
+       "acc": 0.5951468710089399,
+       "acc_stderr": 0.017553246467720256
+     },
+     "leaderboard|mmlu:moral_disputes|5": {
+       "acc": 0.4884393063583815,
+       "acc_stderr": 0.026911898686377903
+     },
+     "leaderboard|mmlu:moral_scenarios|5": {
+       "acc": 0.23687150837988827,
+       "acc_stderr": 0.01421957078810399
+     },
+     "leaderboard|mmlu:nutrition|5": {
+       "acc": 0.5261437908496732,
+       "acc_stderr": 0.028590752958852394
+     },
+     "leaderboard|mmlu:philosophy|5": {
+       "acc": 0.4919614147909968,
+       "acc_stderr": 0.028394421370984545
+     },
+     "leaderboard|mmlu:prehistory|5": {
+       "acc": 0.46296296296296297,
+       "acc_stderr": 0.02774431344337654
+     },
+     "leaderboard|mmlu:professional_accounting|5": {
+       "acc": 0.36524822695035464,
+       "acc_stderr": 0.028723863853281278
+     },
+     "leaderboard|mmlu:professional_law|5": {
+       "acc": 0.3494132985658409,
+       "acc_stderr": 0.01217730625278668
+     },
+     "leaderboard|mmlu:professional_medicine|5": {
+       "acc": 0.41911764705882354,
+       "acc_stderr": 0.02997280717046463
+     },
+     "leaderboard|mmlu:professional_psychology|5": {
+       "acc": 0.42320261437908496,
+       "acc_stderr": 0.01998780976948207
+     },
+     "leaderboard|mmlu:public_relations|5": {
+       "acc": 0.5363636363636364,
+       "acc_stderr": 0.04776449162396197
+     },
+     "leaderboard|mmlu:security_studies|5": {
+       "acc": 0.5346938775510204,
+       "acc_stderr": 0.03193207024425314
+     },
+     "leaderboard|mmlu:sociology|5": {
+       "acc": 0.6119402985074627,
+       "acc_stderr": 0.034457899643627506
+     },
+     "leaderboard|mmlu:us_foreign_policy|5": {
+       "acc": 0.69,
+       "acc_stderr": 0.04648231987117316
+     },
+     "leaderboard|mmlu:virology|5": {
+       "acc": 0.40963855421686746,
+       "acc_stderr": 0.03828401115079023
+     },
+     "leaderboard|mmlu:world_religions|5": {
+       "acc": 0.5614035087719298,
+       "acc_stderr": 0.038057975055904594
+     },
+     "leaderboard|mmlu:_average|5": {
+       "acc": 0.4627857789406414,
+       "acc_stderr": 0.036247392344475986
+     }
+   },
+   "versions": {
+     "leaderboard|mmlu:abstract_algebra|5": 0,
+     "leaderboard|mmlu:anatomy|5": 0,
+     "leaderboard|mmlu:astronomy|5": 0,
+     "leaderboard|mmlu:business_ethics|5": 0,
+     "leaderboard|mmlu:clinical_knowledge|5": 0,
+     "leaderboard|mmlu:college_biology|5": 0,
+     "leaderboard|mmlu:college_chemistry|5": 0,
+     "leaderboard|mmlu:college_computer_science|5": 0,
+     "leaderboard|mmlu:college_mathematics|5": 0,
+     "leaderboard|mmlu:college_medicine|5": 0,
+     "leaderboard|mmlu:college_physics|5": 0,
+     "leaderboard|mmlu:computer_security|5": 0,
+     "leaderboard|mmlu:conceptual_physics|5": 0,
+     "leaderboard|mmlu:econometrics|5": 0,
+     "leaderboard|mmlu:electrical_engineering|5": 0,
+     "leaderboard|mmlu:elementary_mathematics|5": 0,
+     "leaderboard|mmlu:formal_logic|5": 0,
+     "leaderboard|mmlu:global_facts|5": 0,
+     "leaderboard|mmlu:high_school_biology|5": 0,
+     "leaderboard|mmlu:high_school_chemistry|5": 0,
+     "leaderboard|mmlu:high_school_computer_science|5": 0,
+     "leaderboard|mmlu:high_school_european_history|5": 0,
+     "leaderboard|mmlu:high_school_geography|5": 0,
+     "leaderboard|mmlu:high_school_government_and_politics|5": 0,
+     "leaderboard|mmlu:high_school_macroeconomics|5": 0,
+     "leaderboard|mmlu:high_school_mathematics|5": 0,
+     "leaderboard|mmlu:high_school_microeconomics|5": 0,
+     "leaderboard|mmlu:high_school_physics|5": 0,
+     "leaderboard|mmlu:high_school_psychology|5": 0,
+     "leaderboard|mmlu:high_school_statistics|5": 0,
+     "leaderboard|mmlu:high_school_us_history|5": 0,
+     "leaderboard|mmlu:high_school_world_history|5": 0,
+     "leaderboard|mmlu:human_aging|5": 0,
+     "leaderboard|mmlu:human_sexuality|5": 0,
+     "leaderboard|mmlu:international_law|5": 0,
+     "leaderboard|mmlu:jurisprudence|5": 0,
+     "leaderboard|mmlu:logical_fallacies|5": 0,
+     "leaderboard|mmlu:machine_learning|5": 0,
+     "leaderboard|mmlu:management|5": 0,
+     "leaderboard|mmlu:marketing|5": 0,
+     "leaderboard|mmlu:medical_genetics|5": 0,
+     "leaderboard|mmlu:miscellaneous|5": 0,
+     "leaderboard|mmlu:moral_disputes|5": 0,
+     "leaderboard|mmlu:moral_scenarios|5": 0,
+     "leaderboard|mmlu:nutrition|5": 0,
+     "leaderboard|mmlu:philosophy|5": 0,
+     "leaderboard|mmlu:prehistory|5": 0,
+     "leaderboard|mmlu:professional_accounting|5": 0,
+     "leaderboard|mmlu:professional_law|5": 0,
+     "leaderboard|mmlu:professional_medicine|5": 0,
+     "leaderboard|mmlu:professional_psychology|5": 0,
+     "leaderboard|mmlu:public_relations|5": 0,
+     "leaderboard|mmlu:security_studies|5": 0,
+     "leaderboard|mmlu:sociology|5": 0,
+     "leaderboard|mmlu:us_foreign_policy|5": 0,
+     "leaderboard|mmlu:virology|5": 0,
+     "leaderboard|mmlu:world_religions|5": 0
+   },
+   "config_tasks": {
+     "leaderboard|mmlu:abstract_algebra": {
+       "name": "mmlu:abstract_algebra",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "abstract_algebra",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:anatomy": {
+       "name": "mmlu:anatomy",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "anatomy",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 135,
+       "effective_num_docs": 135,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:astronomy": {
+       "name": "mmlu:astronomy",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "astronomy",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 152,
+       "effective_num_docs": 152,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:business_ethics": {
+       "name": "mmlu:business_ethics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "business_ethics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:clinical_knowledge": {
+       "name": "mmlu:clinical_knowledge",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "clinical_knowledge",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 265,
+       "effective_num_docs": 265,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:college_biology": {
+       "name": "mmlu:college_biology",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_biology",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 144,
+       "effective_num_docs": 144,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:college_chemistry": {
+       "name": "mmlu:college_chemistry",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_chemistry",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:college_computer_science": {
+       "name": "mmlu:college_computer_science",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_computer_science",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:college_mathematics": {
+       "name": "mmlu:college_mathematics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_mathematics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:college_medicine": {
+       "name": "mmlu:college_medicine",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_medicine",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 173,
+       "effective_num_docs": 173,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:college_physics": {
+       "name": "mmlu:college_physics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_physics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 102,
+       "effective_num_docs": 102,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:computer_security": {
+       "name": "mmlu:computer_security",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "computer_security",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:conceptual_physics": {
+       "name": "mmlu:conceptual_physics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "conceptual_physics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 235,
+       "effective_num_docs": 235,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:econometrics": {
+       "name": "mmlu:econometrics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "econometrics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 114,
+       "effective_num_docs": 114,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:electrical_engineering": {
+       "name": "mmlu:electrical_engineering",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "electrical_engineering",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 145,
+       "effective_num_docs": 145,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:elementary_mathematics": {
+       "name": "mmlu:elementary_mathematics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "elementary_mathematics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 378,
+       "effective_num_docs": 378,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:formal_logic": {
+       "name": "mmlu:formal_logic",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "formal_logic",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 126,
+       "effective_num_docs": 126,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:global_facts": {
+       "name": "mmlu:global_facts",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "global_facts",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:high_school_biology": {
+       "name": "mmlu:high_school_biology",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "high_school_biology",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 310,
+       "effective_num_docs": 310,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:high_school_chemistry": {
+       "name": "mmlu:high_school_chemistry",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "high_school_chemistry",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 203,
+       "effective_num_docs": 203,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:high_school_computer_science": {
+       "name": "mmlu:high_school_computer_science",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "high_school_computer_science",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true
+     },
+     "leaderboard|mmlu:high_school_european_history": {
+       "name": "mmlu:high_school_european_history",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "high_school_european_history",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 165,
1034
+ "effective_num_docs": 165,
1035
+ "trust_dataset": true
1036
+ },
1037
+ "leaderboard|mmlu:high_school_geography": {
1038
+ "name": "mmlu:high_school_geography",
1039
+ "prompt_function": "mmlu_harness",
1040
+ "hf_repo": "lighteval/mmlu",
1041
+ "hf_subset": "high_school_geography",
1042
+ "metric": [
1043
+ "loglikelihood_acc"
1044
+ ],
1045
+ "hf_avail_splits": [
1046
+ "auxiliary_train",
1047
+ "test",
1048
+ "validation",
1049
+ "dev"
1050
+ ],
1051
+ "evaluation_splits": [
1052
+ "test"
1053
+ ],
1054
+ "few_shots_split": "dev",
1055
+ "few_shots_select": "sequential",
1056
+ "generation_size": 1,
1057
+ "stop_sequence": [
1058
+ "\n"
1059
+ ],
1060
+ "output_regex": null,
1061
+ "frozen": false,
1062
+ "suite": [
1063
+ "leaderboard",
1064
+ "mmlu"
1065
+ ],
1066
+ "original_num_docs": 198,
1067
+ "effective_num_docs": 198,
1068
+ "trust_dataset": true
1069
+ },
1070
+ "leaderboard|mmlu:high_school_government_and_politics": {
1071
+ "name": "mmlu:high_school_government_and_politics",
1072
+ "prompt_function": "mmlu_harness",
1073
+ "hf_repo": "lighteval/mmlu",
1074
+ "hf_subset": "high_school_government_and_politics",
1075
+ "metric": [
1076
+ "loglikelihood_acc"
1077
+ ],
1078
+ "hf_avail_splits": [
1079
+ "auxiliary_train",
1080
+ "test",
1081
+ "validation",
1082
+ "dev"
1083
+ ],
1084
+ "evaluation_splits": [
1085
+ "test"
1086
+ ],
1087
+ "few_shots_split": "dev",
1088
+ "few_shots_select": "sequential",
1089
+ "generation_size": 1,
1090
+ "stop_sequence": [
1091
+ "\n"
1092
+ ],
1093
+ "output_regex": null,
1094
+ "frozen": false,
1095
+ "suite": [
1096
+ "leaderboard",
1097
+ "mmlu"
1098
+ ],
1099
+ "original_num_docs": 193,
1100
+ "effective_num_docs": 193,
1101
+ "trust_dataset": true
1102
+ },
1103
+ "leaderboard|mmlu:high_school_macroeconomics": {
1104
+ "name": "mmlu:high_school_macroeconomics",
1105
+ "prompt_function": "mmlu_harness",
1106
+ "hf_repo": "lighteval/mmlu",
1107
+ "hf_subset": "high_school_macroeconomics",
1108
+ "metric": [
1109
+ "loglikelihood_acc"
1110
+ ],
1111
+ "hf_avail_splits": [
1112
+ "auxiliary_train",
1113
+ "test",
1114
+ "validation",
1115
+ "dev"
1116
+ ],
1117
+ "evaluation_splits": [
1118
+ "test"
1119
+ ],
1120
+ "few_shots_split": "dev",
1121
+ "few_shots_select": "sequential",
1122
+ "generation_size": 1,
1123
+ "stop_sequence": [
1124
+ "\n"
1125
+ ],
1126
+ "output_regex": null,
1127
+ "frozen": false,
1128
+ "suite": [
1129
+ "leaderboard",
1130
+ "mmlu"
1131
+ ],
1132
+ "original_num_docs": 390,
1133
+ "effective_num_docs": 390,
1134
+ "trust_dataset": true
1135
+ },
1136
+ "leaderboard|mmlu:high_school_mathematics": {
1137
+ "name": "mmlu:high_school_mathematics",
1138
+ "prompt_function": "mmlu_harness",
1139
+ "hf_repo": "lighteval/mmlu",
1140
+ "hf_subset": "high_school_mathematics",
1141
+ "metric": [
1142
+ "loglikelihood_acc"
1143
+ ],
1144
+ "hf_avail_splits": [
1145
+ "auxiliary_train",
1146
+ "test",
1147
+ "validation",
1148
+ "dev"
1149
+ ],
1150
+ "evaluation_splits": [
1151
+ "test"
1152
+ ],
1153
+ "few_shots_split": "dev",
1154
+ "few_shots_select": "sequential",
1155
+ "generation_size": 1,
1156
+ "stop_sequence": [
1157
+ "\n"
1158
+ ],
1159
+ "output_regex": null,
1160
+ "frozen": false,
1161
+ "suite": [
1162
+ "leaderboard",
1163
+ "mmlu"
1164
+ ],
1165
+ "original_num_docs": 270,
1166
+ "effective_num_docs": 270,
1167
+ "trust_dataset": true
1168
+ },
1169
+ "leaderboard|mmlu:high_school_microeconomics": {
1170
+ "name": "mmlu:high_school_microeconomics",
1171
+ "prompt_function": "mmlu_harness",
1172
+ "hf_repo": "lighteval/mmlu",
1173
+ "hf_subset": "high_school_microeconomics",
1174
+ "metric": [
1175
+ "loglikelihood_acc"
1176
+ ],
1177
+ "hf_avail_splits": [
1178
+ "auxiliary_train",
1179
+ "test",
1180
+ "validation",
1181
+ "dev"
1182
+ ],
1183
+ "evaluation_splits": [
1184
+ "test"
1185
+ ],
1186
+ "few_shots_split": "dev",
1187
+ "few_shots_select": "sequential",
1188
+ "generation_size": 1,
1189
+ "stop_sequence": [
1190
+ "\n"
1191
+ ],
1192
+ "output_regex": null,
1193
+ "frozen": false,
1194
+ "suite": [
1195
+ "leaderboard",
1196
+ "mmlu"
1197
+ ],
1198
+ "original_num_docs": 238,
1199
+ "effective_num_docs": 238,
1200
+ "trust_dataset": true
1201
+ },
1202
+ "leaderboard|mmlu:high_school_physics": {
1203
+ "name": "mmlu:high_school_physics",
1204
+ "prompt_function": "mmlu_harness",
1205
+ "hf_repo": "lighteval/mmlu",
1206
+ "hf_subset": "high_school_physics",
1207
+ "metric": [
1208
+ "loglikelihood_acc"
1209
+ ],
1210
+ "hf_avail_splits": [
1211
+ "auxiliary_train",
1212
+ "test",
1213
+ "validation",
1214
+ "dev"
1215
+ ],
1216
+ "evaluation_splits": [
1217
+ "test"
1218
+ ],
1219
+ "few_shots_split": "dev",
1220
+ "few_shots_select": "sequential",
1221
+ "generation_size": 1,
1222
+ "stop_sequence": [
1223
+ "\n"
1224
+ ],
1225
+ "output_regex": null,
1226
+ "frozen": false,
1227
+ "suite": [
1228
+ "leaderboard",
1229
+ "mmlu"
1230
+ ],
1231
+ "original_num_docs": 151,
1232
+ "effective_num_docs": 151,
1233
+ "trust_dataset": true
1234
+ },
1235
+ "leaderboard|mmlu:high_school_psychology": {
1236
+ "name": "mmlu:high_school_psychology",
1237
+ "prompt_function": "mmlu_harness",
1238
+ "hf_repo": "lighteval/mmlu",
1239
+ "hf_subset": "high_school_psychology",
1240
+ "metric": [
1241
+ "loglikelihood_acc"
1242
+ ],
1243
+ "hf_avail_splits": [
1244
+ "auxiliary_train",
1245
+ "test",
1246
+ "validation",
1247
+ "dev"
1248
+ ],
1249
+ "evaluation_splits": [
1250
+ "test"
1251
+ ],
1252
+ "few_shots_split": "dev",
1253
+ "few_shots_select": "sequential",
1254
+ "generation_size": 1,
1255
+ "stop_sequence": [
1256
+ "\n"
1257
+ ],
1258
+ "output_regex": null,
1259
+ "frozen": false,
1260
+ "suite": [
1261
+ "leaderboard",
1262
+ "mmlu"
1263
+ ],
1264
+ "original_num_docs": 545,
1265
+ "effective_num_docs": 545,
1266
+ "trust_dataset": true
1267
+ },
1268
+ "leaderboard|mmlu:high_school_statistics": {
1269
+ "name": "mmlu:high_school_statistics",
1270
+ "prompt_function": "mmlu_harness",
1271
+ "hf_repo": "lighteval/mmlu",
1272
+ "hf_subset": "high_school_statistics",
1273
+ "metric": [
1274
+ "loglikelihood_acc"
1275
+ ],
1276
+ "hf_avail_splits": [
1277
+ "auxiliary_train",
1278
+ "test",
1279
+ "validation",
1280
+ "dev"
1281
+ ],
1282
+ "evaluation_splits": [
1283
+ "test"
1284
+ ],
1285
+ "few_shots_split": "dev",
1286
+ "few_shots_select": "sequential",
1287
+ "generation_size": 1,
1288
+ "stop_sequence": [
1289
+ "\n"
1290
+ ],
1291
+ "output_regex": null,
1292
+ "frozen": false,
1293
+ "suite": [
1294
+ "leaderboard",
1295
+ "mmlu"
1296
+ ],
1297
+ "original_num_docs": 216,
1298
+ "effective_num_docs": 216,
1299
+ "trust_dataset": true
1300
+ },
1301
+ "leaderboard|mmlu:high_school_us_history": {
1302
+ "name": "mmlu:high_school_us_history",
1303
+ "prompt_function": "mmlu_harness",
1304
+ "hf_repo": "lighteval/mmlu",
1305
+ "hf_subset": "high_school_us_history",
1306
+ "metric": [
1307
+ "loglikelihood_acc"
1308
+ ],
1309
+ "hf_avail_splits": [
1310
+ "auxiliary_train",
1311
+ "test",
1312
+ "validation",
1313
+ "dev"
1314
+ ],
1315
+ "evaluation_splits": [
1316
+ "test"
1317
+ ],
1318
+ "few_shots_split": "dev",
1319
+ "few_shots_select": "sequential",
1320
+ "generation_size": 1,
1321
+ "stop_sequence": [
1322
+ "\n"
1323
+ ],
1324
+ "output_regex": null,
1325
+ "frozen": false,
1326
+ "suite": [
1327
+ "leaderboard",
1328
+ "mmlu"
1329
+ ],
1330
+ "original_num_docs": 204,
1331
+ "effective_num_docs": 204,
1332
+ "trust_dataset": true
1333
+ },
1334
+ "leaderboard|mmlu:high_school_world_history": {
1335
+ "name": "mmlu:high_school_world_history",
1336
+ "prompt_function": "mmlu_harness",
1337
+ "hf_repo": "lighteval/mmlu",
1338
+ "hf_subset": "high_school_world_history",
1339
+ "metric": [
1340
+ "loglikelihood_acc"
1341
+ ],
1342
+ "hf_avail_splits": [
1343
+ "auxiliary_train",
1344
+ "test",
1345
+ "validation",
1346
+ "dev"
1347
+ ],
1348
+ "evaluation_splits": [
1349
+ "test"
1350
+ ],
1351
+ "few_shots_split": "dev",
1352
+ "few_shots_select": "sequential",
1353
+ "generation_size": 1,
1354
+ "stop_sequence": [
1355
+ "\n"
1356
+ ],
1357
+ "output_regex": null,
1358
+ "frozen": false,
1359
+ "suite": [
1360
+ "leaderboard",
1361
+ "mmlu"
1362
+ ],
1363
+ "original_num_docs": 237,
1364
+ "effective_num_docs": 237,
1365
+ "trust_dataset": true
1366
+ },
1367
+ "leaderboard|mmlu:human_aging": {
1368
+ "name": "mmlu:human_aging",
1369
+ "prompt_function": "mmlu_harness",
1370
+ "hf_repo": "lighteval/mmlu",
1371
+ "hf_subset": "human_aging",
1372
+ "metric": [
1373
+ "loglikelihood_acc"
1374
+ ],
1375
+ "hf_avail_splits": [
1376
+ "auxiliary_train",
1377
+ "test",
1378
+ "validation",
1379
+ "dev"
1380
+ ],
1381
+ "evaluation_splits": [
1382
+ "test"
1383
+ ],
1384
+ "few_shots_split": "dev",
1385
+ "few_shots_select": "sequential",
1386
+ "generation_size": 1,
1387
+ "stop_sequence": [
1388
+ "\n"
1389
+ ],
1390
+ "output_regex": null,
1391
+ "frozen": false,
1392
+ "suite": [
1393
+ "leaderboard",
1394
+ "mmlu"
1395
+ ],
1396
+ "original_num_docs": 223,
1397
+ "effective_num_docs": 223,
1398
+ "trust_dataset": true
1399
+ },
1400
+ "leaderboard|mmlu:human_sexuality": {
1401
+ "name": "mmlu:human_sexuality",
1402
+ "prompt_function": "mmlu_harness",
1403
+ "hf_repo": "lighteval/mmlu",
1404
+ "hf_subset": "human_sexuality",
1405
+ "metric": [
1406
+ "loglikelihood_acc"
1407
+ ],
1408
+ "hf_avail_splits": [
1409
+ "auxiliary_train",
1410
+ "test",
1411
+ "validation",
1412
+ "dev"
1413
+ ],
1414
+ "evaluation_splits": [
1415
+ "test"
1416
+ ],
1417
+ "few_shots_split": "dev",
1418
+ "few_shots_select": "sequential",
1419
+ "generation_size": 1,
1420
+ "stop_sequence": [
1421
+ "\n"
1422
+ ],
1423
+ "output_regex": null,
1424
+ "frozen": false,
1425
+ "suite": [
1426
+ "leaderboard",
1427
+ "mmlu"
1428
+ ],
1429
+ "original_num_docs": 131,
1430
+ "effective_num_docs": 131,
1431
+ "trust_dataset": true
1432
+ },
1433
+ "leaderboard|mmlu:international_law": {
1434
+ "name": "mmlu:international_law",
1435
+ "prompt_function": "mmlu_harness",
1436
+ "hf_repo": "lighteval/mmlu",
1437
+ "hf_subset": "international_law",
1438
+ "metric": [
1439
+ "loglikelihood_acc"
1440
+ ],
1441
+ "hf_avail_splits": [
1442
+ "auxiliary_train",
1443
+ "test",
1444
+ "validation",
1445
+ "dev"
1446
+ ],
1447
+ "evaluation_splits": [
1448
+ "test"
1449
+ ],
1450
+ "few_shots_split": "dev",
1451
+ "few_shots_select": "sequential",
1452
+ "generation_size": 1,
1453
+ "stop_sequence": [
1454
+ "\n"
1455
+ ],
1456
+ "output_regex": null,
1457
+ "frozen": false,
1458
+ "suite": [
1459
+ "leaderboard",
1460
+ "mmlu"
1461
+ ],
1462
+ "original_num_docs": 121,
1463
+ "effective_num_docs": 121,
1464
+ "trust_dataset": true
1465
+ },
1466
+ "leaderboard|mmlu:jurisprudence": {
1467
+ "name": "mmlu:jurisprudence",
1468
+ "prompt_function": "mmlu_harness",
1469
+ "hf_repo": "lighteval/mmlu",
1470
+ "hf_subset": "jurisprudence",
1471
+ "metric": [
1472
+ "loglikelihood_acc"
1473
+ ],
1474
+ "hf_avail_splits": [
1475
+ "auxiliary_train",
1476
+ "test",
1477
+ "validation",
1478
+ "dev"
1479
+ ],
1480
+ "evaluation_splits": [
1481
+ "test"
1482
+ ],
1483
+ "few_shots_split": "dev",
1484
+ "few_shots_select": "sequential",
1485
+ "generation_size": 1,
1486
+ "stop_sequence": [
1487
+ "\n"
1488
+ ],
1489
+ "output_regex": null,
1490
+ "frozen": false,
1491
+ "suite": [
1492
+ "leaderboard",
1493
+ "mmlu"
1494
+ ],
1495
+ "original_num_docs": 108,
1496
+ "effective_num_docs": 108,
1497
+ "trust_dataset": true
1498
+ },
1499
+ "leaderboard|mmlu:logical_fallacies": {
1500
+ "name": "mmlu:logical_fallacies",
1501
+ "prompt_function": "mmlu_harness",
1502
+ "hf_repo": "lighteval/mmlu",
1503
+ "hf_subset": "logical_fallacies",
1504
+ "metric": [
1505
+ "loglikelihood_acc"
1506
+ ],
1507
+ "hf_avail_splits": [
1508
+ "auxiliary_train",
1509
+ "test",
1510
+ "validation",
1511
+ "dev"
1512
+ ],
1513
+ "evaluation_splits": [
1514
+ "test"
1515
+ ],
1516
+ "few_shots_split": "dev",
1517
+ "few_shots_select": "sequential",
1518
+ "generation_size": 1,
1519
+ "stop_sequence": [
1520
+ "\n"
1521
+ ],
1522
+ "output_regex": null,
1523
+ "frozen": false,
1524
+ "suite": [
1525
+ "leaderboard",
1526
+ "mmlu"
1527
+ ],
1528
+ "original_num_docs": 163,
1529
+ "effective_num_docs": 163,
1530
+ "trust_dataset": true
1531
+ },
1532
+ "leaderboard|mmlu:machine_learning": {
1533
+ "name": "mmlu:machine_learning",
1534
+ "prompt_function": "mmlu_harness",
1535
+ "hf_repo": "lighteval/mmlu",
1536
+ "hf_subset": "machine_learning",
1537
+ "metric": [
1538
+ "loglikelihood_acc"
1539
+ ],
1540
+ "hf_avail_splits": [
1541
+ "auxiliary_train",
1542
+ "test",
1543
+ "validation",
1544
+ "dev"
1545
+ ],
1546
+ "evaluation_splits": [
1547
+ "test"
1548
+ ],
1549
+ "few_shots_split": "dev",
1550
+ "few_shots_select": "sequential",
1551
+ "generation_size": 1,
1552
+ "stop_sequence": [
1553
+ "\n"
1554
+ ],
1555
+ "output_regex": null,
1556
+ "frozen": false,
1557
+ "suite": [
1558
+ "leaderboard",
1559
+ "mmlu"
1560
+ ],
1561
+ "original_num_docs": 112,
1562
+ "effective_num_docs": 112,
1563
+ "trust_dataset": true
1564
+ },
1565
+ "leaderboard|mmlu:management": {
1566
+ "name": "mmlu:management",
1567
+ "prompt_function": "mmlu_harness",
1568
+ "hf_repo": "lighteval/mmlu",
1569
+ "hf_subset": "management",
1570
+ "metric": [
1571
+ "loglikelihood_acc"
1572
+ ],
1573
+ "hf_avail_splits": [
1574
+ "auxiliary_train",
1575
+ "test",
1576
+ "validation",
1577
+ "dev"
1578
+ ],
1579
+ "evaluation_splits": [
1580
+ "test"
1581
+ ],
1582
+ "few_shots_split": "dev",
1583
+ "few_shots_select": "sequential",
1584
+ "generation_size": 1,
1585
+ "stop_sequence": [
1586
+ "\n"
1587
+ ],
1588
+ "output_regex": null,
1589
+ "frozen": false,
1590
+ "suite": [
1591
+ "leaderboard",
1592
+ "mmlu"
1593
+ ],
1594
+ "original_num_docs": 103,
1595
+ "effective_num_docs": 103,
1596
+ "trust_dataset": true
1597
+ },
1598
+ "leaderboard|mmlu:marketing": {
1599
+ "name": "mmlu:marketing",
1600
+ "prompt_function": "mmlu_harness",
1601
+ "hf_repo": "lighteval/mmlu",
1602
+ "hf_subset": "marketing",
1603
+ "metric": [
1604
+ "loglikelihood_acc"
1605
+ ],
1606
+ "hf_avail_splits": [
1607
+ "auxiliary_train",
1608
+ "test",
1609
+ "validation",
1610
+ "dev"
1611
+ ],
1612
+ "evaluation_splits": [
1613
+ "test"
1614
+ ],
1615
+ "few_shots_split": "dev",
1616
+ "few_shots_select": "sequential",
1617
+ "generation_size": 1,
1618
+ "stop_sequence": [
1619
+ "\n"
1620
+ ],
1621
+ "output_regex": null,
1622
+ "frozen": false,
1623
+ "suite": [
1624
+ "leaderboard",
1625
+ "mmlu"
1626
+ ],
1627
+ "original_num_docs": 234,
1628
+ "effective_num_docs": 234,
1629
+ "trust_dataset": true
1630
+ },
1631
+ "leaderboard|mmlu:medical_genetics": {
1632
+ "name": "mmlu:medical_genetics",
1633
+ "prompt_function": "mmlu_harness",
1634
+ "hf_repo": "lighteval/mmlu",
1635
+ "hf_subset": "medical_genetics",
1636
+ "metric": [
1637
+ "loglikelihood_acc"
1638
+ ],
1639
+ "hf_avail_splits": [
1640
+ "auxiliary_train",
1641
+ "test",
1642
+ "validation",
1643
+ "dev"
1644
+ ],
1645
+ "evaluation_splits": [
1646
+ "test"
1647
+ ],
1648
+ "few_shots_split": "dev",
1649
+ "few_shots_select": "sequential",
1650
+ "generation_size": 1,
1651
+ "stop_sequence": [
1652
+ "\n"
1653
+ ],
1654
+ "output_regex": null,
1655
+ "frozen": false,
1656
+ "suite": [
1657
+ "leaderboard",
1658
+ "mmlu"
1659
+ ],
1660
+ "original_num_docs": 100,
1661
+ "effective_num_docs": 100,
1662
+ "trust_dataset": true
1663
+ },
1664
+ "leaderboard|mmlu:miscellaneous": {
1665
+ "name": "mmlu:miscellaneous",
1666
+ "prompt_function": "mmlu_harness",
1667
+ "hf_repo": "lighteval/mmlu",
1668
+ "hf_subset": "miscellaneous",
1669
+ "metric": [
1670
+ "loglikelihood_acc"
1671
+ ],
1672
+ "hf_avail_splits": [
1673
+ "auxiliary_train",
1674
+ "test",
1675
+ "validation",
1676
+ "dev"
1677
+ ],
1678
+ "evaluation_splits": [
1679
+ "test"
1680
+ ],
1681
+ "few_shots_split": "dev",
1682
+ "few_shots_select": "sequential",
1683
+ "generation_size": 1,
1684
+ "stop_sequence": [
1685
+ "\n"
1686
+ ],
1687
+ "output_regex": null,
1688
+ "frozen": false,
1689
+ "suite": [
1690
+ "leaderboard",
1691
+ "mmlu"
1692
+ ],
1693
+ "original_num_docs": 783,
1694
+ "effective_num_docs": 783,
1695
+ "trust_dataset": true
1696
+ },
1697
+ "leaderboard|mmlu:moral_disputes": {
1698
+ "name": "mmlu:moral_disputes",
1699
+ "prompt_function": "mmlu_harness",
1700
+ "hf_repo": "lighteval/mmlu",
1701
+ "hf_subset": "moral_disputes",
1702
+ "metric": [
1703
+ "loglikelihood_acc"
1704
+ ],
1705
+ "hf_avail_splits": [
1706
+ "auxiliary_train",
1707
+ "test",
1708
+ "validation",
1709
+ "dev"
1710
+ ],
1711
+ "evaluation_splits": [
1712
+ "test"
1713
+ ],
1714
+ "few_shots_split": "dev",
1715
+ "few_shots_select": "sequential",
1716
+ "generation_size": 1,
1717
+ "stop_sequence": [
1718
+ "\n"
1719
+ ],
1720
+ "output_regex": null,
1721
+ "frozen": false,
1722
+ "suite": [
1723
+ "leaderboard",
1724
+ "mmlu"
1725
+ ],
1726
+ "original_num_docs": 346,
1727
+ "effective_num_docs": 346,
1728
+ "trust_dataset": true
1729
+ },
1730
+ "leaderboard|mmlu:moral_scenarios": {
1731
+ "name": "mmlu:moral_scenarios",
1732
+ "prompt_function": "mmlu_harness",
1733
+ "hf_repo": "lighteval/mmlu",
1734
+ "hf_subset": "moral_scenarios",
1735
+ "metric": [
1736
+ "loglikelihood_acc"
1737
+ ],
1738
+ "hf_avail_splits": [
1739
+ "auxiliary_train",
1740
+ "test",
1741
+ "validation",
1742
+ "dev"
1743
+ ],
1744
+ "evaluation_splits": [
1745
+ "test"
1746
+ ],
1747
+ "few_shots_split": "dev",
1748
+ "few_shots_select": "sequential",
1749
+ "generation_size": 1,
1750
+ "stop_sequence": [
1751
+ "\n"
1752
+ ],
1753
+ "output_regex": null,
1754
+ "frozen": false,
1755
+ "suite": [
1756
+ "leaderboard",
1757
+ "mmlu"
1758
+ ],
1759
+ "original_num_docs": 895,
1760
+ "effective_num_docs": 895,
1761
+ "trust_dataset": true
1762
+ },
1763
+ "leaderboard|mmlu:nutrition": {
1764
+ "name": "mmlu:nutrition",
1765
+ "prompt_function": "mmlu_harness",
1766
+ "hf_repo": "lighteval/mmlu",
1767
+ "hf_subset": "nutrition",
1768
+ "metric": [
1769
+ "loglikelihood_acc"
1770
+ ],
1771
+ "hf_avail_splits": [
1772
+ "auxiliary_train",
1773
+ "test",
1774
+ "validation",
1775
+ "dev"
1776
+ ],
1777
+ "evaluation_splits": [
1778
+ "test"
1779
+ ],
1780
+ "few_shots_split": "dev",
1781
+ "few_shots_select": "sequential",
1782
+ "generation_size": 1,
1783
+ "stop_sequence": [
1784
+ "\n"
1785
+ ],
1786
+ "output_regex": null,
1787
+ "frozen": false,
1788
+ "suite": [
1789
+ "leaderboard",
1790
+ "mmlu"
1791
+ ],
1792
+ "original_num_docs": 306,
1793
+ "effective_num_docs": 306,
1794
+ "trust_dataset": true
1795
+ },
1796
+ "leaderboard|mmlu:philosophy": {
1797
+ "name": "mmlu:philosophy",
1798
+ "prompt_function": "mmlu_harness",
1799
+ "hf_repo": "lighteval/mmlu",
1800
+ "hf_subset": "philosophy",
1801
+ "metric": [
1802
+ "loglikelihood_acc"
1803
+ ],
1804
+ "hf_avail_splits": [
1805
+ "auxiliary_train",
1806
+ "test",
1807
+ "validation",
1808
+ "dev"
1809
+ ],
1810
+ "evaluation_splits": [
1811
+ "test"
1812
+ ],
1813
+ "few_shots_split": "dev",
1814
+ "few_shots_select": "sequential",
1815
+ "generation_size": 1,
1816
+ "stop_sequence": [
1817
+ "\n"
1818
+ ],
1819
+ "output_regex": null,
1820
+ "frozen": false,
1821
+ "suite": [
1822
+ "leaderboard",
1823
+ "mmlu"
1824
+ ],
1825
+ "original_num_docs": 311,
1826
+ "effective_num_docs": 311,
1827
+ "trust_dataset": true
1828
+ },
1829
+ "leaderboard|mmlu:prehistory": {
1830
+ "name": "mmlu:prehistory",
1831
+ "prompt_function": "mmlu_harness",
1832
+ "hf_repo": "lighteval/mmlu",
1833
+ "hf_subset": "prehistory",
1834
+ "metric": [
1835
+ "loglikelihood_acc"
1836
+ ],
1837
+ "hf_avail_splits": [
1838
+ "auxiliary_train",
1839
+ "test",
1840
+ "validation",
1841
+ "dev"
1842
+ ],
1843
+ "evaluation_splits": [
1844
+ "test"
1845
+ ],
1846
+ "few_shots_split": "dev",
1847
+ "few_shots_select": "sequential",
1848
+ "generation_size": 1,
1849
+ "stop_sequence": [
1850
+ "\n"
1851
+ ],
1852
+ "output_regex": null,
1853
+ "frozen": false,
1854
+ "suite": [
1855
+ "leaderboard",
1856
+ "mmlu"
1857
+ ],
1858
+ "original_num_docs": 324,
1859
+ "effective_num_docs": 324,
1860
+ "trust_dataset": true
1861
+ },
1862
+ "leaderboard|mmlu:professional_accounting": {
1863
+ "name": "mmlu:professional_accounting",
1864
+ "prompt_function": "mmlu_harness",
1865
+ "hf_repo": "lighteval/mmlu",
1866
+ "hf_subset": "professional_accounting",
1867
+ "metric": [
1868
+ "loglikelihood_acc"
1869
+ ],
1870
+ "hf_avail_splits": [
1871
+ "auxiliary_train",
1872
+ "test",
1873
+ "validation",
1874
+ "dev"
1875
+ ],
1876
+ "evaluation_splits": [
1877
+ "test"
1878
+ ],
1879
+ "few_shots_split": "dev",
1880
+ "few_shots_select": "sequential",
1881
+ "generation_size": 1,
1882
+ "stop_sequence": [
1883
+ "\n"
1884
+ ],
1885
+ "output_regex": null,
1886
+ "frozen": false,
1887
+ "suite": [
1888
+ "leaderboard",
1889
+ "mmlu"
1890
+ ],
1891
+ "original_num_docs": 282,
1892
+ "effective_num_docs": 282,
1893
+ "trust_dataset": true
1894
+ },
1895
+ "leaderboard|mmlu:professional_law": {
1896
+ "name": "mmlu:professional_law",
1897
+ "prompt_function": "mmlu_harness",
1898
+ "hf_repo": "lighteval/mmlu",
1899
+ "hf_subset": "professional_law",
1900
+ "metric": [
1901
+ "loglikelihood_acc"
1902
+ ],
1903
+ "hf_avail_splits": [
1904
+ "auxiliary_train",
1905
+ "test",
1906
+ "validation",
1907
+ "dev"
1908
+ ],
1909
+ "evaluation_splits": [
1910
+ "test"
1911
+ ],
1912
+ "few_shots_split": "dev",
1913
+ "few_shots_select": "sequential",
1914
+ "generation_size": 1,
1915
+ "stop_sequence": [
1916
+ "\n"
1917
+ ],
1918
+ "output_regex": null,
1919
+ "frozen": false,
1920
+ "suite": [
1921
+ "leaderboard",
1922
+ "mmlu"
1923
+ ],
1924
+ "original_num_docs": 1534,
1925
+ "effective_num_docs": 1534,
1926
+ "trust_dataset": true
1927
+ },
1928
+ "leaderboard|mmlu:professional_medicine": {
1929
+ "name": "mmlu:professional_medicine",
1930
+ "prompt_function": "mmlu_harness",
1931
+ "hf_repo": "lighteval/mmlu",
1932
+ "hf_subset": "professional_medicine",
1933
+ "metric": [
1934
+ "loglikelihood_acc"
1935
+ ],
1936
+ "hf_avail_splits": [
1937
+ "auxiliary_train",
1938
+ "test",
1939
+ "validation",
1940
+ "dev"
1941
+ ],
1942
+ "evaluation_splits": [
1943
+ "test"
1944
+ ],
1945
+ "few_shots_split": "dev",
1946
+ "few_shots_select": "sequential",
1947
+ "generation_size": 1,
1948
+ "stop_sequence": [
1949
+ "\n"
1950
+ ],
1951
+ "output_regex": null,
1952
+ "frozen": false,
1953
+ "suite": [
1954
+ "leaderboard",
1955
+ "mmlu"
1956
+ ],
1957
+ "original_num_docs": 272,
1958
+ "effective_num_docs": 272,
1959
+ "trust_dataset": true
1960
+ },
1961
+ "leaderboard|mmlu:professional_psychology": {
1962
+ "name": "mmlu:professional_psychology",
1963
+ "prompt_function": "mmlu_harness",
1964
+ "hf_repo": "lighteval/mmlu",
1965
+ "hf_subset": "professional_psychology",
1966
+ "metric": [
1967
+ "loglikelihood_acc"
1968
+ ],
1969
+ "hf_avail_splits": [
1970
+ "auxiliary_train",
1971
+ "test",
1972
+ "validation",
1973
+ "dev"
1974
+ ],
1975
+ "evaluation_splits": [
1976
+ "test"
1977
+ ],
1978
+ "few_shots_split": "dev",
1979
+ "few_shots_select": "sequential",
1980
+ "generation_size": 1,
1981
+ "stop_sequence": [
1982
+ "\n"
1983
+ ],
1984
+ "output_regex": null,
1985
+ "frozen": false,
1986
+ "suite": [
1987
+ "leaderboard",
1988
+ "mmlu"
1989
+ ],
1990
+ "original_num_docs": 612,
1991
+ "effective_num_docs": 612,
1992
+ "trust_dataset": true
1993
+ },
1994
+ "leaderboard|mmlu:public_relations": {
1995
+ "name": "mmlu:public_relations",
1996
+ "prompt_function": "mmlu_harness",
1997
+ "hf_repo": "lighteval/mmlu",
1998
+ "hf_subset": "public_relations",
1999
+ "metric": [
2000
+ "loglikelihood_acc"
2001
+ ],
2002
+ "hf_avail_splits": [
2003
+ "auxiliary_train",
2004
+ "test",
2005
+ "validation",
2006
+ "dev"
2007
+ ],
2008
+ "evaluation_splits": [
2009
+ "test"
2010
+ ],
2011
+ "few_shots_split": "dev",
2012
+ "few_shots_select": "sequential",
2013
+ "generation_size": 1,
2014
+ "stop_sequence": [
2015
+ "\n"
2016
+ ],
2017
+ "output_regex": null,
2018
+ "frozen": false,
2019
+ "suite": [
2020
+ "leaderboard",
2021
+ "mmlu"
2022
+ ],
2023
+ "original_num_docs": 110,
2024
+ "effective_num_docs": 110,
2025
+ "trust_dataset": true
2026
+ },
2027
+ "leaderboard|mmlu:security_studies": {
2028
+ "name": "mmlu:security_studies",
2029
+ "prompt_function": "mmlu_harness",
2030
+ "hf_repo": "lighteval/mmlu",
2031
+ "hf_subset": "security_studies",
2032
+ "metric": [
2033
+ "loglikelihood_acc"
2034
+ ],
2035
+ "hf_avail_splits": [
2036
+ "auxiliary_train",
2037
+ "test",
2038
+ "validation",
2039
+ "dev"
2040
+ ],
2041
+ "evaluation_splits": [
2042
+ "test"
2043
+ ],
2044
+ "few_shots_split": "dev",
2045
+ "few_shots_select": "sequential",
2046
+ "generation_size": 1,
2047
+ "stop_sequence": [
2048
+ "\n"
2049
+ ],
2050
+ "output_regex": null,
2051
+ "frozen": false,
2052
+ "suite": [
2053
+ "leaderboard",
2054
+ "mmlu"
2055
+ ],
2056
+ "original_num_docs": 245,
2057
+ "effective_num_docs": 245,
2058
+ "trust_dataset": true
2059
+ },
2060
+ "leaderboard|mmlu:sociology": {
2061
+ "name": "mmlu:sociology",
2062
+ "prompt_function": "mmlu_harness",
2063
+ "hf_repo": "lighteval/mmlu",
2064
+ "hf_subset": "sociology",
2065
+ "metric": [
2066
+ "loglikelihood_acc"
2067
+ ],
2068
+ "hf_avail_splits": [
2069
+ "auxiliary_train",
2070
+ "test",
2071
+ "validation",
2072
+ "dev"
2073
+ ],
2074
+ "evaluation_splits": [
2075
+ "test"
2076
+ ],
2077
+ "few_shots_split": "dev",
2078
+ "few_shots_select": "sequential",
2079
+ "generation_size": 1,
2080
+ "stop_sequence": [
2081
+ "\n"
2082
+ ],
2083
+ "output_regex": null,
2084
+ "frozen": false,
2085
+ "suite": [
2086
+ "leaderboard",
2087
+ "mmlu"
2088
+ ],
2089
+ "original_num_docs": 201,
2090
+ "effective_num_docs": 201,
2091
+ "trust_dataset": true
2092
+ },
2093
+ "leaderboard|mmlu:us_foreign_policy": {
2094
+ "name": "mmlu:us_foreign_policy",
2095
+ "prompt_function": "mmlu_harness",
2096
+ "hf_repo": "lighteval/mmlu",
2097
+ "hf_subset": "us_foreign_policy",
2098
+ "metric": [
2099
+ "loglikelihood_acc"
2100
+ ],
2101
+ "hf_avail_splits": [
2102
+ "auxiliary_train",
2103
+ "test",
2104
+ "validation",
2105
+ "dev"
2106
+ ],
2107
+ "evaluation_splits": [
2108
+ "test"
2109
+ ],
2110
+ "few_shots_split": "dev",
2111
+ "few_shots_select": "sequential",
2112
+ "generation_size": 1,
2113
+ "stop_sequence": [
2114
+ "\n"
2115
+ ],
2116
+ "output_regex": null,
2117
+ "frozen": false,
2118
+ "suite": [
2119
+ "leaderboard",
2120
+ "mmlu"
2121
+ ],
2122
+ "original_num_docs": 100,
2123
+ "effective_num_docs": 100,
2124
+ "trust_dataset": true
2125
+ },
2126
+ "leaderboard|mmlu:virology": {
2127
+ "name": "mmlu:virology",
2128
+ "prompt_function": "mmlu_harness",
2129
+ "hf_repo": "lighteval/mmlu",
2130
+ "hf_subset": "virology",
2131
+ "metric": [
2132
+ "loglikelihood_acc"
2133
+ ],
2134
+ "hf_avail_splits": [
2135
+ "auxiliary_train",
2136
+ "test",
2137
+ "validation",
2138
+ "dev"
2139
+ ],
2140
+ "evaluation_splits": [
2141
+ "test"
2142
+ ],
2143
+ "few_shots_split": "dev",
2144
+ "few_shots_select": "sequential",
2145
+ "generation_size": 1,
2146
+ "stop_sequence": [
2147
+ "\n"
2148
+ ],
2149
+ "output_regex": null,
2150
+ "frozen": false,
2151
+ "suite": [
2152
+ "leaderboard",
2153
+ "mmlu"
2154
+ ],
2155
+ "original_num_docs": 166,
2156
+ "effective_num_docs": 166,
2157
+ "trust_dataset": true
2158
+ },
2159
+ "leaderboard|mmlu:world_religions": {
2160
+ "name": "mmlu:world_religions",
2161
+ "prompt_function": "mmlu_harness",
2162
+ "hf_repo": "lighteval/mmlu",
2163
+ "hf_subset": "world_religions",
2164
+ "metric": [
2165
+ "loglikelihood_acc"
2166
+ ],
2167
+ "hf_avail_splits": [
2168
+ "auxiliary_train",
2169
+ "test",
2170
+ "validation",
2171
+ "dev"
2172
+ ],
2173
+ "evaluation_splits": [
2174
+ "test"
2175
+ ],
2176
+ "few_shots_split": "dev",
2177
+ "few_shots_select": "sequential",
2178
+ "generation_size": 1,
2179
+ "stop_sequence": [
2180
+ "\n"
2181
+ ],
2182
+ "output_regex": null,
2183
+ "frozen": false,
2184
+ "suite": [
2185
+ "leaderboard",
2186
+ "mmlu"
2187
+ ],
2188
+ "original_num_docs": 171,
2189
+ "effective_num_docs": 171,
2190
+ "trust_dataset": true
2191
+ }
2192
+ },
2193
+ "summary_tasks": {
2194
+ "leaderboard|mmlu:abstract_algebra|5": {
2195
+ "hashes": {
2196
+ "hash_examples": "4c76229e00c9c0e9",
2197
+ "hash_full_prompts": "a45d01c3409c889c",
2198
+ "hash_input_tokens": "d0571b6ffb835507",
2199
+ "hash_cont_tokens": "00520b0ec06da34f"
2200
+ },
2201
+ "truncated": 0,
2202
+ "non_truncated": 100,
2203
+ "padded": 400,
2204
+ "non_padded": 0,
2205
+ "effective_few_shots": 5.0,
2206
+ "num_truncated_few_shots": 0
2207
+ },
2208
+ "leaderboard|mmlu:anatomy|5": {
2209
+ "hashes": {
2210
+ "hash_examples": "6a1f8104dccbd33b",
2211
+ "hash_full_prompts": "e245c6600e03cc32",
2212
+ "hash_input_tokens": "8dd20ec55e9ad889",
2213
+ "hash_cont_tokens": "263324e6ce7f9b36"
2214
+ },
2215
+ "truncated": 0,
2216
+ "non_truncated": 135,
2217
+ "padded": 540,
2218
+ "non_padded": 0,
2219
+ "effective_few_shots": 5.0,
2220
+ "num_truncated_few_shots": 0
2221
+ },
2222
+ "leaderboard|mmlu:astronomy|5": {
2223
+ "hashes": {
2224
+ "hash_examples": "1302effa3a76ce4c",
2225
+ "hash_full_prompts": "390f9bddf857ad04",
2226
+ "hash_input_tokens": "81e8167c0c820f24",
2227
+ "hash_cont_tokens": "18ba399c6801138e"
2228
+ },
2229
+ "truncated": 0,
2230
+ "non_truncated": 152,
2231
+ "padded": 608,
2232
+ "non_padded": 0,
2233
+ "effective_few_shots": 5.0,
2234
+ "num_truncated_few_shots": 0
2235
+ },
+ "leaderboard|mmlu:business_ethics|5": {
+ "hashes": {
+ "hash_examples": "03cb8bce5336419a",
+ "hash_full_prompts": "5504f893bc4f2fa1",
+ "hash_input_tokens": "668443aa86633b73",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:clinical_knowledge|5": {
+ "hashes": {
+ "hash_examples": "ffbb9c7b2be257f9",
+ "hash_full_prompts": "106ad0bab4b90b78",
+ "hash_input_tokens": "726c176b444e3c55",
+ "hash_cont_tokens": "9d7500060e0dd995"
+ },
+ "truncated": 0,
+ "non_truncated": 265,
+ "padded": 1060,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_biology|5": {
+ "hashes": {
+ "hash_examples": "3ee77f176f38eb8e",
+ "hash_full_prompts": "59f9bdf2695cb226",
+ "hash_input_tokens": "7535ef44daca8b2e",
+ "hash_cont_tokens": "78a731af5d2f6472"
+ },
+ "truncated": 0,
+ "non_truncated": 144,
+ "padded": 576,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_chemistry|5": {
+ "hashes": {
+ "hash_examples": "ce61a69c46d47aeb",
+ "hash_full_prompts": "3cac9b759fcff7a0",
+ "hash_input_tokens": "e98bdaf1fa27ef3b",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_computer_science|5": {
+ "hashes": {
+ "hash_examples": "32805b52d7d5daab",
+ "hash_full_prompts": "010b0cca35070130",
+ "hash_input_tokens": "40494a193cf906d1",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_mathematics|5": {
+ "hashes": {
+ "hash_examples": "55da1a0a0bd33722",
+ "hash_full_prompts": "511422eb9eefc773",
+ "hash_input_tokens": "2f512892d24b0086",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_medicine|5": {
+ "hashes": {
+ "hash_examples": "c33e143163049176",
+ "hash_full_prompts": "c8cc1a82a51a046e",
+ "hash_input_tokens": "41ba4385551feaf3",
+ "hash_cont_tokens": "699c8eb24e3e446b"
+ },
+ "truncated": 0,
+ "non_truncated": 173,
+ "padded": 692,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_physics|5": {
+ "hashes": {
+ "hash_examples": "ebdab1cdb7e555df",
+ "hash_full_prompts": "e40721b5059c5818",
+ "hash_input_tokens": "1f357d859f4e78c2",
+ "hash_cont_tokens": "075997110cbe055e"
+ },
+ "truncated": 0,
+ "non_truncated": 102,
+ "padded": 408,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:computer_security|5": {
+ "hashes": {
+ "hash_examples": "a24fd7d08a560921",
+ "hash_full_prompts": "946c9be5964ac44a",
+ "hash_input_tokens": "def9fb5a2fab003a",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:conceptual_physics|5": {
+ "hashes": {
+ "hash_examples": "8300977a79386993",
+ "hash_full_prompts": "506a4f6094cc40c9",
+ "hash_input_tokens": "b398cceaff8512f7",
+ "hash_cont_tokens": "f22daa6d4818086f"
+ },
+ "truncated": 0,
+ "non_truncated": 235,
+ "padded": 940,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:econometrics|5": {
+ "hashes": {
+ "hash_examples": "ddde36788a04a46f",
+ "hash_full_prompts": "4ed2703f27f1ed05",
+ "hash_input_tokens": "cf227ca8af4bc815",
+ "hash_cont_tokens": "26791a0b1941b4c4"
+ },
+ "truncated": 0,
+ "non_truncated": 114,
+ "padded": 456,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:electrical_engineering|5": {
+ "hashes": {
+ "hash_examples": "acbc5def98c19b3f",
+ "hash_full_prompts": "d8f4b3e11c23653c",
+ "hash_input_tokens": "295e278cbce7ed04",
+ "hash_cont_tokens": "3e336577994f6c0d"
+ },
+ "truncated": 0,
+ "non_truncated": 145,
+ "padded": 580,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:elementary_mathematics|5": {
+ "hashes": {
+ "hash_examples": "146e61d07497a9bd",
+ "hash_full_prompts": "256d111bd15647ff",
+ "hash_input_tokens": "2474a420d7b931ff",
+ "hash_cont_tokens": "1d6bbfa8a67327c8"
+ },
+ "truncated": 0,
+ "non_truncated": 378,
+ "padded": 1512,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:formal_logic|5": {
+ "hashes": {
+ "hash_examples": "8635216e1909a03f",
+ "hash_full_prompts": "1171d04f3b1a11f5",
+ "hash_input_tokens": "f269941d7dabea05",
+ "hash_cont_tokens": "60508d85eb7693a4"
+ },
+ "truncated": 0,
+ "non_truncated": 126,
+ "padded": 504,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:global_facts|5": {
+ "hashes": {
+ "hash_examples": "30b315aa6353ee47",
+ "hash_full_prompts": "a7e56dbc074c7529",
+ "hash_input_tokens": "2036a912407797e6",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_biology|5": {
+ "hashes": {
+ "hash_examples": "c9136373af2180de",
+ "hash_full_prompts": "ad6e859ed978e04a",
+ "hash_input_tokens": "1bc8ad087ca8f65b",
+ "hash_cont_tokens": "d236ce982144e65f"
+ },
+ "truncated": 0,
+ "non_truncated": 310,
+ "padded": 1240,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_chemistry|5": {
+ "hashes": {
+ "hash_examples": "b0661bfa1add6404",
+ "hash_full_prompts": "6eb9c04bcc8a8f2a",
+ "hash_input_tokens": "ead708921e3a1c93",
+ "hash_cont_tokens": "59f93238ec5aead6"
+ },
+ "truncated": 0,
+ "non_truncated": 203,
+ "padded": 812,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_computer_science|5": {
+ "hashes": {
+ "hash_examples": "80fc1d623a3d665f",
+ "hash_full_prompts": "8e51bc91c81cf8dd",
+ "hash_input_tokens": "604f88a2f17d5159",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_european_history|5": {
+ "hashes": {
+ "hash_examples": "854da6e5af0fe1a1",
+ "hash_full_prompts": "664a1f16c9f3195c",
+ "hash_input_tokens": "1dfe455312f2e6cf",
+ "hash_cont_tokens": "7b7414d6a5da3d91"
+ },
+ "truncated": 0,
+ "non_truncated": 165,
+ "padded": 656,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_geography|5": {
+ "hashes": {
+ "hash_examples": "7dc963c7acd19ad8",
+ "hash_full_prompts": "f3acf911f4023c8a",
+ "hash_input_tokens": "1985ba6f69f57d66",
+ "hash_cont_tokens": "1b66289e10988f84"
+ },
+ "truncated": 0,
+ "non_truncated": 198,
+ "padded": 792,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_government_and_politics|5": {
+ "hashes": {
+ "hash_examples": "1f675dcdebc9758f",
+ "hash_full_prompts": "066254feaa3158ae",
+ "hash_input_tokens": "e6960d7d906ffb15",
+ "hash_cont_tokens": "5ab3c3415b1d3a55"
+ },
+ "truncated": 0,
+ "non_truncated": 193,
+ "padded": 772,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_macroeconomics|5": {
+ "hashes": {
+ "hash_examples": "2fb32cf2d80f0b35",
+ "hash_full_prompts": "19a7fa502aa85c95",
+ "hash_input_tokens": "4ea59b7b8c4856d2",
+ "hash_cont_tokens": "2f5457058d187374"
+ },
+ "truncated": 0,
+ "non_truncated": 390,
+ "padded": 1557,
+ "non_padded": 3,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_mathematics|5": {
+ "hashes": {
+ "hash_examples": "fd6646fdb5d58a1f",
+ "hash_full_prompts": "4f704e369778b5b0",
+ "hash_input_tokens": "7d39279726411bb3",
+ "hash_cont_tokens": "e35137cb972e1918"
+ },
+ "truncated": 0,
+ "non_truncated": 270,
+ "padded": 1080,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_microeconomics|5": {
+ "hashes": {
+ "hash_examples": "2118f21f71d87d84",
+ "hash_full_prompts": "4350f9e2240f8010",
+ "hash_input_tokens": "2be919ac2e73f3d1",
+ "hash_cont_tokens": "f756093278ebb83e"
+ },
+ "truncated": 0,
+ "non_truncated": 238,
+ "padded": 908,
+ "non_padded": 44,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_physics|5": {
+ "hashes": {
+ "hash_examples": "dc3ce06378548565",
+ "hash_full_prompts": "5dc0d6831b66188f",
+ "hash_input_tokens": "9b2e07d3183ade24",
+ "hash_cont_tokens": "9cf883ebf1c82176"
+ },
+ "truncated": 0,
+ "non_truncated": 151,
+ "padded": 604,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_psychology|5": {
+ "hashes": {
+ "hash_examples": "c8d1d98a40e11f2f",
+ "hash_full_prompts": "af2b097da6d50365",
+ "hash_input_tokens": "a0f7b561c0177eb7",
+ "hash_cont_tokens": "bda0f77331ebb21a"
+ },
+ "truncated": 0,
+ "non_truncated": 545,
+ "padded": 2178,
+ "non_padded": 2,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_statistics|5": {
+ "hashes": {
+ "hash_examples": "666c8759b98ee4ff",
+ "hash_full_prompts": "c757694421d6d68d",
+ "hash_input_tokens": "0e353fc06f61e59b",
+ "hash_cont_tokens": "4d04f014105a0bad"
+ },
+ "truncated": 0,
+ "non_truncated": 216,
+ "padded": 864,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_us_history|5": {
+ "hashes": {
+ "hash_examples": "95fef1c4b7d3f81e",
+ "hash_full_prompts": "e34a028d0ddeec5e",
+ "hash_input_tokens": "7c7f37778e6ccda2",
+ "hash_cont_tokens": "f4590c58f12f2766"
+ },
+ "truncated": 0,
+ "non_truncated": 204,
+ "padded": 816,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_world_history|5": {
+ "hashes": {
+ "hash_examples": "7e5085b6184b0322",
+ "hash_full_prompts": "1fa3d51392765601",
+ "hash_input_tokens": "71993d416140265b",
+ "hash_cont_tokens": "db6bcddd891df5d9"
+ },
+ "truncated": 0,
+ "non_truncated": 237,
+ "padded": 948,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:human_aging|5": {
+ "hashes": {
+ "hash_examples": "c17333e7c7c10797",
+ "hash_full_prompts": "cac900721f9a1a94",
+ "hash_input_tokens": "b0fa52119d4303e9",
+ "hash_cont_tokens": "25cec8d640319105"
+ },
+ "truncated": 0,
+ "non_truncated": 223,
+ "padded": 892,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:human_sexuality|5": {
+ "hashes": {
+ "hash_examples": "4edd1e9045df5e3d",
+ "hash_full_prompts": "0d6567bafee0a13c",
+ "hash_input_tokens": "879018ae27bdf5b0",
+ "hash_cont_tokens": "6778302b4a10b645"
+ },
+ "truncated": 0,
+ "non_truncated": 131,
+ "padded": 524,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:international_law|5": {
+ "hashes": {
+ "hash_examples": "db2fa00d771a062a",
+ "hash_full_prompts": "d018f9116479795e",
+ "hash_input_tokens": "be4409fc3ab936f3",
+ "hash_cont_tokens": "9eb54e1a46032749"
+ },
+ "truncated": 0,
+ "non_truncated": 121,
+ "padded": 484,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:jurisprudence|5": {
+ "hashes": {
+ "hash_examples": "e956f86b124076fe",
+ "hash_full_prompts": "1487e89a10ec58b7",
+ "hash_input_tokens": "888c2eab4655e553",
+ "hash_cont_tokens": "f17d9a372cfd66b1"
+ },
+ "truncated": 0,
+ "non_truncated": 108,
+ "padded": 420,
+ "non_padded": 12,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:logical_fallacies|5": {
+ "hashes": {
+ "hash_examples": "956e0e6365ab79f1",
+ "hash_full_prompts": "677785b2181f9243",
+ "hash_input_tokens": "8cee26c610ab13a1",
+ "hash_cont_tokens": "cf44a68f5bca9a96"
+ },
+ "truncated": 0,
+ "non_truncated": 163,
+ "padded": 648,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:machine_learning|5": {
+ "hashes": {
+ "hash_examples": "397997cc6f4d581e",
+ "hash_full_prompts": "769ee14a2aea49bb",
+ "hash_input_tokens": "1d8a213f41f96aee",
+ "hash_cont_tokens": "eace00d420f4f32c"
+ },
+ "truncated": 0,
+ "non_truncated": 112,
+ "padded": 448,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:management|5": {
+ "hashes": {
+ "hash_examples": "2bcbe6f6ca63d740",
+ "hash_full_prompts": "cb1ff9dac9582144",
+ "hash_input_tokens": "44ba435973dce9d1",
+ "hash_cont_tokens": "b7c51d0250c252d8"
+ },
+ "truncated": 0,
+ "non_truncated": 103,
+ "padded": 412,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:marketing|5": {
+ "hashes": {
+ "hash_examples": "8ddb20d964a1b065",
+ "hash_full_prompts": "9fc2114a187ad9a2",
+ "hash_input_tokens": "e86c7f7e4f27bcb7",
+ "hash_cont_tokens": "086fb63f8b1d1339"
+ },
+ "truncated": 0,
+ "non_truncated": 234,
+ "padded": 924,
+ "non_padded": 12,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:medical_genetics|5": {
+ "hashes": {
+ "hash_examples": "182a71f4763d2cea",
+ "hash_full_prompts": "46a616fa51878959",
+ "hash_input_tokens": "84615035f844ffa0",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:miscellaneous|5": {
+ "hashes": {
+ "hash_examples": "4c404fdbb4ca57fc",
+ "hash_full_prompts": "0813e1be36dbaae1",
+ "hash_input_tokens": "f816152d0e727938",
+ "hash_cont_tokens": "1827274fa6537077"
+ },
+ "truncated": 0,
+ "non_truncated": 783,
+ "padded": 3132,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:moral_disputes|5": {
+ "hashes": {
+ "hash_examples": "60cbd2baa3fea5c9",
+ "hash_full_prompts": "1d14adebb9b62519",
+ "hash_input_tokens": "53082748f1b5e440",
+ "hash_cont_tokens": "472c223f6f28cfc7"
+ },
+ "truncated": 0,
+ "non_truncated": 346,
+ "padded": 1384,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:moral_scenarios|5": {
+ "hashes": {
+ "hash_examples": "fd8b0431fbdd75ef",
+ "hash_full_prompts": "b80d3d236165e3de",
+ "hash_input_tokens": "b5318303d9c36325",
+ "hash_cont_tokens": "e90dade00a092f9e"
+ },
+ "truncated": 0,
+ "non_truncated": 895,
+ "padded": 3567,
+ "non_padded": 13,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:nutrition|5": {
+ "hashes": {
+ "hash_examples": "71e55e2b829b6528",
+ "hash_full_prompts": "2bfb18e5fab8dea7",
+ "hash_input_tokens": "2ed8503c57d6afbf",
+ "hash_cont_tokens": "128e0ec97d96b165"
+ },
+ "truncated": 0,
+ "non_truncated": 306,
+ "padded": 1224,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:philosophy|5": {
+ "hashes": {
+ "hash_examples": "a6d489a8d208fa4b",
+ "hash_full_prompts": "e8c0d5b6dae3ccc8",
+ "hash_input_tokens": "7e8ad59a08a00f3b",
+ "hash_cont_tokens": "cbfd7829a3e0f082"
+ },
+ "truncated": 0,
+ "non_truncated": 311,
+ "padded": 1244,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:prehistory|5": {
+ "hashes": {
+ "hash_examples": "6cc50f032a19acaa",
+ "hash_full_prompts": "4a6a1d3ab1bf28e4",
+ "hash_input_tokens": "8bba5be57a92c467",
+ "hash_cont_tokens": "9c0cf5a2f71afa7e"
+ },
+ "truncated": 0,
+ "non_truncated": 324,
+ "padded": 1284,
+ "non_padded": 12,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_accounting|5": {
+ "hashes": {
+ "hash_examples": "50f57ab32f5f6cea",
+ "hash_full_prompts": "e60129bd2d82ffc6",
+ "hash_input_tokens": "236927cb4e27f724",
+ "hash_cont_tokens": "50f011c2453517ee"
+ },
+ "truncated": 0,
+ "non_truncated": 282,
+ "padded": 1128,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_law|5": {
+ "hashes": {
+ "hash_examples": "a8fdc85c64f4b215",
+ "hash_full_prompts": "0dbb1d9b72dcea03",
+ "hash_input_tokens": "7958ac5eb01fed27",
+ "hash_cont_tokens": "73527e852c24186c"
+ },
+ "truncated": 0,
+ "non_truncated": 1534,
+ "padded": 6136,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_medicine|5": {
+ "hashes": {
+ "hash_examples": "c373a28a3050a73a",
+ "hash_full_prompts": "5e040f9ca68b089e",
+ "hash_input_tokens": "f520600f7896a87b",
+ "hash_cont_tokens": "ceb7af5e2e789abc"
+ },
+ "truncated": 0,
+ "non_truncated": 272,
+ "padded": 1088,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_psychology|5": {
+ "hashes": {
+ "hash_examples": "bf5254fe818356af",
+ "hash_full_prompts": "b386ecda8b87150e",
+ "hash_input_tokens": "fb3f225a047d0f0f",
+ "hash_cont_tokens": "8cfdced8a9667380"
+ },
+ "truncated": 0,
+ "non_truncated": 612,
+ "padded": 2428,
+ "non_padded": 20,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:public_relations|5": {
+ "hashes": {
+ "hash_examples": "b66d52e28e7d14e0",
+ "hash_full_prompts": "fe43562263e25677",
+ "hash_input_tokens": "9dfb929ef5e3362b",
+ "hash_cont_tokens": "f8327461a9cc5123"
+ },
+ "truncated": 0,
+ "non_truncated": 110,
+ "padded": 436,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:security_studies|5": {
+ "hashes": {
+ "hash_examples": "514c14feaf000ad9",
+ "hash_full_prompts": "27d4a2ac541ef4b9",
+ "hash_input_tokens": "f620744b07919b24",
+ "hash_cont_tokens": "c30b0c4d52c2875d"
+ },
+ "truncated": 0,
+ "non_truncated": 245,
+ "padded": 980,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:sociology|5": {
+ "hashes": {
+ "hash_examples": "f6c9bc9d18c80870",
+ "hash_full_prompts": "c072ea7d1a1524f2",
+ "hash_input_tokens": "76d03f98f30dbe11",
+ "hash_cont_tokens": "eef4bd16d536fbd6"
+ },
+ "truncated": 0,
+ "non_truncated": 201,
+ "padded": 804,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:us_foreign_policy|5": {
+ "hashes": {
+ "hash_examples": "ed7b78629db6678f",
+ "hash_full_prompts": "341a97ca3e4d699d",
+ "hash_input_tokens": "f0b4b93f91f3d7f4",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:virology|5": {
+ "hashes": {
+ "hash_examples": "bc52ffdc3f9b994a",
+ "hash_full_prompts": "651d471e2eb8b5e9",
+ "hash_input_tokens": "1c7d23a204c7cbf6",
+ "hash_cont_tokens": "f5fc195e049353c0"
+ },
+ "truncated": 0,
+ "non_truncated": 166,
+ "padded": 664,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:world_religions|5": {
+ "hashes": {
+ "hash_examples": "ecdb4a4f94f62930",
+ "hash_full_prompts": "3773f03542ce44a3",
+ "hash_input_tokens": "be42fd2c9cc2da08",
+ "hash_cont_tokens": "ada548665e87b1e0"
+ },
+ "truncated": 0,
+ "non_truncated": 171,
+ "padded": 684,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ }
+ },
+ "summary_general": {
+ "hashes": {
+ "hash_examples": "341a076d0beb7048",
+ "hash_full_prompts": "a5c8f2b7ff4f5ae2",
+ "hash_input_tokens": "917c40aba1546e12",
+ "hash_cont_tokens": "3672212ca582e2d0"
+ },
+ "truncated": 0,
+ "non_truncated": 14042,
+ "padded": 56038,
+ "non_padded": 130,
+ "num_truncated_few_shots": 0
+ }
+ }