Commit b9d64e1 (verified) · committed by lewtun (HF Staff) · 1 parent: 29394c9

Upload eval_results/Qwen/Qwen1.5-0.5B-Chat/main/mmlu/results_2024-03-18T17-01-47.545823.json with huggingface_hub
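
The `acc_stderr` values in this results file appear consistent with the standard error of a mean of 0/1 correctness scores, sqrt(p(1-p)/(n-1)), where p is the task's `acc` and n is its `effective_num_docs`. The sketch below checks that reading against one task and splits a result key into its parts. The helper names `binary_stderr` and `split_result_key` are hypothetical, not part of lighteval, and treating the trailing `5` in a key as the few-shot count is an assumption.

```python
import math


def binary_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores whose mean is acc.

    Uses the sample variance (ddof=1), which gives sqrt(p * (1 - p) / (n - 1)).
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def split_result_key(key: str) -> tuple[str, str, int]:
    """Split 'suite|task|shots' into its parts (assumed key layout)."""
    suite, task, shots = key.split("|")
    return suite, task, int(shots)


# abstract_algebra: acc = 0.26 over 100 test documents.
print(binary_stderr(0.26, 100))
print(split_result_key("leaderboard|mmlu:abstract_algebra|5"))
```

The same check can be repeated for any other task by pairing its `acc` from `results` with its `effective_num_docs` from `config_tasks`.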

eval_results/Qwen/Qwen1.5-0.5B-Chat/main/mmlu/results_2024-03-18T17-01-47.545823.json ADDED
@@ -0,0 +1,3063 @@
+ {
+ "config_general": {
+ "lighteval_sha": "?",
+ "num_fewshot_seeds": 1,
+ "override_batch_size": 1,
+ "max_samples": null,
+ "job_id": "",
+ "start_time": 466417.344750626,
+ "end_time": 467196.667835617,
+ "total_evaluation_time_secondes": "779.3230849909596",
+ "model_name": "Qwen/Qwen1.5-0.5B-Chat",
+ "model_sha": "6c705984bb8b5591dd4e1a9e66e1a127965fd08d",
+ "model_dtype": "torch.bfloat16",
+ "model_size": "1.05 GB",
+ "config": null
+ },
+ "results": {
+ "leaderboard|mmlu:abstract_algebra|5": {
+ "acc": 0.26,
+ "acc_stderr": 0.0440844002276808
+ },
+ "leaderboard|mmlu:anatomy|5": {
+ "acc": 0.31851851851851853,
+ "acc_stderr": 0.04024778401977111
+ },
+ "leaderboard|mmlu:astronomy|5": {
+ "acc": 0.34210526315789475,
+ "acc_stderr": 0.038607315993160925
+ },
+ "leaderboard|mmlu:business_ethics|5": {
+ "acc": 0.5,
+ "acc_stderr": 0.050251890762960605
+ },
+ "leaderboard|mmlu:clinical_knowledge|5": {
+ "acc": 0.3622641509433962,
+ "acc_stderr": 0.0295822451283843
+ },
+ "leaderboard|mmlu:college_biology|5": {
+ "acc": 0.3194444444444444,
+ "acc_stderr": 0.038990736873573344
+ },
+ "leaderboard|mmlu:college_chemistry|5": {
+ "acc": 0.36,
+ "acc_stderr": 0.048241815132442176
+ },
+ "leaderboard|mmlu:college_computer_science|5": {
+ "acc": 0.37,
+ "acc_stderr": 0.048523658709391
+ },
+ "leaderboard|mmlu:college_mathematics|5": {
+ "acc": 0.33,
+ "acc_stderr": 0.047258156262526045
+ },
+ "leaderboard|mmlu:college_medicine|5": {
+ "acc": 0.3468208092485549,
+ "acc_stderr": 0.036291466701596636
+ },
+ "leaderboard|mmlu:college_physics|5": {
+ "acc": 0.29411764705882354,
+ "acc_stderr": 0.04533838195929775
+ },
+ "leaderboard|mmlu:computer_security|5": {
+ "acc": 0.41,
+ "acc_stderr": 0.049431107042371025
+ },
+ "leaderboard|mmlu:conceptual_physics|5": {
+ "acc": 0.2680851063829787,
+ "acc_stderr": 0.028957342788342347
+ },
+ "leaderboard|mmlu:econometrics|5": {
+ "acc": 0.2894736842105263,
+ "acc_stderr": 0.04266339443159393
+ },
+ "leaderboard|mmlu:electrical_engineering|5": {
+ "acc": 0.4,
+ "acc_stderr": 0.04082482904638628
+ },
+ "leaderboard|mmlu:elementary_mathematics|5": {
+ "acc": 0.2830687830687831,
+ "acc_stderr": 0.023201392938194978
+ },
+ "leaderboard|mmlu:formal_logic|5": {
+ "acc": 0.25396825396825395,
+ "acc_stderr": 0.03893259610604671
+ },
+ "leaderboard|mmlu:global_facts|5": {
+ "acc": 0.31,
+ "acc_stderr": 0.04648231987117316
+ },
+ "leaderboard|mmlu:high_school_biology|5": {
+ "acc": 0.3870967741935484,
+ "acc_stderr": 0.02770935967503249
+ },
+ "leaderboard|mmlu:high_school_chemistry|5": {
+ "acc": 0.3103448275862069,
+ "acc_stderr": 0.03255086769970103
+ },
+ "leaderboard|mmlu:high_school_computer_science|5": {
+ "acc": 0.33,
+ "acc_stderr": 0.04725815626252605
+ },
+ "leaderboard|mmlu:high_school_european_history|5": {
+ "acc": 0.48484848484848486,
+ "acc_stderr": 0.03902551007374449
+ },
+ "leaderboard|mmlu:high_school_geography|5": {
+ "acc": 0.5050505050505051,
+ "acc_stderr": 0.035621707606254015
+ },
+ "leaderboard|mmlu:high_school_government_and_politics|5": {
+ "acc": 0.41968911917098445,
+ "acc_stderr": 0.035615873276858834
+ },
+ "leaderboard|mmlu:high_school_macroeconomics|5": {
+ "acc": 0.3717948717948718,
+ "acc_stderr": 0.02450347255711094
+ },
+ "leaderboard|mmlu:high_school_mathematics|5": {
+ "acc": 0.25925925925925924,
+ "acc_stderr": 0.026719240783712166
+ },
+ "leaderboard|mmlu:high_school_microeconomics|5": {
+ "acc": 0.3739495798319328,
+ "acc_stderr": 0.031429466378837076
+ },
+ "leaderboard|mmlu:high_school_physics|5": {
+ "acc": 0.2847682119205298,
+ "acc_stderr": 0.03684881521389023
+ },
+ "leaderboard|mmlu:high_school_psychology|5": {
+ "acc": 0.3889908256880734,
+ "acc_stderr": 0.020902300887392866
+ },
+ "leaderboard|mmlu:high_school_statistics|5": {
+ "acc": 0.3194444444444444,
+ "acc_stderr": 0.0317987634217685
+ },
+ "leaderboard|mmlu:high_school_us_history|5": {
+ "acc": 0.4411764705882353,
+ "acc_stderr": 0.034849415144292316
+ },
+ "leaderboard|mmlu:high_school_world_history|5": {
+ "acc": 0.5021097046413502,
+ "acc_stderr": 0.032546938018020076
+ },
+ "leaderboard|mmlu:human_aging|5": {
+ "acc": 0.3273542600896861,
+ "acc_stderr": 0.03149384670994131
+ },
+ "leaderboard|mmlu:human_sexuality|5": {
+ "acc": 0.37404580152671757,
+ "acc_stderr": 0.042438692422305246
+ },
+ "leaderboard|mmlu:international_law|5": {
+ "acc": 0.5289256198347108,
+ "acc_stderr": 0.04556710331269498
+ },
+ "leaderboard|mmlu:jurisprudence|5": {
+ "acc": 0.4074074074074074,
+ "acc_stderr": 0.04750077341199985
+ },
+ "leaderboard|mmlu:logical_fallacies|5": {
+ "acc": 0.34355828220858897,
+ "acc_stderr": 0.03731133519673893
+ },
+ "leaderboard|mmlu:machine_learning|5": {
+ "acc": 0.2767857142857143,
+ "acc_stderr": 0.042466243366976256
+ },
+ "leaderboard|mmlu:management|5": {
+ "acc": 0.4563106796116505,
+ "acc_stderr": 0.04931801994220414
+ },
+ "leaderboard|mmlu:marketing|5": {
+ "acc": 0.5,
+ "acc_stderr": 0.03275608910402091
+ },
+ "leaderboard|mmlu:medical_genetics|5": {
+ "acc": 0.37,
+ "acc_stderr": 0.04852365870939098
+ },
+ "leaderboard|mmlu:miscellaneous|5": {
+ "acc": 0.38697318007662834,
+ "acc_stderr": 0.017417138059440132
+ },
+ "leaderboard|mmlu:moral_disputes|5": {
+ "acc": 0.43641618497109824,
+ "acc_stderr": 0.026700545424943687
+ },
+ "leaderboard|mmlu:moral_scenarios|5": {
+ "acc": 0.24134078212290502,
+ "acc_stderr": 0.014310999547961441
+ },
+ "leaderboard|mmlu:nutrition|5": {
+ "acc": 0.42483660130718953,
+ "acc_stderr": 0.028304576673141124
+ },
+ "leaderboard|mmlu:philosophy|5": {
+ "acc": 0.3987138263665595,
+ "acc_stderr": 0.0278093225857745
+ },
+ "leaderboard|mmlu:prehistory|5": {
+ "acc": 0.3734567901234568,
+ "acc_stderr": 0.026915003011380157
+ },
+ "leaderboard|mmlu:professional_accounting|5": {
+ "acc": 0.2907801418439716,
+ "acc_stderr": 0.027090664368353178
+ },
+ "leaderboard|mmlu:professional_law|5": {
+ "acc": 0.33116036505867014,
+ "acc_stderr": 0.01202012819598575
+ },
+ "leaderboard|mmlu:professional_medicine|5": {
+ "acc": 0.3860294117647059,
+ "acc_stderr": 0.029573269134411124
+ },
+ "leaderboard|mmlu:professional_psychology|5": {
+ "acc": 0.3562091503267974,
+ "acc_stderr": 0.019373332420724493
+ },
+ "leaderboard|mmlu:public_relations|5": {
+ "acc": 0.42727272727272725,
+ "acc_stderr": 0.04738198703545483
+ },
+ "leaderboard|mmlu:security_studies|5": {
+ "acc": 0.2693877551020408,
+ "acc_stderr": 0.02840125202902294
+ },
+ "leaderboard|mmlu:sociology|5": {
+ "acc": 0.5174129353233831,
+ "acc_stderr": 0.03533389234739245
+ },
+ "leaderboard|mmlu:us_foreign_policy|5": {
+ "acc": 0.54,
+ "acc_stderr": 0.05009082659620333
+ },
+ "leaderboard|mmlu:virology|5": {
+ "acc": 0.3433734939759036,
+ "acc_stderr": 0.036965843170106004
+ },
+ "leaderboard|mmlu:world_religions|5": {
+ "acc": 0.2807017543859649,
+ "acc_stderr": 0.034462962170884265
+ },
+ "leaderboard|mmlu:_average|5": {
+ "acc": 0.3681551334211768,
+ "acc_stderr": 0.035698565367394484
+ }
+ },
+ "versions": {
+ "leaderboard|mmlu:abstract_algebra|5": 0,
+ "leaderboard|mmlu:anatomy|5": 0,
+ "leaderboard|mmlu:astronomy|5": 0,
+ "leaderboard|mmlu:business_ethics|5": 0,
+ "leaderboard|mmlu:clinical_knowledge|5": 0,
+ "leaderboard|mmlu:college_biology|5": 0,
+ "leaderboard|mmlu:college_chemistry|5": 0,
+ "leaderboard|mmlu:college_computer_science|5": 0,
+ "leaderboard|mmlu:college_mathematics|5": 0,
+ "leaderboard|mmlu:college_medicine|5": 0,
+ "leaderboard|mmlu:college_physics|5": 0,
+ "leaderboard|mmlu:computer_security|5": 0,
+ "leaderboard|mmlu:conceptual_physics|5": 0,
+ "leaderboard|mmlu:econometrics|5": 0,
+ "leaderboard|mmlu:electrical_engineering|5": 0,
+ "leaderboard|mmlu:elementary_mathematics|5": 0,
+ "leaderboard|mmlu:formal_logic|5": 0,
+ "leaderboard|mmlu:global_facts|5": 0,
+ "leaderboard|mmlu:high_school_biology|5": 0,
+ "leaderboard|mmlu:high_school_chemistry|5": 0,
+ "leaderboard|mmlu:high_school_computer_science|5": 0,
+ "leaderboard|mmlu:high_school_european_history|5": 0,
+ "leaderboard|mmlu:high_school_geography|5": 0,
+ "leaderboard|mmlu:high_school_government_and_politics|5": 0,
+ "leaderboard|mmlu:high_school_macroeconomics|5": 0,
+ "leaderboard|mmlu:high_school_mathematics|5": 0,
+ "leaderboard|mmlu:high_school_microeconomics|5": 0,
+ "leaderboard|mmlu:high_school_physics|5": 0,
+ "leaderboard|mmlu:high_school_psychology|5": 0,
+ "leaderboard|mmlu:high_school_statistics|5": 0,
+ "leaderboard|mmlu:high_school_us_history|5": 0,
+ "leaderboard|mmlu:high_school_world_history|5": 0,
+ "leaderboard|mmlu:human_aging|5": 0,
+ "leaderboard|mmlu:human_sexuality|5": 0,
+ "leaderboard|mmlu:international_law|5": 0,
+ "leaderboard|mmlu:jurisprudence|5": 0,
+ "leaderboard|mmlu:logical_fallacies|5": 0,
+ "leaderboard|mmlu:machine_learning|5": 0,
+ "leaderboard|mmlu:management|5": 0,
+ "leaderboard|mmlu:marketing|5": 0,
+ "leaderboard|mmlu:medical_genetics|5": 0,
+ "leaderboard|mmlu:miscellaneous|5": 0,
+ "leaderboard|mmlu:moral_disputes|5": 0,
+ "leaderboard|mmlu:moral_scenarios|5": 0,
+ "leaderboard|mmlu:nutrition|5": 0,
+ "leaderboard|mmlu:philosophy|5": 0,
+ "leaderboard|mmlu:prehistory|5": 0,
+ "leaderboard|mmlu:professional_accounting|5": 0,
+ "leaderboard|mmlu:professional_law|5": 0,
+ "leaderboard|mmlu:professional_medicine|5": 0,
+ "leaderboard|mmlu:professional_psychology|5": 0,
+ "leaderboard|mmlu:public_relations|5": 0,
+ "leaderboard|mmlu:security_studies|5": 0,
+ "leaderboard|mmlu:sociology|5": 0,
+ "leaderboard|mmlu:us_foreign_policy|5": 0,
+ "leaderboard|mmlu:virology|5": 0,
+ "leaderboard|mmlu:world_religions|5": 0
+ },
+ "config_tasks": {
+ "leaderboard|mmlu:abstract_algebra": {
+ "name": "mmlu:abstract_algebra",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "abstract_algebra",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:anatomy": {
+ "name": "mmlu:anatomy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "anatomy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 135,
+ "effective_num_docs": 135,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:astronomy": {
+ "name": "mmlu:astronomy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "astronomy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 152,
+ "effective_num_docs": 152,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:business_ethics": {
+ "name": "mmlu:business_ethics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "business_ethics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:clinical_knowledge": {
+ "name": "mmlu:clinical_knowledge",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "clinical_knowledge",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 265,
+ "effective_num_docs": 265,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_biology": {
+ "name": "mmlu:college_biology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_biology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 144,
+ "effective_num_docs": 144,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_chemistry": {
+ "name": "mmlu:college_chemistry",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_chemistry",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_computer_science": {
+ "name": "mmlu:college_computer_science",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_computer_science",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_mathematics": {
+ "name": "mmlu:college_mathematics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_mathematics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_medicine": {
+ "name": "mmlu:college_medicine",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_medicine",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 173,
+ "effective_num_docs": 173,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_physics": {
+ "name": "mmlu:college_physics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_physics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 102,
+ "effective_num_docs": 102,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:computer_security": {
+ "name": "mmlu:computer_security",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "computer_security",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:conceptual_physics": {
+ "name": "mmlu:conceptual_physics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "conceptual_physics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 235,
+ "effective_num_docs": 235,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:econometrics": {
+ "name": "mmlu:econometrics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "econometrics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 114,
+ "effective_num_docs": 114,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:electrical_engineering": {
+ "name": "mmlu:electrical_engineering",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "electrical_engineering",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 145,
+ "effective_num_docs": 145,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:elementary_mathematics": {
+ "name": "mmlu:elementary_mathematics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "elementary_mathematics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 378,
+ "effective_num_docs": 378,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:formal_logic": {
+ "name": "mmlu:formal_logic",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "formal_logic",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 126,
+ "effective_num_docs": 126,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:global_facts": {
+ "name": "mmlu:global_facts",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "global_facts",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_biology": {
+ "name": "mmlu:high_school_biology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_biology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 310,
+ "effective_num_docs": 310,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_chemistry": {
+ "name": "mmlu:high_school_chemistry",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_chemistry",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 203,
+ "effective_num_docs": 203,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_computer_science": {
+ "name": "mmlu:high_school_computer_science",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_computer_science",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
1016
+ "suite": [
1017
+ "leaderboard",
1018
+ "mmlu"
1019
+ ],
1020
+ "original_num_docs": 100,
1021
+ "effective_num_docs": 100,
1022
+ "trust_dataset": true,
1023
+ "must_remove_duplicate_docs": null
1024
+ },
1025
+ "leaderboard|mmlu:high_school_european_history": {
1026
+ "name": "mmlu:high_school_european_history",
1027
+ "prompt_function": "mmlu_harness",
1028
+ "hf_repo": "lighteval/mmlu",
1029
+ "hf_subset": "high_school_european_history",
1030
+ "metric": [
1031
+ "loglikelihood_acc"
1032
+ ],
1033
+ "hf_avail_splits": [
1034
+ "auxiliary_train",
1035
+ "test",
1036
+ "validation",
1037
+ "dev"
1038
+ ],
1039
+ "evaluation_splits": [
1040
+ "test"
1041
+ ],
1042
+ "few_shots_split": "dev",
1043
+ "few_shots_select": "sequential",
1044
+ "generation_size": 1,
1045
+ "stop_sequence": [
1046
+ "\n"
1047
+ ],
1048
+ "output_regex": null,
1049
+ "frozen": false,
1050
+ "suite": [
1051
+ "leaderboard",
1052
+ "mmlu"
1053
+ ],
1054
+ "original_num_docs": 165,
1055
+ "effective_num_docs": 165,
1056
+ "trust_dataset": true,
1057
+ "must_remove_duplicate_docs": null
1058
+ },
1059
+ "leaderboard|mmlu:high_school_geography": {
1060
+ "name": "mmlu:high_school_geography",
1061
+ "prompt_function": "mmlu_harness",
1062
+ "hf_repo": "lighteval/mmlu",
1063
+ "hf_subset": "high_school_geography",
1064
+ "metric": [
1065
+ "loglikelihood_acc"
1066
+ ],
1067
+ "hf_avail_splits": [
1068
+ "auxiliary_train",
1069
+ "test",
1070
+ "validation",
1071
+ "dev"
1072
+ ],
1073
+ "evaluation_splits": [
1074
+ "test"
1075
+ ],
1076
+ "few_shots_split": "dev",
1077
+ "few_shots_select": "sequential",
1078
+ "generation_size": 1,
1079
+ "stop_sequence": [
1080
+ "\n"
1081
+ ],
1082
+ "output_regex": null,
1083
+ "frozen": false,
1084
+ "suite": [
1085
+ "leaderboard",
1086
+ "mmlu"
1087
+ ],
1088
+ "original_num_docs": 198,
1089
+ "effective_num_docs": 198,
1090
+ "trust_dataset": true,
1091
+ "must_remove_duplicate_docs": null
1092
+ },
1093
+ "leaderboard|mmlu:high_school_government_and_politics": {
1094
+ "name": "mmlu:high_school_government_and_politics",
1095
+ "prompt_function": "mmlu_harness",
1096
+ "hf_repo": "lighteval/mmlu",
1097
+ "hf_subset": "high_school_government_and_politics",
1098
+ "metric": [
1099
+ "loglikelihood_acc"
1100
+ ],
1101
+ "hf_avail_splits": [
1102
+ "auxiliary_train",
1103
+ "test",
1104
+ "validation",
1105
+ "dev"
1106
+ ],
1107
+ "evaluation_splits": [
1108
+ "test"
1109
+ ],
1110
+ "few_shots_split": "dev",
1111
+ "few_shots_select": "sequential",
1112
+ "generation_size": 1,
1113
+ "stop_sequence": [
1114
+ "\n"
1115
+ ],
1116
+ "output_regex": null,
1117
+ "frozen": false,
1118
+ "suite": [
1119
+ "leaderboard",
1120
+ "mmlu"
1121
+ ],
1122
+ "original_num_docs": 193,
1123
+ "effective_num_docs": 193,
1124
+ "trust_dataset": true,
1125
+ "must_remove_duplicate_docs": null
1126
+ },
1127
+ "leaderboard|mmlu:high_school_macroeconomics": {
1128
+ "name": "mmlu:high_school_macroeconomics",
1129
+ "prompt_function": "mmlu_harness",
1130
+ "hf_repo": "lighteval/mmlu",
1131
+ "hf_subset": "high_school_macroeconomics",
1132
+ "metric": [
1133
+ "loglikelihood_acc"
1134
+ ],
1135
+ "hf_avail_splits": [
1136
+ "auxiliary_train",
1137
+ "test",
1138
+ "validation",
1139
+ "dev"
1140
+ ],
1141
+ "evaluation_splits": [
1142
+ "test"
1143
+ ],
1144
+ "few_shots_split": "dev",
1145
+ "few_shots_select": "sequential",
1146
+ "generation_size": 1,
1147
+ "stop_sequence": [
1148
+ "\n"
1149
+ ],
1150
+ "output_regex": null,
1151
+ "frozen": false,
1152
+ "suite": [
1153
+ "leaderboard",
1154
+ "mmlu"
1155
+ ],
1156
+ "original_num_docs": 390,
1157
+ "effective_num_docs": 390,
1158
+ "trust_dataset": true,
1159
+ "must_remove_duplicate_docs": null
1160
+ },
1161
+ "leaderboard|mmlu:high_school_mathematics": {
1162
+ "name": "mmlu:high_school_mathematics",
1163
+ "prompt_function": "mmlu_harness",
1164
+ "hf_repo": "lighteval/mmlu",
1165
+ "hf_subset": "high_school_mathematics",
1166
+ "metric": [
1167
+ "loglikelihood_acc"
1168
+ ],
1169
+ "hf_avail_splits": [
1170
+ "auxiliary_train",
1171
+ "test",
1172
+ "validation",
1173
+ "dev"
1174
+ ],
1175
+ "evaluation_splits": [
1176
+ "test"
1177
+ ],
1178
+ "few_shots_split": "dev",
1179
+ "few_shots_select": "sequential",
1180
+ "generation_size": 1,
1181
+ "stop_sequence": [
1182
+ "\n"
1183
+ ],
1184
+ "output_regex": null,
1185
+ "frozen": false,
1186
+ "suite": [
1187
+ "leaderboard",
1188
+ "mmlu"
1189
+ ],
1190
+ "original_num_docs": 270,
1191
+ "effective_num_docs": 270,
1192
+ "trust_dataset": true,
1193
+ "must_remove_duplicate_docs": null
1194
+ },
1195
+ "leaderboard|mmlu:high_school_microeconomics": {
1196
+ "name": "mmlu:high_school_microeconomics",
1197
+ "prompt_function": "mmlu_harness",
1198
+ "hf_repo": "lighteval/mmlu",
1199
+ "hf_subset": "high_school_microeconomics",
1200
+ "metric": [
1201
+ "loglikelihood_acc"
1202
+ ],
1203
+ "hf_avail_splits": [
1204
+ "auxiliary_train",
1205
+ "test",
1206
+ "validation",
1207
+ "dev"
1208
+ ],
1209
+ "evaluation_splits": [
1210
+ "test"
1211
+ ],
1212
+ "few_shots_split": "dev",
1213
+ "few_shots_select": "sequential",
1214
+ "generation_size": 1,
1215
+ "stop_sequence": [
1216
+ "\n"
1217
+ ],
1218
+ "output_regex": null,
1219
+ "frozen": false,
1220
+ "suite": [
1221
+ "leaderboard",
1222
+ "mmlu"
1223
+ ],
1224
+ "original_num_docs": 238,
1225
+ "effective_num_docs": 238,
1226
+ "trust_dataset": true,
1227
+ "must_remove_duplicate_docs": null
1228
+ },
1229
+ "leaderboard|mmlu:high_school_physics": {
1230
+ "name": "mmlu:high_school_physics",
1231
+ "prompt_function": "mmlu_harness",
1232
+ "hf_repo": "lighteval/mmlu",
1233
+ "hf_subset": "high_school_physics",
1234
+ "metric": [
1235
+ "loglikelihood_acc"
1236
+ ],
1237
+ "hf_avail_splits": [
1238
+ "auxiliary_train",
1239
+ "test",
1240
+ "validation",
1241
+ "dev"
1242
+ ],
1243
+ "evaluation_splits": [
1244
+ "test"
1245
+ ],
1246
+ "few_shots_split": "dev",
1247
+ "few_shots_select": "sequential",
1248
+ "generation_size": 1,
1249
+ "stop_sequence": [
1250
+ "\n"
1251
+ ],
1252
+ "output_regex": null,
1253
+ "frozen": false,
1254
+ "suite": [
1255
+ "leaderboard",
1256
+ "mmlu"
1257
+ ],
1258
+ "original_num_docs": 151,
1259
+ "effective_num_docs": 151,
1260
+ "trust_dataset": true,
1261
+ "must_remove_duplicate_docs": null
1262
+ },
1263
+ "leaderboard|mmlu:high_school_psychology": {
1264
+ "name": "mmlu:high_school_psychology",
1265
+ "prompt_function": "mmlu_harness",
1266
+ "hf_repo": "lighteval/mmlu",
1267
+ "hf_subset": "high_school_psychology",
1268
+ "metric": [
1269
+ "loglikelihood_acc"
1270
+ ],
1271
+ "hf_avail_splits": [
1272
+ "auxiliary_train",
1273
+ "test",
1274
+ "validation",
1275
+ "dev"
1276
+ ],
1277
+ "evaluation_splits": [
1278
+ "test"
1279
+ ],
1280
+ "few_shots_split": "dev",
1281
+ "few_shots_select": "sequential",
1282
+ "generation_size": 1,
1283
+ "stop_sequence": [
1284
+ "\n"
1285
+ ],
1286
+ "output_regex": null,
1287
+ "frozen": false,
1288
+ "suite": [
1289
+ "leaderboard",
1290
+ "mmlu"
1291
+ ],
1292
+ "original_num_docs": 545,
1293
+ "effective_num_docs": 545,
1294
+ "trust_dataset": true,
1295
+ "must_remove_duplicate_docs": null
1296
+ },
1297
+ "leaderboard|mmlu:high_school_statistics": {
1298
+ "name": "mmlu:high_school_statistics",
1299
+ "prompt_function": "mmlu_harness",
1300
+ "hf_repo": "lighteval/mmlu",
1301
+ "hf_subset": "high_school_statistics",
1302
+ "metric": [
1303
+ "loglikelihood_acc"
1304
+ ],
1305
+ "hf_avail_splits": [
1306
+ "auxiliary_train",
1307
+ "test",
1308
+ "validation",
1309
+ "dev"
1310
+ ],
1311
+ "evaluation_splits": [
1312
+ "test"
1313
+ ],
1314
+ "few_shots_split": "dev",
1315
+ "few_shots_select": "sequential",
1316
+ "generation_size": 1,
1317
+ "stop_sequence": [
1318
+ "\n"
1319
+ ],
1320
+ "output_regex": null,
1321
+ "frozen": false,
1322
+ "suite": [
1323
+ "leaderboard",
1324
+ "mmlu"
1325
+ ],
1326
+ "original_num_docs": 216,
1327
+ "effective_num_docs": 216,
1328
+ "trust_dataset": true,
1329
+ "must_remove_duplicate_docs": null
1330
+ },
1331
+ "leaderboard|mmlu:high_school_us_history": {
1332
+ "name": "mmlu:high_school_us_history",
1333
+ "prompt_function": "mmlu_harness",
1334
+ "hf_repo": "lighteval/mmlu",
1335
+ "hf_subset": "high_school_us_history",
1336
+ "metric": [
1337
+ "loglikelihood_acc"
1338
+ ],
1339
+ "hf_avail_splits": [
1340
+ "auxiliary_train",
1341
+ "test",
1342
+ "validation",
1343
+ "dev"
1344
+ ],
1345
+ "evaluation_splits": [
1346
+ "test"
1347
+ ],
1348
+ "few_shots_split": "dev",
1349
+ "few_shots_select": "sequential",
1350
+ "generation_size": 1,
1351
+ "stop_sequence": [
1352
+ "\n"
1353
+ ],
1354
+ "output_regex": null,
1355
+ "frozen": false,
1356
+ "suite": [
1357
+ "leaderboard",
1358
+ "mmlu"
1359
+ ],
1360
+ "original_num_docs": 204,
1361
+ "effective_num_docs": 204,
1362
+ "trust_dataset": true,
1363
+ "must_remove_duplicate_docs": null
1364
+ },
1365
+ "leaderboard|mmlu:high_school_world_history": {
1366
+ "name": "mmlu:high_school_world_history",
1367
+ "prompt_function": "mmlu_harness",
1368
+ "hf_repo": "lighteval/mmlu",
1369
+ "hf_subset": "high_school_world_history",
1370
+ "metric": [
1371
+ "loglikelihood_acc"
1372
+ ],
1373
+ "hf_avail_splits": [
1374
+ "auxiliary_train",
1375
+ "test",
1376
+ "validation",
1377
+ "dev"
1378
+ ],
1379
+ "evaluation_splits": [
1380
+ "test"
1381
+ ],
1382
+ "few_shots_split": "dev",
1383
+ "few_shots_select": "sequential",
1384
+ "generation_size": 1,
1385
+ "stop_sequence": [
1386
+ "\n"
1387
+ ],
1388
+ "output_regex": null,
1389
+ "frozen": false,
1390
+ "suite": [
1391
+ "leaderboard",
1392
+ "mmlu"
1393
+ ],
1394
+ "original_num_docs": 237,
1395
+ "effective_num_docs": 237,
1396
+ "trust_dataset": true,
1397
+ "must_remove_duplicate_docs": null
1398
+ },
1399
+ "leaderboard|mmlu:human_aging": {
1400
+ "name": "mmlu:human_aging",
1401
+ "prompt_function": "mmlu_harness",
1402
+ "hf_repo": "lighteval/mmlu",
1403
+ "hf_subset": "human_aging",
1404
+ "metric": [
1405
+ "loglikelihood_acc"
1406
+ ],
1407
+ "hf_avail_splits": [
1408
+ "auxiliary_train",
1409
+ "test",
1410
+ "validation",
1411
+ "dev"
1412
+ ],
1413
+ "evaluation_splits": [
1414
+ "test"
1415
+ ],
1416
+ "few_shots_split": "dev",
1417
+ "few_shots_select": "sequential",
1418
+ "generation_size": 1,
1419
+ "stop_sequence": [
1420
+ "\n"
1421
+ ],
1422
+ "output_regex": null,
1423
+ "frozen": false,
1424
+ "suite": [
1425
+ "leaderboard",
1426
+ "mmlu"
1427
+ ],
1428
+ "original_num_docs": 223,
1429
+ "effective_num_docs": 223,
1430
+ "trust_dataset": true,
1431
+ "must_remove_duplicate_docs": null
1432
+ },
1433
+ "leaderboard|mmlu:human_sexuality": {
1434
+ "name": "mmlu:human_sexuality",
1435
+ "prompt_function": "mmlu_harness",
1436
+ "hf_repo": "lighteval/mmlu",
1437
+ "hf_subset": "human_sexuality",
1438
+ "metric": [
1439
+ "loglikelihood_acc"
1440
+ ],
1441
+ "hf_avail_splits": [
1442
+ "auxiliary_train",
1443
+ "test",
1444
+ "validation",
1445
+ "dev"
1446
+ ],
1447
+ "evaluation_splits": [
1448
+ "test"
1449
+ ],
1450
+ "few_shots_split": "dev",
1451
+ "few_shots_select": "sequential",
1452
+ "generation_size": 1,
1453
+ "stop_sequence": [
1454
+ "\n"
1455
+ ],
1456
+ "output_regex": null,
1457
+ "frozen": false,
1458
+ "suite": [
1459
+ "leaderboard",
1460
+ "mmlu"
1461
+ ],
1462
+ "original_num_docs": 131,
1463
+ "effective_num_docs": 131,
1464
+ "trust_dataset": true,
1465
+ "must_remove_duplicate_docs": null
1466
+ },
1467
+ "leaderboard|mmlu:international_law": {
1468
+ "name": "mmlu:international_law",
1469
+ "prompt_function": "mmlu_harness",
1470
+ "hf_repo": "lighteval/mmlu",
1471
+ "hf_subset": "international_law",
1472
+ "metric": [
1473
+ "loglikelihood_acc"
1474
+ ],
1475
+ "hf_avail_splits": [
1476
+ "auxiliary_train",
1477
+ "test",
1478
+ "validation",
1479
+ "dev"
1480
+ ],
1481
+ "evaluation_splits": [
1482
+ "test"
1483
+ ],
1484
+ "few_shots_split": "dev",
1485
+ "few_shots_select": "sequential",
1486
+ "generation_size": 1,
1487
+ "stop_sequence": [
1488
+ "\n"
1489
+ ],
1490
+ "output_regex": null,
1491
+ "frozen": false,
1492
+ "suite": [
1493
+ "leaderboard",
1494
+ "mmlu"
1495
+ ],
1496
+ "original_num_docs": 121,
1497
+ "effective_num_docs": 121,
1498
+ "trust_dataset": true,
1499
+ "must_remove_duplicate_docs": null
1500
+ },
1501
+ "leaderboard|mmlu:jurisprudence": {
1502
+ "name": "mmlu:jurisprudence",
1503
+ "prompt_function": "mmlu_harness",
1504
+ "hf_repo": "lighteval/mmlu",
1505
+ "hf_subset": "jurisprudence",
1506
+ "metric": [
1507
+ "loglikelihood_acc"
1508
+ ],
1509
+ "hf_avail_splits": [
1510
+ "auxiliary_train",
1511
+ "test",
1512
+ "validation",
1513
+ "dev"
1514
+ ],
1515
+ "evaluation_splits": [
1516
+ "test"
1517
+ ],
1518
+ "few_shots_split": "dev",
1519
+ "few_shots_select": "sequential",
1520
+ "generation_size": 1,
1521
+ "stop_sequence": [
1522
+ "\n"
1523
+ ],
1524
+ "output_regex": null,
1525
+ "frozen": false,
1526
+ "suite": [
1527
+ "leaderboard",
1528
+ "mmlu"
1529
+ ],
1530
+ "original_num_docs": 108,
1531
+ "effective_num_docs": 108,
1532
+ "trust_dataset": true,
1533
+ "must_remove_duplicate_docs": null
1534
+ },
1535
+ "leaderboard|mmlu:logical_fallacies": {
1536
+ "name": "mmlu:logical_fallacies",
1537
+ "prompt_function": "mmlu_harness",
1538
+ "hf_repo": "lighteval/mmlu",
1539
+ "hf_subset": "logical_fallacies",
1540
+ "metric": [
1541
+ "loglikelihood_acc"
1542
+ ],
1543
+ "hf_avail_splits": [
1544
+ "auxiliary_train",
1545
+ "test",
1546
+ "validation",
1547
+ "dev"
1548
+ ],
1549
+ "evaluation_splits": [
1550
+ "test"
1551
+ ],
1552
+ "few_shots_split": "dev",
1553
+ "few_shots_select": "sequential",
1554
+ "generation_size": 1,
1555
+ "stop_sequence": [
1556
+ "\n"
1557
+ ],
1558
+ "output_regex": null,
1559
+ "frozen": false,
1560
+ "suite": [
1561
+ "leaderboard",
1562
+ "mmlu"
1563
+ ],
1564
+ "original_num_docs": 163,
1565
+ "effective_num_docs": 163,
1566
+ "trust_dataset": true,
1567
+ "must_remove_duplicate_docs": null
1568
+ },
1569
+ "leaderboard|mmlu:machine_learning": {
1570
+ "name": "mmlu:machine_learning",
1571
+ "prompt_function": "mmlu_harness",
1572
+ "hf_repo": "lighteval/mmlu",
1573
+ "hf_subset": "machine_learning",
1574
+ "metric": [
1575
+ "loglikelihood_acc"
1576
+ ],
1577
+ "hf_avail_splits": [
1578
+ "auxiliary_train",
1579
+ "test",
1580
+ "validation",
1581
+ "dev"
1582
+ ],
1583
+ "evaluation_splits": [
1584
+ "test"
1585
+ ],
1586
+ "few_shots_split": "dev",
1587
+ "few_shots_select": "sequential",
1588
+ "generation_size": 1,
1589
+ "stop_sequence": [
1590
+ "\n"
1591
+ ],
1592
+ "output_regex": null,
1593
+ "frozen": false,
1594
+ "suite": [
1595
+ "leaderboard",
1596
+ "mmlu"
1597
+ ],
1598
+ "original_num_docs": 112,
1599
+ "effective_num_docs": 112,
1600
+ "trust_dataset": true,
1601
+ "must_remove_duplicate_docs": null
1602
+ },
1603
+ "leaderboard|mmlu:management": {
1604
+ "name": "mmlu:management",
1605
+ "prompt_function": "mmlu_harness",
1606
+ "hf_repo": "lighteval/mmlu",
1607
+ "hf_subset": "management",
1608
+ "metric": [
1609
+ "loglikelihood_acc"
1610
+ ],
1611
+ "hf_avail_splits": [
1612
+ "auxiliary_train",
1613
+ "test",
1614
+ "validation",
1615
+ "dev"
1616
+ ],
1617
+ "evaluation_splits": [
1618
+ "test"
1619
+ ],
1620
+ "few_shots_split": "dev",
1621
+ "few_shots_select": "sequential",
1622
+ "generation_size": 1,
1623
+ "stop_sequence": [
1624
+ "\n"
1625
+ ],
1626
+ "output_regex": null,
1627
+ "frozen": false,
1628
+ "suite": [
1629
+ "leaderboard",
1630
+ "mmlu"
1631
+ ],
1632
+ "original_num_docs": 103,
1633
+ "effective_num_docs": 103,
1634
+ "trust_dataset": true,
1635
+ "must_remove_duplicate_docs": null
1636
+ },
1637
+ "leaderboard|mmlu:marketing": {
1638
+ "name": "mmlu:marketing",
1639
+ "prompt_function": "mmlu_harness",
1640
+ "hf_repo": "lighteval/mmlu",
1641
+ "hf_subset": "marketing",
1642
+ "metric": [
1643
+ "loglikelihood_acc"
1644
+ ],
1645
+ "hf_avail_splits": [
1646
+ "auxiliary_train",
1647
+ "test",
1648
+ "validation",
1649
+ "dev"
1650
+ ],
1651
+ "evaluation_splits": [
1652
+ "test"
1653
+ ],
1654
+ "few_shots_split": "dev",
1655
+ "few_shots_select": "sequential",
1656
+ "generation_size": 1,
1657
+ "stop_sequence": [
1658
+ "\n"
1659
+ ],
1660
+ "output_regex": null,
1661
+ "frozen": false,
1662
+ "suite": [
1663
+ "leaderboard",
1664
+ "mmlu"
1665
+ ],
1666
+ "original_num_docs": 234,
1667
+ "effective_num_docs": 234,
1668
+ "trust_dataset": true,
1669
+ "must_remove_duplicate_docs": null
1670
+ },
1671
+ "leaderboard|mmlu:medical_genetics": {
1672
+ "name": "mmlu:medical_genetics",
1673
+ "prompt_function": "mmlu_harness",
1674
+ "hf_repo": "lighteval/mmlu",
1675
+ "hf_subset": "medical_genetics",
1676
+ "metric": [
1677
+ "loglikelihood_acc"
1678
+ ],
1679
+ "hf_avail_splits": [
1680
+ "auxiliary_train",
1681
+ "test",
1682
+ "validation",
1683
+ "dev"
1684
+ ],
1685
+ "evaluation_splits": [
1686
+ "test"
1687
+ ],
1688
+ "few_shots_split": "dev",
1689
+ "few_shots_select": "sequential",
1690
+ "generation_size": 1,
1691
+ "stop_sequence": [
1692
+ "\n"
1693
+ ],
1694
+ "output_regex": null,
1695
+ "frozen": false,
1696
+ "suite": [
1697
+ "leaderboard",
1698
+ "mmlu"
1699
+ ],
1700
+ "original_num_docs": 100,
1701
+ "effective_num_docs": 100,
1702
+ "trust_dataset": true,
1703
+ "must_remove_duplicate_docs": null
1704
+ },
1705
+ "leaderboard|mmlu:miscellaneous": {
1706
+ "name": "mmlu:miscellaneous",
1707
+ "prompt_function": "mmlu_harness",
1708
+ "hf_repo": "lighteval/mmlu",
1709
+ "hf_subset": "miscellaneous",
1710
+ "metric": [
1711
+ "loglikelihood_acc"
1712
+ ],
1713
+ "hf_avail_splits": [
1714
+ "auxiliary_train",
1715
+ "test",
1716
+ "validation",
1717
+ "dev"
1718
+ ],
1719
+ "evaluation_splits": [
1720
+ "test"
1721
+ ],
1722
+ "few_shots_split": "dev",
1723
+ "few_shots_select": "sequential",
1724
+ "generation_size": 1,
1725
+ "stop_sequence": [
1726
+ "\n"
1727
+ ],
1728
+ "output_regex": null,
1729
+ "frozen": false,
1730
+ "suite": [
1731
+ "leaderboard",
1732
+ "mmlu"
1733
+ ],
1734
+ "original_num_docs": 783,
1735
+ "effective_num_docs": 783,
1736
+ "trust_dataset": true,
1737
+ "must_remove_duplicate_docs": null
1738
+ },
1739
+ "leaderboard|mmlu:moral_disputes": {
1740
+ "name": "mmlu:moral_disputes",
1741
+ "prompt_function": "mmlu_harness",
1742
+ "hf_repo": "lighteval/mmlu",
1743
+ "hf_subset": "moral_disputes",
1744
+ "metric": [
1745
+ "loglikelihood_acc"
1746
+ ],
1747
+ "hf_avail_splits": [
1748
+ "auxiliary_train",
1749
+ "test",
1750
+ "validation",
1751
+ "dev"
1752
+ ],
1753
+ "evaluation_splits": [
1754
+ "test"
1755
+ ],
1756
+ "few_shots_split": "dev",
1757
+ "few_shots_select": "sequential",
1758
+ "generation_size": 1,
1759
+ "stop_sequence": [
1760
+ "\n"
1761
+ ],
1762
+ "output_regex": null,
1763
+ "frozen": false,
1764
+ "suite": [
1765
+ "leaderboard",
1766
+ "mmlu"
1767
+ ],
1768
+ "original_num_docs": 346,
1769
+ "effective_num_docs": 346,
1770
+ "trust_dataset": true,
1771
+ "must_remove_duplicate_docs": null
1772
+ },
1773
+ "leaderboard|mmlu:moral_scenarios": {
1774
+ "name": "mmlu:moral_scenarios",
1775
+ "prompt_function": "mmlu_harness",
1776
+ "hf_repo": "lighteval/mmlu",
1777
+ "hf_subset": "moral_scenarios",
1778
+ "metric": [
1779
+ "loglikelihood_acc"
1780
+ ],
1781
+ "hf_avail_splits": [
1782
+ "auxiliary_train",
1783
+ "test",
1784
+ "validation",
1785
+ "dev"
1786
+ ],
1787
+ "evaluation_splits": [
1788
+ "test"
1789
+ ],
1790
+ "few_shots_split": "dev",
1791
+ "few_shots_select": "sequential",
1792
+ "generation_size": 1,
1793
+ "stop_sequence": [
1794
+ "\n"
1795
+ ],
1796
+ "output_regex": null,
1797
+ "frozen": false,
1798
+ "suite": [
1799
+ "leaderboard",
1800
+ "mmlu"
1801
+ ],
1802
+ "original_num_docs": 895,
1803
+ "effective_num_docs": 895,
1804
+ "trust_dataset": true,
1805
+ "must_remove_duplicate_docs": null
1806
+ },
1807
+ "leaderboard|mmlu:nutrition": {
1808
+ "name": "mmlu:nutrition",
1809
+ "prompt_function": "mmlu_harness",
1810
+ "hf_repo": "lighteval/mmlu",
1811
+ "hf_subset": "nutrition",
1812
+ "metric": [
1813
+ "loglikelihood_acc"
1814
+ ],
1815
+ "hf_avail_splits": [
1816
+ "auxiliary_train",
1817
+ "test",
1818
+ "validation",
1819
+ "dev"
1820
+ ],
1821
+ "evaluation_splits": [
1822
+ "test"
1823
+ ],
1824
+ "few_shots_split": "dev",
1825
+ "few_shots_select": "sequential",
1826
+ "generation_size": 1,
1827
+ "stop_sequence": [
1828
+ "\n"
1829
+ ],
1830
+ "output_regex": null,
1831
+ "frozen": false,
1832
+ "suite": [
1833
+ "leaderboard",
1834
+ "mmlu"
1835
+ ],
1836
+ "original_num_docs": 306,
1837
+ "effective_num_docs": 306,
1838
+ "trust_dataset": true,
1839
+ "must_remove_duplicate_docs": null
1840
+ },
1841
+ "leaderboard|mmlu:philosophy": {
1842
+ "name": "mmlu:philosophy",
1843
+ "prompt_function": "mmlu_harness",
1844
+ "hf_repo": "lighteval/mmlu",
1845
+ "hf_subset": "philosophy",
1846
+ "metric": [
1847
+ "loglikelihood_acc"
1848
+ ],
1849
+ "hf_avail_splits": [
1850
+ "auxiliary_train",
1851
+ "test",
1852
+ "validation",
1853
+ "dev"
1854
+ ],
1855
+ "evaluation_splits": [
1856
+ "test"
1857
+ ],
1858
+ "few_shots_split": "dev",
1859
+ "few_shots_select": "sequential",
1860
+ "generation_size": 1,
1861
+ "stop_sequence": [
1862
+ "\n"
1863
+ ],
1864
+ "output_regex": null,
1865
+ "frozen": false,
1866
+ "suite": [
1867
+ "leaderboard",
1868
+ "mmlu"
1869
+ ],
1870
+ "original_num_docs": 311,
1871
+ "effective_num_docs": 311,
1872
+ "trust_dataset": true,
1873
+ "must_remove_duplicate_docs": null
1874
+ },
1875
+ "leaderboard|mmlu:prehistory": {
1876
+ "name": "mmlu:prehistory",
1877
+ "prompt_function": "mmlu_harness",
1878
+ "hf_repo": "lighteval/mmlu",
1879
+ "hf_subset": "prehistory",
1880
+ "metric": [
1881
+ "loglikelihood_acc"
1882
+ ],
1883
+ "hf_avail_splits": [
1884
+ "auxiliary_train",
1885
+ "test",
1886
+ "validation",
1887
+ "dev"
1888
+ ],
1889
+ "evaluation_splits": [
1890
+ "test"
1891
+ ],
1892
+ "few_shots_split": "dev",
1893
+ "few_shots_select": "sequential",
1894
+ "generation_size": 1,
1895
+ "stop_sequence": [
1896
+ "\n"
1897
+ ],
1898
+ "output_regex": null,
1899
+ "frozen": false,
1900
+ "suite": [
1901
+ "leaderboard",
1902
+ "mmlu"
1903
+ ],
1904
+ "original_num_docs": 324,
1905
+ "effective_num_docs": 324,
1906
+ "trust_dataset": true,
1907
+ "must_remove_duplicate_docs": null
1908
+ },
1909
+ "leaderboard|mmlu:professional_accounting": {
1910
+ "name": "mmlu:professional_accounting",
1911
+ "prompt_function": "mmlu_harness",
1912
+ "hf_repo": "lighteval/mmlu",
1913
+ "hf_subset": "professional_accounting",
1914
+ "metric": [
1915
+ "loglikelihood_acc"
1916
+ ],
1917
+ "hf_avail_splits": [
1918
+ "auxiliary_train",
1919
+ "test",
1920
+ "validation",
1921
+ "dev"
1922
+ ],
1923
+ "evaluation_splits": [
1924
+ "test"
1925
+ ],
1926
+ "few_shots_split": "dev",
1927
+ "few_shots_select": "sequential",
1928
+ "generation_size": 1,
1929
+ "stop_sequence": [
1930
+ "\n"
1931
+ ],
1932
+ "output_regex": null,
1933
+ "frozen": false,
1934
+ "suite": [
1935
+ "leaderboard",
1936
+ "mmlu"
1937
+ ],
1938
+ "original_num_docs": 282,
1939
+ "effective_num_docs": 282,
1940
+ "trust_dataset": true,
1941
+ "must_remove_duplicate_docs": null
1942
+ },
1943
+ "leaderboard|mmlu:professional_law": {
1944
+ "name": "mmlu:professional_law",
1945
+ "prompt_function": "mmlu_harness",
1946
+ "hf_repo": "lighteval/mmlu",
1947
+ "hf_subset": "professional_law",
1948
+ "metric": [
1949
+ "loglikelihood_acc"
1950
+ ],
1951
+ "hf_avail_splits": [
1952
+ "auxiliary_train",
1953
+ "test",
1954
+ "validation",
1955
+ "dev"
1956
+ ],
1957
+ "evaluation_splits": [
1958
+ "test"
1959
+ ],
1960
+ "few_shots_split": "dev",
1961
+ "few_shots_select": "sequential",
1962
+ "generation_size": 1,
1963
+ "stop_sequence": [
1964
+ "\n"
1965
+ ],
1966
+ "output_regex": null,
1967
+ "frozen": false,
1968
+ "suite": [
1969
+ "leaderboard",
1970
+ "mmlu"
1971
+ ],
1972
+ "original_num_docs": 1534,
1973
+ "effective_num_docs": 1534,
1974
+ "trust_dataset": true,
1975
+ "must_remove_duplicate_docs": null
1976
+ },
1977
+ "leaderboard|mmlu:professional_medicine": {
1978
+ "name": "mmlu:professional_medicine",
1979
+ "prompt_function": "mmlu_harness",
1980
+ "hf_repo": "lighteval/mmlu",
1981
+ "hf_subset": "professional_medicine",
1982
+ "metric": [
1983
+ "loglikelihood_acc"
1984
+ ],
1985
+ "hf_avail_splits": [
1986
+ "auxiliary_train",
1987
+ "test",
1988
+ "validation",
1989
+ "dev"
1990
+ ],
1991
+ "evaluation_splits": [
1992
+ "test"
1993
+ ],
1994
+ "few_shots_split": "dev",
1995
+ "few_shots_select": "sequential",
1996
+ "generation_size": 1,
1997
+ "stop_sequence": [
1998
+ "\n"
1999
+ ],
2000
+ "output_regex": null,
2001
+ "frozen": false,
2002
+ "suite": [
2003
+ "leaderboard",
2004
+ "mmlu"
2005
+ ],
2006
+ "original_num_docs": 272,
2007
+ "effective_num_docs": 272,
2008
+ "trust_dataset": true,
2009
+ "must_remove_duplicate_docs": null
2010
+ },
2011
+ "leaderboard|mmlu:professional_psychology": {
2012
+ "name": "mmlu:professional_psychology",
2013
+ "prompt_function": "mmlu_harness",
2014
+ "hf_repo": "lighteval/mmlu",
2015
+ "hf_subset": "professional_psychology",
2016
+ "metric": [
2017
+ "loglikelihood_acc"
2018
+ ],
2019
+ "hf_avail_splits": [
2020
+ "auxiliary_train",
2021
+ "test",
2022
+ "validation",
2023
+ "dev"
2024
+ ],
2025
+ "evaluation_splits": [
2026
+ "test"
2027
+ ],
2028
+ "few_shots_split": "dev",
2029
+ "few_shots_select": "sequential",
2030
+ "generation_size": 1,
2031
+ "stop_sequence": [
2032
+ "\n"
2033
+ ],
2034
+ "output_regex": null,
2035
+ "frozen": false,
2036
+ "suite": [
2037
+ "leaderboard",
2038
+ "mmlu"
2039
+ ],
2040
+ "original_num_docs": 612,
2041
+ "effective_num_docs": 612,
2042
+ "trust_dataset": true,
2043
+ "must_remove_duplicate_docs": null
2044
+ },
2045
+ "leaderboard|mmlu:public_relations": {
2046
+ "name": "mmlu:public_relations",
2047
+ "prompt_function": "mmlu_harness",
2048
+ "hf_repo": "lighteval/mmlu",
2049
+ "hf_subset": "public_relations",
2050
+ "metric": [
2051
+ "loglikelihood_acc"
2052
+ ],
2053
+ "hf_avail_splits": [
2054
+ "auxiliary_train",
2055
+ "test",
2056
+ "validation",
2057
+ "dev"
2058
+ ],
2059
+ "evaluation_splits": [
2060
+ "test"
2061
+ ],
2062
+ "few_shots_split": "dev",
2063
+ "few_shots_select": "sequential",
2064
+ "generation_size": 1,
2065
+ "stop_sequence": [
2066
+ "\n"
2067
+ ],
2068
+ "output_regex": null,
2069
+ "frozen": false,
2070
+ "suite": [
2071
+ "leaderboard",
2072
+ "mmlu"
2073
+ ],
2074
+ "original_num_docs": 110,
2075
+ "effective_num_docs": 110,
2076
+ "trust_dataset": true,
2077
+ "must_remove_duplicate_docs": null
2078
+ },
2079
+ "leaderboard|mmlu:security_studies": {
2080
+ "name": "mmlu:security_studies",
2081
+ "prompt_function": "mmlu_harness",
2082
+ "hf_repo": "lighteval/mmlu",
2083
+ "hf_subset": "security_studies",
2084
+ "metric": [
2085
+ "loglikelihood_acc"
2086
+ ],
2087
+ "hf_avail_splits": [
2088
+ "auxiliary_train",
2089
+ "test",
2090
+ "validation",
2091
+ "dev"
2092
+ ],
2093
+ "evaluation_splits": [
2094
+ "test"
2095
+ ],
2096
+ "few_shots_split": "dev",
2097
+ "few_shots_select": "sequential",
2098
+ "generation_size": 1,
2099
+ "stop_sequence": [
2100
+ "\n"
2101
+ ],
2102
+ "output_regex": null,
2103
+ "frozen": false,
2104
+ "suite": [
2105
+ "leaderboard",
2106
+ "mmlu"
2107
+ ],
2108
+ "original_num_docs": 245,
2109
+ "effective_num_docs": 245,
2110
+ "trust_dataset": true,
2111
+ "must_remove_duplicate_docs": null
2112
+ },
2113
+ "leaderboard|mmlu:sociology": {
2114
+ "name": "mmlu:sociology",
2115
+ "prompt_function": "mmlu_harness",
2116
+ "hf_repo": "lighteval/mmlu",
2117
+ "hf_subset": "sociology",
2118
+ "metric": [
2119
+ "loglikelihood_acc"
2120
+ ],
2121
+ "hf_avail_splits": [
2122
+ "auxiliary_train",
2123
+ "test",
2124
+ "validation",
2125
+ "dev"
2126
+ ],
2127
+ "evaluation_splits": [
2128
+ "test"
2129
+ ],
2130
+ "few_shots_split": "dev",
2131
+ "few_shots_select": "sequential",
2132
+ "generation_size": 1,
2133
+ "stop_sequence": [
2134
+ "\n"
2135
+ ],
2136
+ "output_regex": null,
2137
+ "frozen": false,
2138
+ "suite": [
2139
+ "leaderboard",
2140
+ "mmlu"
2141
+ ],
2142
+ "original_num_docs": 201,
2143
+ "effective_num_docs": 201,
2144
+ "trust_dataset": true,
2145
+ "must_remove_duplicate_docs": null
2146
+ },
2147
+ "leaderboard|mmlu:us_foreign_policy": {
2148
+ "name": "mmlu:us_foreign_policy",
2149
+ "prompt_function": "mmlu_harness",
2150
+ "hf_repo": "lighteval/mmlu",
2151
+ "hf_subset": "us_foreign_policy",
2152
+ "metric": [
2153
+ "loglikelihood_acc"
2154
+ ],
2155
+ "hf_avail_splits": [
2156
+ "auxiliary_train",
2157
+ "test",
2158
+ "validation",
2159
+ "dev"
2160
+ ],
2161
+ "evaluation_splits": [
2162
+ "test"
2163
+ ],
2164
+ "few_shots_split": "dev",
2165
+ "few_shots_select": "sequential",
2166
+ "generation_size": 1,
2167
+ "stop_sequence": [
2168
+ "\n"
2169
+ ],
2170
+ "output_regex": null,
2171
+ "frozen": false,
2172
+ "suite": [
2173
+ "leaderboard",
2174
+ "mmlu"
2175
+ ],
2176
+ "original_num_docs": 100,
2177
+ "effective_num_docs": 100,
2178
+ "trust_dataset": true,
2179
+ "must_remove_duplicate_docs": null
2180
+ },
2181
+ "leaderboard|mmlu:virology": {
2182
+ "name": "mmlu:virology",
2183
+ "prompt_function": "mmlu_harness",
2184
+ "hf_repo": "lighteval/mmlu",
2185
+ "hf_subset": "virology",
2186
+ "metric": [
2187
+ "loglikelihood_acc"
2188
+ ],
2189
+ "hf_avail_splits": [
2190
+ "auxiliary_train",
2191
+ "test",
2192
+ "validation",
2193
+ "dev"
2194
+ ],
2195
+ "evaluation_splits": [
2196
+ "test"
2197
+ ],
2198
+ "few_shots_split": "dev",
2199
+ "few_shots_select": "sequential",
2200
+ "generation_size": 1,
2201
+ "stop_sequence": [
2202
+ "\n"
2203
+ ],
2204
+ "output_regex": null,
2205
+ "frozen": false,
2206
+ "suite": [
2207
+ "leaderboard",
2208
+ "mmlu"
2209
+ ],
2210
+ "original_num_docs": 166,
2211
+ "effective_num_docs": 166,
2212
+ "trust_dataset": true,
2213
+ "must_remove_duplicate_docs": null
2214
+ },
2215
+ "leaderboard|mmlu:world_religions": {
2216
+ "name": "mmlu:world_religions",
2217
+ "prompt_function": "mmlu_harness",
2218
+ "hf_repo": "lighteval/mmlu",
2219
+ "hf_subset": "world_religions",
2220
+ "metric": [
2221
+ "loglikelihood_acc"
2222
+ ],
2223
+ "hf_avail_splits": [
2224
+ "auxiliary_train",
2225
+ "test",
2226
+ "validation",
2227
+ "dev"
2228
+ ],
2229
+ "evaluation_splits": [
2230
+ "test"
2231
+ ],
2232
+ "few_shots_split": "dev",
2233
+ "few_shots_select": "sequential",
2234
+ "generation_size": 1,
2235
+ "stop_sequence": [
2236
+ "\n"
2237
+ ],
2238
+ "output_regex": null,
2239
+ "frozen": false,
2240
+ "suite": [
2241
+ "leaderboard",
2242
+ "mmlu"
2243
+ ],
2244
+ "original_num_docs": 171,
2245
+ "effective_num_docs": 171,
2246
+ "trust_dataset": true,
2247
+ "must_remove_duplicate_docs": null
2248
+ }
2249
+ },
2250
+ "summary_tasks": {
2251
+ "leaderboard|mmlu:abstract_algebra|5": {
2252
+ "hashes": {
2253
+ "hash_examples": "4c76229e00c9c0e9",
2254
+ "hash_full_prompts": "273278cb9fb5ac01",
2255
+ "hash_input_tokens": "caf9777ccf71eab5",
2256
+ "hash_cont_tokens": "00520b0ec06da34f"
2257
+ },
2258
+ "truncated": 0,
2259
+ "non_truncated": 100,
2260
+ "padded": 400,
2261
+ "non_padded": 0,
2262
+ "effective_few_shots": 5.0,
2263
+ "num_truncated_few_shots": 0
2264
+ },
2265
+ "leaderboard|mmlu:anatomy|5": {
2266
+ "hashes": {
2267
+ "hash_examples": "6a1f8104dccbd33b",
2268
+ "hash_full_prompts": "e77b5ebe030aabba",
2269
+ "hash_input_tokens": "d192cd7584fda4dc",
2270
+ "hash_cont_tokens": "263324e6ce7f9b36"
2271
+ },
2272
+ "truncated": 0,
2273
+ "non_truncated": 135,
2274
+ "padded": 540,
2275
+ "non_padded": 0,
2276
+ "effective_few_shots": 5.0,
2277
+ "num_truncated_few_shots": 0
2278
+ },
2279
+ "leaderboard|mmlu:astronomy|5": {
2280
+ "hashes": {
2281
+ "hash_examples": "1302effa3a76ce4c",
2282
+ "hash_full_prompts": "0ff37ef4519e63f9",
2283
+ "hash_input_tokens": "d241783f0bfdf860",
2284
+ "hash_cont_tokens": "18ba399c6801138e"
2285
+ },
2286
+ "truncated": 0,
2287
+ "non_truncated": 152,
2288
+ "padded": 608,
2289
+ "non_padded": 0,
2290
+ "effective_few_shots": 5.0,
2291
+ "num_truncated_few_shots": 0
2292
+ },
2293
+ "leaderboard|mmlu:business_ethics|5": {
2294
+ "hashes": {
2295
+ "hash_examples": "03cb8bce5336419a",
2296
+ "hash_full_prompts": "7c4d312a23bdd669",
2297
+ "hash_input_tokens": "0aee5ed969278926",
2298
+ "hash_cont_tokens": "00520b0ec06da34f"
2299
+ },
2300
+ "truncated": 0,
2301
+ "non_truncated": 100,
2302
+ "padded": 400,
2303
+ "non_padded": 0,
2304
+ "effective_few_shots": 5.0,
2305
+ "num_truncated_few_shots": 0
2306
+ },
2307
+ "leaderboard|mmlu:clinical_knowledge|5": {
2308
+ "hashes": {
2309
+ "hash_examples": "ffbb9c7b2be257f9",
2310
+ "hash_full_prompts": "472d93369b1a8382",
2311
+ "hash_input_tokens": "aa05960be77863d3",
2312
+ "hash_cont_tokens": "9d7500060e0dd995"
2313
+ },
2314
+ "truncated": 0,
2315
+ "non_truncated": 265,
2316
+ "padded": 1060,
2317
+ "non_padded": 0,
2318
+ "effective_few_shots": 5.0,
2319
+ "num_truncated_few_shots": 0
2320
+ },
2321
+ "leaderboard|mmlu:college_biology|5": {
2322
+ "hashes": {
2323
+ "hash_examples": "3ee77f176f38eb8e",
2324
+ "hash_full_prompts": "6853bf027b349083",
2325
+ "hash_input_tokens": "3843b5375a04262c",
2326
+ "hash_cont_tokens": "78a731af5d2f6472"
2327
+ },
2328
+ "truncated": 0,
2329
+ "non_truncated": 144,
2330
+ "padded": 576,
2331
+ "non_padded": 0,
2332
+ "effective_few_shots": 5.0,
2333
+ "num_truncated_few_shots": 0
2334
+ },
2335
+ "leaderboard|mmlu:college_chemistry|5": {
2336
+ "hashes": {
2337
+ "hash_examples": "ce61a69c46d47aeb",
2338
+ "hash_full_prompts": "e0f8624971f7af71",
2339
+ "hash_input_tokens": "2096d1652e232764",
2340
+ "hash_cont_tokens": "00520b0ec06da34f"
2341
+ },
2342
+ "truncated": 0,
2343
+ "non_truncated": 100,
2344
+ "padded": 400,
2345
+ "non_padded": 0,
2346
+ "effective_few_shots": 5.0,
2347
+ "num_truncated_few_shots": 0
2348
+ },
2349
+ "leaderboard|mmlu:college_computer_science|5": {
2350
+ "hashes": {
2351
+ "hash_examples": "32805b52d7d5daab",
2352
+ "hash_full_prompts": "841e9d2ecfbb104d",
2353
+ "hash_input_tokens": "1e007ac047722e9b",
2354
+ "hash_cont_tokens": "00520b0ec06da34f"
2355
+ },
2356
+ "truncated": 0,
2357
+ "non_truncated": 100,
2358
+ "padded": 400,
2359
+ "non_padded": 0,
2360
+ "effective_few_shots": 5.0,
2361
+ "num_truncated_few_shots": 0
2362
+ },
2363
+ "leaderboard|mmlu:college_mathematics|5": {
2364
+ "hashes": {
2365
+ "hash_examples": "55da1a0a0bd33722",
2366
+ "hash_full_prompts": "696c5f73522b8706",
2367
+ "hash_input_tokens": "c3061d57b5a4ad7e",
2368
+ "hash_cont_tokens": "00520b0ec06da34f"
2369
+ },
2370
+ "truncated": 0,
2371
+ "non_truncated": 100,
2372
+ "padded": 400,
2373
+ "non_padded": 0,
2374
+ "effective_few_shots": 5.0,
2375
+ "num_truncated_few_shots": 0
2376
+ },
2377
+ "leaderboard|mmlu:college_medicine|5": {
2378
+ "hashes": {
2379
+ "hash_examples": "c33e143163049176",
2380
+ "hash_full_prompts": "7d2530816f672426",
2381
+ "hash_input_tokens": "4cddd091001776d7",
2382
+ "hash_cont_tokens": "699c8eb24e3e446b"
2383
+ },
2384
+ "truncated": 0,
2385
+ "non_truncated": 173,
2386
+ "padded": 692,
2387
+ "non_padded": 0,
2388
+ "effective_few_shots": 5.0,
2389
+ "num_truncated_few_shots": 0
2390
+ },
2391
+ "leaderboard|mmlu:college_physics|5": {
2392
+ "hashes": {
2393
+ "hash_examples": "ebdab1cdb7e555df",
2394
+ "hash_full_prompts": "66b3a61507c4c92b",
2395
+ "hash_input_tokens": "821b169941167548",
2396
+ "hash_cont_tokens": "075997110cbe055e"
2397
+ },
2398
+ "truncated": 0,
2399
+ "non_truncated": 102,
2400
+ "padded": 408,
2401
+ "non_padded": 0,
2402
+ "effective_few_shots": 5.0,
2403
+ "num_truncated_few_shots": 0
2404
+ },
2405
+ "leaderboard|mmlu:computer_security|5": {
2406
+ "hashes": {
2407
+ "hash_examples": "a24fd7d08a560921",
2408
+ "hash_full_prompts": "f1143da88158bf03",
2409
+ "hash_input_tokens": "02e64465d74344b4",
2410
+ "hash_cont_tokens": "00520b0ec06da34f"
2411
+ },
2412
+ "truncated": 0,
2413
+ "non_truncated": 100,
2414
+ "padded": 400,
2415
+ "non_padded": 0,
2416
+ "effective_few_shots": 5.0,
2417
+ "num_truncated_few_shots": 0
2418
+ },
2419
+ "leaderboard|mmlu:conceptual_physics|5": {
2420
+ "hashes": {
2421
+ "hash_examples": "8300977a79386993",
2422
+ "hash_full_prompts": "d2b4c706b65a71d9",
2423
+ "hash_input_tokens": "5c7a2235529d2821",
2424
+ "hash_cont_tokens": "f22daa6d4818086f"
2425
+ },
2426
+ "truncated": 0,
2427
+ "non_truncated": 235,
2428
+ "padded": 940,
2429
+ "non_padded": 0,
2430
+ "effective_few_shots": 5.0,
2431
+ "num_truncated_few_shots": 0
2432
+ },
2433
+ "leaderboard|mmlu:econometrics|5": {
2434
+ "hashes": {
2435
+ "hash_examples": "ddde36788a04a46f",
2436
+ "hash_full_prompts": "aa5255d923b0e3a3",
2437
+ "hash_input_tokens": "e0a79ea9e037599d",
2438
+ "hash_cont_tokens": "26791a0b1941b4c4"
2439
+ },
2440
+ "truncated": 0,
2441
+ "non_truncated": 114,
2442
+ "padded": 456,
2443
+ "non_padded": 0,
2444
+ "effective_few_shots": 5.0,
2445
+ "num_truncated_few_shots": 0
2446
+ },
2447
+ "leaderboard|mmlu:electrical_engineering|5": {
2448
+ "hashes": {
2449
+ "hash_examples": "acbc5def98c19b3f",
2450
+ "hash_full_prompts": "c1f9a9087987d1d7",
2451
+ "hash_input_tokens": "e48ddb58b2efa8e3",
2452
+ "hash_cont_tokens": "3e336577994f6c0d"
2453
+ },
2454
+ "truncated": 0,
2455
+ "non_truncated": 145,
2456
+ "padded": 580,
2457
+ "non_padded": 0,
2458
+ "effective_few_shots": 5.0,
2459
+ "num_truncated_few_shots": 0
2460
+ },
2461
+ "leaderboard|mmlu:elementary_mathematics|5": {
2462
+ "hashes": {
2463
+ "hash_examples": "146e61d07497a9bd",
2464
+ "hash_full_prompts": "57fb9ddf2f814bb5",
2465
+ "hash_input_tokens": "9e81373b5265da10",
2466
+ "hash_cont_tokens": "1d6bbfa8a67327c8"
2467
+ },
2468
+ "truncated": 0,
2469
+ "non_truncated": 378,
2470
+ "padded": 1512,
2471
+ "non_padded": 0,
2472
+ "effective_few_shots": 5.0,
2473
+ "num_truncated_few_shots": 0
2474
+ },
2475
+ "leaderboard|mmlu:formal_logic|5": {
2476
+ "hashes": {
2477
+ "hash_examples": "8635216e1909a03f",
2478
+ "hash_full_prompts": "dc7e34e04346adfd",
2479
+ "hash_input_tokens": "0378ed1f1a9bb3f6",
2480
+ "hash_cont_tokens": "60508d85eb7693a4"
2481
+ },
2482
+ "truncated": 0,
2483
+ "non_truncated": 126,
2484
+ "padded": 504,
2485
+ "non_padded": 0,
2486
+ "effective_few_shots": 5.0,
2487
+ "num_truncated_few_shots": 0
2488
+ },
2489
+ "leaderboard|mmlu:global_facts|5": {
2490
+ "hashes": {
2491
+ "hash_examples": "30b315aa6353ee47",
2492
+ "hash_full_prompts": "7dedb5baa45f3a38",
2493
+ "hash_input_tokens": "d20db9bd82fb76c1",
2494
+ "hash_cont_tokens": "00520b0ec06da34f"
2495
+ },
2496
+ "truncated": 0,
2497
+ "non_truncated": 100,
2498
+ "padded": 400,
2499
+ "non_padded": 0,
2500
+ "effective_few_shots": 5.0,
2501
+ "num_truncated_few_shots": 0
2502
+ },
2503
+ "leaderboard|mmlu:high_school_biology|5": {
2504
+ "hashes": {
2505
+ "hash_examples": "c9136373af2180de",
2506
+ "hash_full_prompts": "15157813fc668acf",
2507
+ "hash_input_tokens": "c3c10eef8c477c93",
2508
+ "hash_cont_tokens": "d236ce982144e65f"
2509
+ },
2510
+ "truncated": 0,
2511
+ "non_truncated": 310,
2512
+ "padded": 1240,
2513
+ "non_padded": 0,
2514
+ "effective_few_shots": 5.0,
2515
+ "num_truncated_few_shots": 0
2516
+ },
2517
+ "leaderboard|mmlu:high_school_chemistry|5": {
2518
+ "hashes": {
2519
+ "hash_examples": "b0661bfa1add6404",
2520
+ "hash_full_prompts": "f51dfd92a2d6fdba",
2521
+ "hash_input_tokens": "dc53c87961ef4ab5",
2522
+ "hash_cont_tokens": "59f93238ec5aead6"
2523
+ },
2524
+ "truncated": 0,
2525
+ "non_truncated": 203,
2526
+ "padded": 812,
2527
+ "non_padded": 0,
2528
+ "effective_few_shots": 5.0,
2529
+ "num_truncated_few_shots": 0
2530
+ },
2531
+ "leaderboard|mmlu:high_school_computer_science|5": {
2532
+ "hashes": {
2533
+ "hash_examples": "80fc1d623a3d665f",
2534
+ "hash_full_prompts": "fe432a03fe8cc766",
2535
+ "hash_input_tokens": "61fa356c3ea98372",
2536
+ "hash_cont_tokens": "00520b0ec06da34f"
2537
+ },
2538
+ "truncated": 0,
2539
+ "non_truncated": 100,
2540
+ "padded": 400,
2541
+ "non_padded": 0,
2542
+ "effective_few_shots": 5.0,
2543
+ "num_truncated_few_shots": 0
2544
+ },
2545
+ "leaderboard|mmlu:high_school_european_history|5": {
2546
+ "hashes": {
2547
+ "hash_examples": "854da6e5af0fe1a1",
2548
+ "hash_full_prompts": "09a62e1560fb1171",
2549
+ "hash_input_tokens": "272f8d31300ef0af",
2550
+ "hash_cont_tokens": "7b7414d6a5da3d91"
2551
+ },
2552
+ "truncated": 0,
2553
+ "non_truncated": 165,
2554
+ "padded": 656,
2555
+ "non_padded": 4,
2556
+ "effective_few_shots": 5.0,
2557
+ "num_truncated_few_shots": 0
2558
+ },
2559
+ "leaderboard|mmlu:high_school_geography|5": {
2560
+ "hashes": {
2561
+ "hash_examples": "7dc963c7acd19ad8",
2562
+ "hash_full_prompts": "8284151c76cee4d8",
2563
+ "hash_input_tokens": "12624aed9bf6356b",
2564
+ "hash_cont_tokens": "1b66289e10988f84"
2565
+ },
2566
+ "truncated": 0,
2567
+ "non_truncated": 198,
2568
+ "padded": 792,
2569
+ "non_padded": 0,
2570
+ "effective_few_shots": 5.0,
2571
+ "num_truncated_few_shots": 0
2572
+ },
2573
+ "leaderboard|mmlu:high_school_government_and_politics|5": {
2574
+ "hashes": {
2575
+ "hash_examples": "1f675dcdebc9758f",
2576
+ "hash_full_prompts": "083339a69a8bfafa",
2577
+ "hash_input_tokens": "32e30c43a4a5347e",
2578
+ "hash_cont_tokens": "5ab3c3415b1d3a55"
2579
+ },
2580
+ "truncated": 0,
2581
+ "non_truncated": 193,
2582
+ "padded": 772,
2583
+ "non_padded": 0,
2584
+ "effective_few_shots": 5.0,
2585
+ "num_truncated_few_shots": 0
2586
+ },
2587
+ "leaderboard|mmlu:high_school_macroeconomics|5": {
2588
+ "hashes": {
2589
+ "hash_examples": "2fb32cf2d80f0b35",
2590
+ "hash_full_prompts": "ececedb0c4a4ffcd",
2591
+ "hash_input_tokens": "dc2cd6b398f5f86e",
2592
+ "hash_cont_tokens": "2f5457058d187374"
2593
+ },
2594
+ "truncated": 0,
2595
+ "non_truncated": 390,
2596
+ "padded": 1557,
2597
+ "non_padded": 3,
2598
+ "effective_few_shots": 5.0,
2599
+ "num_truncated_few_shots": 0
2600
+ },
2601
+ "leaderboard|mmlu:high_school_mathematics|5": {
2602
+ "hashes": {
2603
+ "hash_examples": "fd6646fdb5d58a1f",
2604
+ "hash_full_prompts": "d58a3ca5c8ed6780",
2605
+ "hash_input_tokens": "6f9c5ce6428dd87d",
2606
+ "hash_cont_tokens": "e35137cb972e1918"
2607
+ },
2608
+ "truncated": 0,
2609
+ "non_truncated": 270,
2610
+ "padded": 1080,
2611
+ "non_padded": 0,
2612
+ "effective_few_shots": 5.0,
2613
+ "num_truncated_few_shots": 0
2614
+ },
2615
+ "leaderboard|mmlu:high_school_microeconomics|5": {
2616
+ "hashes": {
2617
+ "hash_examples": "2118f21f71d87d84",
2618
+ "hash_full_prompts": "bd49ce8a930e3e78",
2619
+ "hash_input_tokens": "44722cbe1d85e636",
2620
+ "hash_cont_tokens": "f756093278ebb83e"
2621
+ },
2622
+ "truncated": 0,
2623
+ "non_truncated": 238,
2624
+ "padded": 908,
2625
+ "non_padded": 44,
2626
+ "effective_few_shots": 5.0,
2627
+ "num_truncated_few_shots": 0
2628
+ },
2629
+ "leaderboard|mmlu:high_school_physics|5": {
2630
+ "hashes": {
2631
+ "hash_examples": "dc3ce06378548565",
2632
+ "hash_full_prompts": "3904af994b32b959",
2633
+ "hash_input_tokens": "2132f616c2587937",
2634
+ "hash_cont_tokens": "9cf883ebf1c82176"
2635
+ },
2636
+ "truncated": 0,
2637
+ "non_truncated": 151,
2638
+ "padded": 604,
2639
+ "non_padded": 0,
2640
+ "effective_few_shots": 5.0,
2641
+ "num_truncated_few_shots": 0
2642
+ },
2643
+ "leaderboard|mmlu:high_school_psychology|5": {
2644
+ "hashes": {
2645
+ "hash_examples": "c8d1d98a40e11f2f",
2646
+ "hash_full_prompts": "d3a4d5dd3f3513f8",
2647
+ "hash_input_tokens": "6cc69cf1a89e4a88",
2648
+ "hash_cont_tokens": "bda0f77331ebb21a"
2649
+ },
2650
+ "truncated": 0,
2651
+ "non_truncated": 545,
2652
+ "padded": 2178,
2653
+ "non_padded": 2,
2654
+ "effective_few_shots": 5.0,
2655
+ "num_truncated_few_shots": 0
2656
+ },
2657
+ "leaderboard|mmlu:high_school_statistics|5": {
2658
+ "hashes": {
2659
+ "hash_examples": "666c8759b98ee4ff",
2660
+ "hash_full_prompts": "1b5599f9d4edc7de",
2661
+ "hash_input_tokens": "60af7a873b579818",
2662
+ "hash_cont_tokens": "4d04f014105a0bad"
2663
+ },
2664
+ "truncated": 0,
2665
+ "non_truncated": 216,
2666
+ "padded": 864,
2667
+ "non_padded": 0,
2668
+ "effective_few_shots": 5.0,
2669
+ "num_truncated_few_shots": 0
2670
+ },
2671
+ "leaderboard|mmlu:high_school_us_history|5": {
2672
+ "hashes": {
2673
+ "hash_examples": "95fef1c4b7d3f81e",
2674
+ "hash_full_prompts": "001f7e7cc8185618",
2675
+ "hash_input_tokens": "8c2d01a0f291db69",
2676
+ "hash_cont_tokens": "f4590c58f12f2766"
2677
+ },
2678
+ "truncated": 0,
2679
+ "non_truncated": 204,
2680
+ "padded": 816,
2681
+ "non_padded": 0,
2682
+ "effective_few_shots": 5.0,
2683
+ "num_truncated_few_shots": 0
2684
+ },
2685
+ "leaderboard|mmlu:high_school_world_history|5": {
2686
+ "hashes": {
2687
+ "hash_examples": "7e5085b6184b0322",
2688
+ "hash_full_prompts": "6a5c2a43cf7c6cb1",
2689
+ "hash_input_tokens": "612ed95e43bc21b5",
2690
+ "hash_cont_tokens": "db6bcddd891df5d9"
2691
+ },
2692
+ "truncated": 0,
2693
+ "non_truncated": 237,
2694
+ "padded": 948,
2695
+ "non_padded": 0,
2696
+ "effective_few_shots": 5.0,
2697
+ "num_truncated_few_shots": 0
2698
+ },
2699
+ "leaderboard|mmlu:human_aging|5": {
2700
+ "hashes": {
2701
+ "hash_examples": "c17333e7c7c10797",
2702
+ "hash_full_prompts": "a3ad8e679fe07bef",
2703
+ "hash_input_tokens": "4c948b081b40ba31",
2704
+ "hash_cont_tokens": "25cec8d640319105"
2705
+ },
2706
+ "truncated": 0,
2707
+ "non_truncated": 223,
2708
+ "padded": 892,
2709
+ "non_padded": 0,
2710
+ "effective_few_shots": 5.0,
2711
+ "num_truncated_few_shots": 0
2712
+ },
2713
+ "leaderboard|mmlu:human_sexuality|5": {
2714
+ "hashes": {
2715
+ "hash_examples": "4edd1e9045df5e3d",
2716
+ "hash_full_prompts": "3389ffb95929a661",
2717
+ "hash_input_tokens": "9e649cc80ef9f2fe",
2718
+ "hash_cont_tokens": "6778302b4a10b645"
2719
+ },
2720
+ "truncated": 0,
2721
+ "non_truncated": 131,
2722
+ "padded": 524,
2723
+ "non_padded": 0,
2724
+ "effective_few_shots": 5.0,
2725
+ "num_truncated_few_shots": 0
2726
+ },
2727
+ "leaderboard|mmlu:international_law|5": {
2728
+ "hashes": {
2729
+ "hash_examples": "db2fa00d771a062a",
2730
+ "hash_full_prompts": "104f48c64f6f9622",
2731
+ "hash_input_tokens": "c51db1d4a2a87eed",
2732
+ "hash_cont_tokens": "9eb54e1a46032749"
2733
+ },
2734
+ "truncated": 0,
2735
+ "non_truncated": 121,
2736
+ "padded": 484,
2737
+ "non_padded": 0,
2738
+ "effective_few_shots": 5.0,
2739
+ "num_truncated_few_shots": 0
2740
+ },
2741
+ "leaderboard|mmlu:jurisprudence|5": {
2742
+ "hashes": {
2743
+ "hash_examples": "e956f86b124076fe",
2744
+ "hash_full_prompts": "49295d36462ddc97",
2745
+ "hash_input_tokens": "a779a1b30bc13f30",
2746
+ "hash_cont_tokens": "f17d9a372cfd66b1"
2747
+ },
2748
+ "truncated": 0,
2749
+ "non_truncated": 108,
2750
+ "padded": 420,
2751
+ "non_padded": 12,
2752
+ "effective_few_shots": 5.0,
2753
+ "num_truncated_few_shots": 0
2754
+ },
2755
+ "leaderboard|mmlu:logical_fallacies|5": {
2756
+ "hashes": {
2757
+ "hash_examples": "956e0e6365ab79f1",
2758
+ "hash_full_prompts": "b64f452752d5cd23",
2759
+ "hash_input_tokens": "61d99e8d4d4d8652",
2760
+ "hash_cont_tokens": "cf44a68f5bca9a96"
2761
+ },
2762
+ "truncated": 0,
2763
+ "non_truncated": 163,
2764
+ "padded": 648,
2765
+ "non_padded": 4,
2766
+ "effective_few_shots": 5.0,
2767
+ "num_truncated_few_shots": 0
2768
+ },
2769
+ "leaderboard|mmlu:machine_learning|5": {
2770
+ "hashes": {
2771
+ "hash_examples": "397997cc6f4d581e",
2772
+ "hash_full_prompts": "54da136ebd708042",
2773
+ "hash_input_tokens": "11e6731506fcf366",
2774
+ "hash_cont_tokens": "eace00d420f4f32c"
2775
+ },
2776
+ "truncated": 0,
2777
+ "non_truncated": 112,
2778
+ "padded": 448,
2779
+ "non_padded": 0,
2780
+ "effective_few_shots": 5.0,
2781
+ "num_truncated_few_shots": 0
2782
+ },
2783
+ "leaderboard|mmlu:management|5": {
2784
+ "hashes": {
2785
+ "hash_examples": "2bcbe6f6ca63d740",
2786
+ "hash_full_prompts": "a4b864ff27598ba3",
2787
+ "hash_input_tokens": "caffa6e4e80cbd5e",
2788
+ "hash_cont_tokens": "b7c51d0250c252d8"
2789
+ },
2790
+ "truncated": 0,
2791
+ "non_truncated": 103,
2792
+ "padded": 412,
2793
+ "non_padded": 0,
2794
+ "effective_few_shots": 5.0,
2795
+ "num_truncated_few_shots": 0
2796
+ },
2797
+ "leaderboard|mmlu:marketing|5": {
2798
+ "hashes": {
2799
+ "hash_examples": "8ddb20d964a1b065",
2800
+ "hash_full_prompts": "c7183ac32f36104d",
2801
+ "hash_input_tokens": "5cd238ac5e8f19f4",
2802
+ "hash_cont_tokens": "086fb63f8b1d1339"
2803
+ },
2804
+ "truncated": 0,
2805
+ "non_truncated": 234,
2806
+ "padded": 924,
2807
+ "non_padded": 12,
2808
+ "effective_few_shots": 5.0,
2809
+ "num_truncated_few_shots": 0
2810
+ },
2811
+ "leaderboard|mmlu:medical_genetics|5": {
2812
+ "hashes": {
2813
+ "hash_examples": "182a71f4763d2cea",
2814
+ "hash_full_prompts": "c17b0a66e3027303",
2815
+ "hash_input_tokens": "46c0c8a573b43089",
2816
+ "hash_cont_tokens": "00520b0ec06da34f"
2817
+ },
2818
+ "truncated": 0,
2819
+ "non_truncated": 100,
2820
+ "padded": 400,
2821
+ "non_padded": 0,
2822
+ "effective_few_shots": 5.0,
2823
+ "num_truncated_few_shots": 0
2824
+ },
2825
+ "leaderboard|mmlu:miscellaneous|5": {
2826
+ "hashes": {
2827
+ "hash_examples": "4c404fdbb4ca57fc",
2828
+ "hash_full_prompts": "bc5fa37ce20a2503",
2829
+ "hash_input_tokens": "5327cd4585062ac2",
2830
+ "hash_cont_tokens": "1827274fa6537077"
2831
+ },
2832
+ "truncated": 0,
2833
+ "non_truncated": 783,
2834
+ "padded": 3132,
2835
+ "non_padded": 0,
2836
+ "effective_few_shots": 5.0,
2837
+ "num_truncated_few_shots": 0
2838
+ },
2839
+ "leaderboard|mmlu:moral_disputes|5": {
2840
+ "hashes": {
2841
+ "hash_examples": "60cbd2baa3fea5c9",
2842
+ "hash_full_prompts": "075742051236078f",
2843
+ "hash_input_tokens": "a2c9da202f686839",
2844
+ "hash_cont_tokens": "472c223f6f28cfc7"
2845
+ },
2846
+ "truncated": 0,
2847
+ "non_truncated": 346,
2848
+ "padded": 1384,
2849
+ "non_padded": 0,
2850
+ "effective_few_shots": 5.0,
2851
+ "num_truncated_few_shots": 0
2852
+ },
2853
+ "leaderboard|mmlu:moral_scenarios|5": {
2854
+ "hashes": {
2855
+ "hash_examples": "fd8b0431fbdd75ef",
2856
+ "hash_full_prompts": "533c4700637599a2",
2857
+ "hash_input_tokens": "9a1a9f3900b372e6",
2858
+ "hash_cont_tokens": "e90dade00a092f9e"
2859
+ },
2860
+ "truncated": 0,
2861
+ "non_truncated": 895,
2862
+ "padded": 3567,
2863
+ "non_padded": 13,
2864
+ "effective_few_shots": 5.0,
2865
+ "num_truncated_few_shots": 0
2866
+ },
2867
+ "leaderboard|mmlu:nutrition|5": {
2868
+ "hashes": {
2869
+ "hash_examples": "71e55e2b829b6528",
2870
+ "hash_full_prompts": "02b6877dc5a603a6",
2871
+ "hash_input_tokens": "dd91fec063272e23",
2872
+ "hash_cont_tokens": "128e0ec97d96b165"
2873
+ },
2874
+ "truncated": 0,
2875
+ "non_truncated": 306,
2876
+ "padded": 1224,
2877
+ "non_padded": 0,
2878
+ "effective_few_shots": 5.0,
2879
+ "num_truncated_few_shots": 0
2880
+ },
2881
+ "leaderboard|mmlu:philosophy|5": {
2882
+ "hashes": {
2883
+ "hash_examples": "a6d489a8d208fa4b",
2884
+ "hash_full_prompts": "0e65b5f40a9ceb20",
2885
+ "hash_input_tokens": "2255e15265a7d96a",
2886
+ "hash_cont_tokens": "cbfd7829a3e0f082"
2887
+ },
2888
+ "truncated": 0,
2889
+ "non_truncated": 311,
2890
+ "padded": 1244,
2891
+ "non_padded": 0,
2892
+ "effective_few_shots": 5.0,
2893
+ "num_truncated_few_shots": 0
2894
+ },
2895
+ "leaderboard|mmlu:prehistory|5": {
2896
+ "hashes": {
2897
+ "hash_examples": "6cc50f032a19acaa",
2898
+ "hash_full_prompts": "e838e60749e4a598",
2899
+ "hash_input_tokens": "1b9b906efbcc97fd",
2900
+ "hash_cont_tokens": "9c0cf5a2f71afa7e"
2901
+ },
2902
+ "truncated": 0,
2903
+ "non_truncated": 324,
2904
+ "padded": 1284,
2905
+ "non_padded": 12,
2906
+ "effective_few_shots": 5.0,
2907
+ "num_truncated_few_shots": 0
2908
+ },
2909
+ "leaderboard|mmlu:professional_accounting|5": {
2910
+ "hashes": {
2911
+ "hash_examples": "50f57ab32f5f6cea",
2912
+ "hash_full_prompts": "9abf7319f68b7ba8",
2913
+ "hash_input_tokens": "d42c8275cd4e10e1",
2914
+ "hash_cont_tokens": "50f011c2453517ee"
2915
+ },
2916
+ "truncated": 0,
2917
+ "non_truncated": 282,
2918
+ "padded": 1128,
2919
+ "non_padded": 0,
2920
+ "effective_few_shots": 5.0,
2921
+ "num_truncated_few_shots": 0
2922
+ },
2923
+ "leaderboard|mmlu:professional_law|5": {
2924
+ "hashes": {
2925
+ "hash_examples": "a8fdc85c64f4b215",
2926
+ "hash_full_prompts": "4074faf1eaedcfda",
2927
+ "hash_input_tokens": "215c854d27e741b8",
2928
+ "hash_cont_tokens": "73527e852c24186c"
2929
+ },
2930
+ "truncated": 0,
2931
+ "non_truncated": 1534,
2932
+ "padded": 6136,
2933
+ "non_padded": 0,
2934
+ "effective_few_shots": 5.0,
2935
+ "num_truncated_few_shots": 0
2936
+ },
2937
+ "leaderboard|mmlu:professional_medicine|5": {
2938
+ "hashes": {
2939
+ "hash_examples": "c373a28a3050a73a",
2940
+ "hash_full_prompts": "e72202fc20fcab70",
2941
+ "hash_input_tokens": "5a6e9aaaaea83544",
2942
+ "hash_cont_tokens": "ceb7af5e2e789abc"
2943
+ },
2944
+ "truncated": 0,
2945
+ "non_truncated": 272,
2946
+ "padded": 1088,
2947
+ "non_padded": 0,
2948
+ "effective_few_shots": 5.0,
2949
+ "num_truncated_few_shots": 0
2950
+ },
2951
+ "leaderboard|mmlu:professional_psychology|5": {
2952
+ "hashes": {
2953
+ "hash_examples": "bf5254fe818356af",
2954
+ "hash_full_prompts": "4dcb71c9ef602791",
2955
+ "hash_input_tokens": "316d0ba731b0de4f",
2956
+ "hash_cont_tokens": "8cfdced8a9667380"
2957
+ },
2958
+ "truncated": 0,
2959
+ "non_truncated": 612,
2960
+ "padded": 2428,
2961
+ "non_padded": 20,
2962
+ "effective_few_shots": 5.0,
2963
+ "num_truncated_few_shots": 0
2964
+ },
2965
+ "leaderboard|mmlu:public_relations|5": {
2966
+ "hashes": {
2967
+ "hash_examples": "b66d52e28e7d14e0",
2968
+ "hash_full_prompts": "c6050b1748185950",
2969
+ "hash_input_tokens": "2ba1d90c95e19dce",
2970
+ "hash_cont_tokens": "f8327461a9cc5123"
2971
+ },
2972
+ "truncated": 0,
2973
+ "non_truncated": 110,
2974
+ "padded": 436,
2975
+ "non_padded": 4,
2976
+ "effective_few_shots": 5.0,
2977
+ "num_truncated_few_shots": 0
2978
+ },
2979
+ "leaderboard|mmlu:security_studies|5": {
2980
+ "hashes": {
2981
+ "hash_examples": "514c14feaf000ad9",
2982
+ "hash_full_prompts": "4c6786915b670d03",
2983
+ "hash_input_tokens": "b92f71eccf4f89bf",
2984
+ "hash_cont_tokens": "c30b0c4d52c2875d"
2985
+ },
2986
+ "truncated": 0,
2987
+ "non_truncated": 245,
2988
+ "padded": 980,
2989
+ "non_padded": 0,
2990
+ "effective_few_shots": 5.0,
2991
+ "num_truncated_few_shots": 0
2992
+ },
2993
+ "leaderboard|mmlu:sociology|5": {
2994
+ "hashes": {
2995
+ "hash_examples": "f6c9bc9d18c80870",
2996
+ "hash_full_prompts": "a2e9a27e985a4e9b",
2997
+ "hash_input_tokens": "e821334ab55c0d44",
2998
+ "hash_cont_tokens": "eef4bd16d536fbd6"
2999
+ },
3000
+ "truncated": 0,
3001
+ "non_truncated": 201,
3002
+ "padded": 804,
3003
+ "non_padded": 0,
3004
+ "effective_few_shots": 5.0,
3005
+ "num_truncated_few_shots": 0
3006
+ },
3007
+ "leaderboard|mmlu:us_foreign_policy|5": {
3008
+ "hashes": {
3009
+ "hash_examples": "ed7b78629db6678f",
3010
+ "hash_full_prompts": "46d0986398662d59",
3011
+ "hash_input_tokens": "9f6b40a7b6b8a3b2",
3012
+ "hash_cont_tokens": "00520b0ec06da34f"
3013
+ },
3014
+ "truncated": 0,
3015
+ "non_truncated": 100,
3016
+ "padded": 400,
3017
+ "non_padded": 0,
3018
+ "effective_few_shots": 5.0,
3019
+ "num_truncated_few_shots": 0
3020
+ },
3021
+ "leaderboard|mmlu:virology|5": {
3022
+ "hashes": {
3023
+ "hash_examples": "bc52ffdc3f9b994a",
3024
+ "hash_full_prompts": "6b591e3983159283",
3025
+ "hash_input_tokens": "d7c6d39e149defc9",
3026
+ "hash_cont_tokens": "f5fc195e049353c0"
3027
+ },
3028
+ "truncated": 0,
3029
+ "non_truncated": 166,
3030
+ "padded": 664,
3031
+ "non_padded": 0,
3032
+ "effective_few_shots": 5.0,
3033
+ "num_truncated_few_shots": 0
3034
+ },
3035
+ "leaderboard|mmlu:world_religions|5": {
3036
+ "hashes": {
3037
+ "hash_examples": "ecdb4a4f94f62930",
3038
+ "hash_full_prompts": "8c2e37a02519af15",
3039
+ "hash_input_tokens": "80b87b6e634441d6",
3040
+ "hash_cont_tokens": "ada548665e87b1e0"
3041
+ },
3042
+ "truncated": 0,
3043
+ "non_truncated": 171,
3044
+ "padded": 684,
3045
+ "non_padded": 0,
3046
+ "effective_few_shots": 5.0,
3047
+ "num_truncated_few_shots": 0
3048
+ }
3049
+ },
3050
+ "summary_general": {
3051
+ "hashes": {
3052
+ "hash_examples": "341a076d0beb7048",
3053
+ "hash_full_prompts": "7c1eeddf962b8fc9",
3054
+ "hash_input_tokens": "98bef9715b6ebf74",
3055
+ "hash_cont_tokens": "3672212ca582e2d0"
3056
+ },
3057
+ "truncated": 0,
3058
+ "non_truncated": 14042,
3059
+ "padded": 56038,
3060
+ "non_padded": 130,
3061
+ "num_truncated_few_shots": 0
3062
+ }
3063
+ }
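The counts in "summary_tasks" and "summary_general" above appear to follow a simple relation: `padded + non_padded` equals four times `non_truncated`. A minimal sketch of that check follows, using a few entries copied from this file. The reading that each of the four MMLU answer choices is scored as one request is an assumption, not something the file states, and `requests_match_docs` is a hypothetical helper name used only for illustration.

```python
# Number of answer choices per MMLU question.
# Assumption: each choice is counted as one (padded or non-padded) request.
NUM_CHOICES = 4

# A few entries copied from "summary_tasks" and "summary_general" in this file.
entries = {
    "leaderboard|mmlu:abstract_algebra|5": {
        "non_truncated": 100, "padded": 400, "non_padded": 0,
    },
    "leaderboard|mmlu:high_school_microeconomics|5": {
        "non_truncated": 238, "padded": 908, "non_padded": 44,
    },
    "leaderboard|mmlu:moral_scenarios|5": {
        "non_truncated": 895, "padded": 3567, "non_padded": 13,
    },
    "summary_general": {
        "non_truncated": 14042, "padded": 56038, "non_padded": 130,
    },
}


def requests_match_docs(entry, num_choices=NUM_CHOICES):
    """Return True when padded + non_padded equals num_choices * non_truncated."""
    total_requests = entry["padded"] + entry["non_padded"]
    return total_requests == num_choices * entry["non_truncated"]


# Check every copied entry and report the outcome per key.
results = {name: requests_match_docs(entry) for name, entry in entries.items()}
for name, ok in results.items():
    print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
```

The same function could be run over the full parsed JSON (for example after `json.load`) to flag any task whose request counts do not line up with its document count.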