lewtun HF Staff committed on
Commit fb677f6 · verified · 1 Parent(s): 6d168ad

Upload eval_results/HuggingFaceH4/mistral-7b-ift/v50.0/mmlu/results_2024-03-26T09-43-23.504759.json with huggingface_hub
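
A note on reading this file: each task entry reports `acc` and `acc_stderr`, and each `config_tasks` entry reports `effective_num_docs`. The sketch below is an assumption, not a confirmed description of lighteval's implementation: it treats `acc_stderr` as the sample standard error of a mean of 0/1 outcomes, `sqrt(acc * (1 - acc) / (n - 1))`, and checks that guess against three entries copied from this file. The helper name `binomial_stderr` is hypothetical.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Sample standard error of the mean of n 0/1 outcomes.

    Assumption: n - 1 in the denominator (sample variance), which is what
    the reported values in this file appear to be consistent with.
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# (acc, acc_stderr, effective_num_docs) copied from the results file.
reported = {
    "abstract_algebra": (0.33, 0.04725815626252606, 100),
    "anatomy": (0.6074074074074074, 0.04218506215368879, 135),
    "college_computer_science": (0.55, 0.05, 100),
}

# Compare the recomputed value with the reported one for each task.
max_abs_diff = 0.0
for task, (acc, stderr, n) in reported.items():
    recomputed = binomial_stderr(acc, n)
    max_abs_diff = max(max_abs_diff, abs(recomputed - stderr))
    print(f"{task}: reported={stderr:.6f} recomputed={recomputed:.6f}")
```

If the assumption holds, the recomputed and reported values agree to several decimal places, which is a quick sanity check that `acc` and `effective_num_docs` in an uploaded results file belong to the same run.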

eval_results/HuggingFaceH4/mistral-7b-ift/v50.0/mmlu/results_2024-03-26T09-43-23.504759.json ADDED
@@ -0,0 +1,3063 @@
1
+ {
2
+ "config_general": {
3
+ "lighteval_sha": "?",
4
+ "num_fewshot_seeds": 1,
5
+ "override_batch_size": 1,
6
+ "max_samples": null,
7
+ "job_id": "",
8
+ "start_time": 5774060.136160472,
9
+ "end_time": 5774943.639989422,
10
+ "total_evaluation_time_secondes": "883.503828949295",
11
+ "model_name": "HuggingFaceH4/mistral-7b-ift",
12
+ "model_sha": "3462e12c9a378fe585a14a3524a11eeca56e29ee",
13
+ "model_dtype": "torch.bfloat16",
14
+ "model_size": "13.99 GB",
15
+ "config": null
16
+ },
17
+ "results": {
18
+ "leaderboard|mmlu:abstract_algebra|5": {
19
+ "acc": 0.33,
20
+ "acc_stderr": 0.04725815626252606
21
+ },
22
+ "leaderboard|mmlu:anatomy|5": {
23
+ "acc": 0.6074074074074074,
24
+ "acc_stderr": 0.04218506215368879
25
+ },
26
+ "leaderboard|mmlu:astronomy|5": {
27
+ "acc": 0.625,
28
+ "acc_stderr": 0.039397364351956274
29
+ },
30
+ "leaderboard|mmlu:business_ethics|5": {
31
+ "acc": 0.53,
32
+ "acc_stderr": 0.05016135580465919
33
+ },
34
+ "leaderboard|mmlu:clinical_knowledge|5": {
35
+ "acc": 0.690566037735849,
36
+ "acc_stderr": 0.028450154794118637
37
+ },
38
+ "leaderboard|mmlu:college_biology|5": {
39
+ "acc": 0.6875,
40
+ "acc_stderr": 0.038760854559127644
41
+ },
42
+ "leaderboard|mmlu:college_chemistry|5": {
43
+ "acc": 0.46,
44
+ "acc_stderr": 0.05009082659620333
45
+ },
46
+ "leaderboard|mmlu:college_computer_science|5": {
47
+ "acc": 0.55,
48
+ "acc_stderr": 0.05
49
+ },
50
+ "leaderboard|mmlu:college_mathematics|5": {
51
+ "acc": 0.32,
52
+ "acc_stderr": 0.04688261722621504
53
+ },
54
+ "leaderboard|mmlu:college_medicine|5": {
55
+ "acc": 0.6184971098265896,
56
+ "acc_stderr": 0.03703851193099521
57
+ },
58
+ "leaderboard|mmlu:college_physics|5": {
59
+ "acc": 0.28431372549019607,
60
+ "acc_stderr": 0.04488482852329017
61
+ },
62
+ "leaderboard|mmlu:computer_security|5": {
63
+ "acc": 0.79,
64
+ "acc_stderr": 0.04093601807403326
65
+ },
66
+ "leaderboard|mmlu:conceptual_physics|5": {
67
+ "acc": 0.502127659574468,
68
+ "acc_stderr": 0.03268572658667492
69
+ },
70
+ "leaderboard|mmlu:econometrics|5": {
71
+ "acc": 0.41228070175438597,
72
+ "acc_stderr": 0.046306532033665956
73
+ },
74
+ "leaderboard|mmlu:electrical_engineering|5": {
75
+ "acc": 0.503448275862069,
76
+ "acc_stderr": 0.04166567577101579
77
+ },
78
+ "leaderboard|mmlu:elementary_mathematics|5": {
79
+ "acc": 0.3835978835978836,
80
+ "acc_stderr": 0.0250437573185202
81
+ },
82
+ "leaderboard|mmlu:formal_logic|5": {
83
+ "acc": 0.40476190476190477,
84
+ "acc_stderr": 0.04390259265377562
85
+ },
86
+ "leaderboard|mmlu:global_facts|5": {
87
+ "acc": 0.37,
88
+ "acc_stderr": 0.04852365870939099
89
+ },
90
+ "leaderboard|mmlu:high_school_biology|5": {
91
+ "acc": 0.7064516129032258,
92
+ "acc_stderr": 0.025906087021319295
93
+ },
94
+ "leaderboard|mmlu:high_school_chemistry|5": {
95
+ "acc": 0.4827586206896552,
96
+ "acc_stderr": 0.035158955511656986
97
+ },
98
+ "leaderboard|mmlu:high_school_computer_science|5": {
99
+ "acc": 0.61,
100
+ "acc_stderr": 0.04902071300001974
101
+ },
102
+ "leaderboard|mmlu:high_school_european_history|5": {
103
+ "acc": 0.7454545454545455,
104
+ "acc_stderr": 0.03401506715249039
105
+ },
106
+ "leaderboard|mmlu:high_school_geography|5": {
107
+ "acc": 0.7777777777777778,
108
+ "acc_stderr": 0.02962022787479048
109
+ },
110
+ "leaderboard|mmlu:high_school_government_and_politics|5": {
111
+ "acc": 0.8238341968911918,
112
+ "acc_stderr": 0.02749350424454805
113
+ },
114
+ "leaderboard|mmlu:high_school_macroeconomics|5": {
115
+ "acc": 0.5897435897435898,
116
+ "acc_stderr": 0.024939313906940798
117
+ },
118
+ "leaderboard|mmlu:high_school_mathematics|5": {
119
+ "acc": 0.3,
120
+ "acc_stderr": 0.027940457136228412
121
+ },
122
+ "leaderboard|mmlu:high_school_microeconomics|5": {
123
+ "acc": 0.6218487394957983,
124
+ "acc_stderr": 0.031499305777849054
125
+ },
126
+ "leaderboard|mmlu:high_school_physics|5": {
127
+ "acc": 0.36423841059602646,
128
+ "acc_stderr": 0.03929111781242742
129
+ },
130
+ "leaderboard|mmlu:high_school_psychology|5": {
131
+ "acc": 0.7889908256880734,
132
+ "acc_stderr": 0.01749392240411265
133
+ },
134
+ "leaderboard|mmlu:high_school_statistics|5": {
135
+ "acc": 0.4537037037037037,
136
+ "acc_stderr": 0.03395322726375797
137
+ },
138
+ "leaderboard|mmlu:high_school_us_history|5": {
139
+ "acc": 0.7892156862745098,
140
+ "acc_stderr": 0.0286265479124374
141
+ },
142
+ "leaderboard|mmlu:high_school_world_history|5": {
143
+ "acc": 0.729957805907173,
144
+ "acc_stderr": 0.028900721906293426
145
+ },
146
+ "leaderboard|mmlu:human_aging|5": {
147
+ "acc": 0.6547085201793722,
148
+ "acc_stderr": 0.03191100192835794
149
+ },
150
+ "leaderboard|mmlu:human_sexuality|5": {
151
+ "acc": 0.6793893129770993,
152
+ "acc_stderr": 0.04093329229834278
153
+ },
154
+ "leaderboard|mmlu:international_law|5": {
155
+ "acc": 0.7272727272727273,
156
+ "acc_stderr": 0.04065578140908705
157
+ },
158
+ "leaderboard|mmlu:jurisprudence|5": {
159
+ "acc": 0.7037037037037037,
160
+ "acc_stderr": 0.04414343666854933
161
+ },
162
+ "leaderboard|mmlu:logical_fallacies|5": {
163
+ "acc": 0.7239263803680982,
164
+ "acc_stderr": 0.035123852837050475
165
+ },
166
+ "leaderboard|mmlu:machine_learning|5": {
167
+ "acc": 0.5089285714285714,
168
+ "acc_stderr": 0.04745033255489123
169
+ },
170
+ "leaderboard|mmlu:management|5": {
171
+ "acc": 0.7281553398058253,
172
+ "acc_stderr": 0.044052680241409216
173
+ },
174
+ "leaderboard|mmlu:marketing|5": {
175
+ "acc": 0.8547008547008547,
176
+ "acc_stderr": 0.023086635086841407
177
+ },
178
+ "leaderboard|mmlu:medical_genetics|5": {
179
+ "acc": 0.67,
180
+ "acc_stderr": 0.047258156262526094
181
+ },
182
+ "leaderboard|mmlu:miscellaneous|5": {
183
+ "acc": 0.8007662835249042,
184
+ "acc_stderr": 0.014283378044296415
185
+ },
186
+ "leaderboard|mmlu:moral_disputes|5": {
187
+ "acc": 0.615606936416185,
188
+ "acc_stderr": 0.02618966696627204
189
+ },
190
+ "leaderboard|mmlu:moral_scenarios|5": {
191
+ "acc": 0.34972067039106147,
192
+ "acc_stderr": 0.015949308790233645
193
+ },
194
+ "leaderboard|mmlu:nutrition|5": {
195
+ "acc": 0.6797385620915033,
196
+ "acc_stderr": 0.026716118380156847
197
+ },
198
+ "leaderboard|mmlu:philosophy|5": {
199
+ "acc": 0.6784565916398714,
200
+ "acc_stderr": 0.026527724079528872
201
+ },
202
+ "leaderboard|mmlu:prehistory|5": {
203
+ "acc": 0.6820987654320988,
204
+ "acc_stderr": 0.02591006352824088
205
+ },
206
+ "leaderboard|mmlu:professional_accounting|5": {
207
+ "acc": 0.4574468085106383,
208
+ "acc_stderr": 0.029719281272236844
209
+ },
210
+ "leaderboard|mmlu:professional_law|5": {
211
+ "acc": 0.42894393741851367,
212
+ "acc_stderr": 0.012640625443067361
213
+ },
214
+ "leaderboard|mmlu:professional_medicine|5": {
215
+ "acc": 0.6066176470588235,
216
+ "acc_stderr": 0.029674288281311155
217
+ },
218
+ "leaderboard|mmlu:professional_psychology|5": {
219
+ "acc": 0.6111111111111112,
220
+ "acc_stderr": 0.019722058939618065
221
+ },
222
+ "leaderboard|mmlu:public_relations|5": {
223
+ "acc": 0.6454545454545455,
224
+ "acc_stderr": 0.045820048415054174
225
+ },
226
+ "leaderboard|mmlu:security_studies|5": {
227
+ "acc": 0.6081632653061224,
228
+ "acc_stderr": 0.03125127591089165
229
+ },
230
+ "leaderboard|mmlu:sociology|5": {
231
+ "acc": 0.8009950248756219,
232
+ "acc_stderr": 0.028231365092758406
233
+ },
234
+ "leaderboard|mmlu:us_foreign_policy|5": {
235
+ "acc": 0.79,
236
+ "acc_stderr": 0.040936018074033256
237
+ },
238
+ "leaderboard|mmlu:virology|5": {
239
+ "acc": 0.5060240963855421,
240
+ "acc_stderr": 0.03892212195333045
241
+ },
242
+ "leaderboard|mmlu:world_religions|5": {
243
+ "acc": 0.8304093567251462,
244
+ "acc_stderr": 0.02878210810540171
245
+ },
246
+ "leaderboard|mmlu:_average|5": {
247
+ "acc": 0.5999265830511221,
248
+ "acc_stderr": 0.03480567513751256
249
+ }
250
+ },
251
+ "versions": {
252
+ "leaderboard|mmlu:abstract_algebra|5": 0,
253
+ "leaderboard|mmlu:anatomy|5": 0,
254
+ "leaderboard|mmlu:astronomy|5": 0,
255
+ "leaderboard|mmlu:business_ethics|5": 0,
256
+ "leaderboard|mmlu:clinical_knowledge|5": 0,
257
+ "leaderboard|mmlu:college_biology|5": 0,
258
+ "leaderboard|mmlu:college_chemistry|5": 0,
259
+ "leaderboard|mmlu:college_computer_science|5": 0,
260
+ "leaderboard|mmlu:college_mathematics|5": 0,
261
+ "leaderboard|mmlu:college_medicine|5": 0,
262
+ "leaderboard|mmlu:college_physics|5": 0,
263
+ "leaderboard|mmlu:computer_security|5": 0,
264
+ "leaderboard|mmlu:conceptual_physics|5": 0,
265
+ "leaderboard|mmlu:econometrics|5": 0,
266
+ "leaderboard|mmlu:electrical_engineering|5": 0,
267
+ "leaderboard|mmlu:elementary_mathematics|5": 0,
268
+ "leaderboard|mmlu:formal_logic|5": 0,
269
+ "leaderboard|mmlu:global_facts|5": 0,
270
+ "leaderboard|mmlu:high_school_biology|5": 0,
271
+ "leaderboard|mmlu:high_school_chemistry|5": 0,
272
+ "leaderboard|mmlu:high_school_computer_science|5": 0,
273
+ "leaderboard|mmlu:high_school_european_history|5": 0,
274
+ "leaderboard|mmlu:high_school_geography|5": 0,
275
+ "leaderboard|mmlu:high_school_government_and_politics|5": 0,
276
+ "leaderboard|mmlu:high_school_macroeconomics|5": 0,
277
+ "leaderboard|mmlu:high_school_mathematics|5": 0,
278
+ "leaderboard|mmlu:high_school_microeconomics|5": 0,
279
+ "leaderboard|mmlu:high_school_physics|5": 0,
280
+ "leaderboard|mmlu:high_school_psychology|5": 0,
281
+ "leaderboard|mmlu:high_school_statistics|5": 0,
282
+ "leaderboard|mmlu:high_school_us_history|5": 0,
283
+ "leaderboard|mmlu:high_school_world_history|5": 0,
284
+ "leaderboard|mmlu:human_aging|5": 0,
285
+ "leaderboard|mmlu:human_sexuality|5": 0,
286
+ "leaderboard|mmlu:international_law|5": 0,
287
+ "leaderboard|mmlu:jurisprudence|5": 0,
288
+ "leaderboard|mmlu:logical_fallacies|5": 0,
289
+ "leaderboard|mmlu:machine_learning|5": 0,
290
+ "leaderboard|mmlu:management|5": 0,
291
+ "leaderboard|mmlu:marketing|5": 0,
292
+ "leaderboard|mmlu:medical_genetics|5": 0,
293
+ "leaderboard|mmlu:miscellaneous|5": 0,
294
+ "leaderboard|mmlu:moral_disputes|5": 0,
295
+ "leaderboard|mmlu:moral_scenarios|5": 0,
296
+ "leaderboard|mmlu:nutrition|5": 0,
297
+ "leaderboard|mmlu:philosophy|5": 0,
298
+ "leaderboard|mmlu:prehistory|5": 0,
299
+ "leaderboard|mmlu:professional_accounting|5": 0,
300
+ "leaderboard|mmlu:professional_law|5": 0,
301
+ "leaderboard|mmlu:professional_medicine|5": 0,
302
+ "leaderboard|mmlu:professional_psychology|5": 0,
303
+ "leaderboard|mmlu:public_relations|5": 0,
304
+ "leaderboard|mmlu:security_studies|5": 0,
305
+ "leaderboard|mmlu:sociology|5": 0,
306
+ "leaderboard|mmlu:us_foreign_policy|5": 0,
307
+ "leaderboard|mmlu:virology|5": 0,
308
+ "leaderboard|mmlu:world_religions|5": 0
309
+ },
310
+ "config_tasks": {
311
+ "leaderboard|mmlu:abstract_algebra": {
312
+ "name": "mmlu:abstract_algebra",
313
+ "prompt_function": "mmlu_harness",
314
+ "hf_repo": "lighteval/mmlu",
315
+ "hf_subset": "abstract_algebra",
316
+ "metric": [
317
+ "loglikelihood_acc"
318
+ ],
319
+ "hf_avail_splits": [
320
+ "auxiliary_train",
321
+ "test",
322
+ "validation",
323
+ "dev"
324
+ ],
325
+ "evaluation_splits": [
326
+ "test"
327
+ ],
328
+ "few_shots_split": "dev",
329
+ "few_shots_select": "sequential",
330
+ "generation_size": 1,
331
+ "stop_sequence": [
332
+ "\n"
333
+ ],
334
+ "output_regex": null,
335
+ "frozen": false,
336
+ "suite": [
337
+ "leaderboard",
338
+ "mmlu"
339
+ ],
340
+ "original_num_docs": 100,
341
+ "effective_num_docs": 100,
342
+ "trust_dataset": true,
343
+ "must_remove_duplicate_docs": null
344
+ },
345
+ "leaderboard|mmlu:anatomy": {
346
+ "name": "mmlu:anatomy",
347
+ "prompt_function": "mmlu_harness",
348
+ "hf_repo": "lighteval/mmlu",
349
+ "hf_subset": "anatomy",
350
+ "metric": [
351
+ "loglikelihood_acc"
352
+ ],
353
+ "hf_avail_splits": [
354
+ "auxiliary_train",
355
+ "test",
356
+ "validation",
357
+ "dev"
358
+ ],
359
+ "evaluation_splits": [
360
+ "test"
361
+ ],
362
+ "few_shots_split": "dev",
363
+ "few_shots_select": "sequential",
364
+ "generation_size": 1,
365
+ "stop_sequence": [
366
+ "\n"
367
+ ],
368
+ "output_regex": null,
369
+ "frozen": false,
370
+ "suite": [
371
+ "leaderboard",
372
+ "mmlu"
373
+ ],
374
+ "original_num_docs": 135,
375
+ "effective_num_docs": 135,
376
+ "trust_dataset": true,
377
+ "must_remove_duplicate_docs": null
378
+ },
379
+ "leaderboard|mmlu:astronomy": {
380
+ "name": "mmlu:astronomy",
381
+ "prompt_function": "mmlu_harness",
382
+ "hf_repo": "lighteval/mmlu",
383
+ "hf_subset": "astronomy",
384
+ "metric": [
385
+ "loglikelihood_acc"
386
+ ],
387
+ "hf_avail_splits": [
388
+ "auxiliary_train",
389
+ "test",
390
+ "validation",
391
+ "dev"
392
+ ],
393
+ "evaluation_splits": [
394
+ "test"
395
+ ],
396
+ "few_shots_split": "dev",
397
+ "few_shots_select": "sequential",
398
+ "generation_size": 1,
399
+ "stop_sequence": [
400
+ "\n"
401
+ ],
402
+ "output_regex": null,
403
+ "frozen": false,
404
+ "suite": [
405
+ "leaderboard",
406
+ "mmlu"
407
+ ],
408
+ "original_num_docs": 152,
409
+ "effective_num_docs": 152,
410
+ "trust_dataset": true,
411
+ "must_remove_duplicate_docs": null
412
+ },
413
+ "leaderboard|mmlu:business_ethics": {
414
+ "name": "mmlu:business_ethics",
415
+ "prompt_function": "mmlu_harness",
416
+ "hf_repo": "lighteval/mmlu",
417
+ "hf_subset": "business_ethics",
418
+ "metric": [
419
+ "loglikelihood_acc"
420
+ ],
421
+ "hf_avail_splits": [
422
+ "auxiliary_train",
423
+ "test",
424
+ "validation",
425
+ "dev"
426
+ ],
427
+ "evaluation_splits": [
428
+ "test"
429
+ ],
430
+ "few_shots_split": "dev",
431
+ "few_shots_select": "sequential",
432
+ "generation_size": 1,
433
+ "stop_sequence": [
434
+ "\n"
435
+ ],
436
+ "output_regex": null,
437
+ "frozen": false,
438
+ "suite": [
439
+ "leaderboard",
440
+ "mmlu"
441
+ ],
442
+ "original_num_docs": 100,
443
+ "effective_num_docs": 100,
444
+ "trust_dataset": true,
445
+ "must_remove_duplicate_docs": null
446
+ },
447
+ "leaderboard|mmlu:clinical_knowledge": {
448
+ "name": "mmlu:clinical_knowledge",
449
+ "prompt_function": "mmlu_harness",
450
+ "hf_repo": "lighteval/mmlu",
451
+ "hf_subset": "clinical_knowledge",
452
+ "metric": [
453
+ "loglikelihood_acc"
454
+ ],
455
+ "hf_avail_splits": [
456
+ "auxiliary_train",
457
+ "test",
458
+ "validation",
459
+ "dev"
460
+ ],
461
+ "evaluation_splits": [
462
+ "test"
463
+ ],
464
+ "few_shots_split": "dev",
465
+ "few_shots_select": "sequential",
466
+ "generation_size": 1,
467
+ "stop_sequence": [
468
+ "\n"
469
+ ],
470
+ "output_regex": null,
471
+ "frozen": false,
472
+ "suite": [
473
+ "leaderboard",
474
+ "mmlu"
475
+ ],
476
+ "original_num_docs": 265,
477
+ "effective_num_docs": 265,
478
+ "trust_dataset": true,
479
+ "must_remove_duplicate_docs": null
480
+ },
481
+ "leaderboard|mmlu:college_biology": {
482
+ "name": "mmlu:college_biology",
483
+ "prompt_function": "mmlu_harness",
484
+ "hf_repo": "lighteval/mmlu",
485
+ "hf_subset": "college_biology",
486
+ "metric": [
487
+ "loglikelihood_acc"
488
+ ],
489
+ "hf_avail_splits": [
490
+ "auxiliary_train",
491
+ "test",
492
+ "validation",
493
+ "dev"
494
+ ],
495
+ "evaluation_splits": [
496
+ "test"
497
+ ],
498
+ "few_shots_split": "dev",
499
+ "few_shots_select": "sequential",
500
+ "generation_size": 1,
501
+ "stop_sequence": [
502
+ "\n"
503
+ ],
504
+ "output_regex": null,
505
+ "frozen": false,
506
+ "suite": [
507
+ "leaderboard",
508
+ "mmlu"
509
+ ],
510
+ "original_num_docs": 144,
511
+ "effective_num_docs": 144,
512
+ "trust_dataset": true,
513
+ "must_remove_duplicate_docs": null
514
+ },
515
+ "leaderboard|mmlu:college_chemistry": {
516
+ "name": "mmlu:college_chemistry",
517
+ "prompt_function": "mmlu_harness",
518
+ "hf_repo": "lighteval/mmlu",
519
+ "hf_subset": "college_chemistry",
520
+ "metric": [
521
+ "loglikelihood_acc"
522
+ ],
523
+ "hf_avail_splits": [
524
+ "auxiliary_train",
525
+ "test",
526
+ "validation",
527
+ "dev"
528
+ ],
529
+ "evaluation_splits": [
530
+ "test"
531
+ ],
532
+ "few_shots_split": "dev",
533
+ "few_shots_select": "sequential",
534
+ "generation_size": 1,
535
+ "stop_sequence": [
536
+ "\n"
537
+ ],
538
+ "output_regex": null,
539
+ "frozen": false,
540
+ "suite": [
541
+ "leaderboard",
542
+ "mmlu"
543
+ ],
544
+ "original_num_docs": 100,
545
+ "effective_num_docs": 100,
546
+ "trust_dataset": true,
547
+ "must_remove_duplicate_docs": null
548
+ },
549
+ "leaderboard|mmlu:college_computer_science": {
550
+ "name": "mmlu:college_computer_science",
551
+ "prompt_function": "mmlu_harness",
552
+ "hf_repo": "lighteval/mmlu",
553
+ "hf_subset": "college_computer_science",
554
+ "metric": [
555
+ "loglikelihood_acc"
556
+ ],
557
+ "hf_avail_splits": [
558
+ "auxiliary_train",
559
+ "test",
560
+ "validation",
561
+ "dev"
562
+ ],
563
+ "evaluation_splits": [
564
+ "test"
565
+ ],
566
+ "few_shots_split": "dev",
567
+ "few_shots_select": "sequential",
568
+ "generation_size": 1,
569
+ "stop_sequence": [
570
+ "\n"
571
+ ],
572
+ "output_regex": null,
573
+ "frozen": false,
574
+ "suite": [
575
+ "leaderboard",
576
+ "mmlu"
577
+ ],
578
+ "original_num_docs": 100,
579
+ "effective_num_docs": 100,
580
+ "trust_dataset": true,
581
+ "must_remove_duplicate_docs": null
582
+ },
583
+ "leaderboard|mmlu:college_mathematics": {
584
+ "name": "mmlu:college_mathematics",
585
+ "prompt_function": "mmlu_harness",
586
+ "hf_repo": "lighteval/mmlu",
587
+ "hf_subset": "college_mathematics",
588
+ "metric": [
589
+ "loglikelihood_acc"
590
+ ],
591
+ "hf_avail_splits": [
592
+ "auxiliary_train",
593
+ "test",
594
+ "validation",
595
+ "dev"
596
+ ],
597
+ "evaluation_splits": [
598
+ "test"
599
+ ],
600
+ "few_shots_split": "dev",
601
+ "few_shots_select": "sequential",
602
+ "generation_size": 1,
603
+ "stop_sequence": [
604
+ "\n"
605
+ ],
606
+ "output_regex": null,
607
+ "frozen": false,
608
+ "suite": [
609
+ "leaderboard",
610
+ "mmlu"
611
+ ],
612
+ "original_num_docs": 100,
613
+ "effective_num_docs": 100,
614
+ "trust_dataset": true,
615
+ "must_remove_duplicate_docs": null
616
+ },
617
+ "leaderboard|mmlu:college_medicine": {
618
+ "name": "mmlu:college_medicine",
619
+ "prompt_function": "mmlu_harness",
620
+ "hf_repo": "lighteval/mmlu",
621
+ "hf_subset": "college_medicine",
622
+ "metric": [
623
+ "loglikelihood_acc"
624
+ ],
625
+ "hf_avail_splits": [
626
+ "auxiliary_train",
627
+ "test",
628
+ "validation",
629
+ "dev"
630
+ ],
631
+ "evaluation_splits": [
632
+ "test"
633
+ ],
634
+ "few_shots_split": "dev",
635
+ "few_shots_select": "sequential",
636
+ "generation_size": 1,
637
+ "stop_sequence": [
638
+ "\n"
639
+ ],
640
+ "output_regex": null,
641
+ "frozen": false,
642
+ "suite": [
643
+ "leaderboard",
644
+ "mmlu"
645
+ ],
646
+ "original_num_docs": 173,
647
+ "effective_num_docs": 173,
648
+ "trust_dataset": true,
649
+ "must_remove_duplicate_docs": null
650
+ },
651
+ "leaderboard|mmlu:college_physics": {
652
+ "name": "mmlu:college_physics",
653
+ "prompt_function": "mmlu_harness",
654
+ "hf_repo": "lighteval/mmlu",
655
+ "hf_subset": "college_physics",
656
+ "metric": [
657
+ "loglikelihood_acc"
658
+ ],
659
+ "hf_avail_splits": [
660
+ "auxiliary_train",
661
+ "test",
662
+ "validation",
663
+ "dev"
664
+ ],
665
+ "evaluation_splits": [
666
+ "test"
667
+ ],
668
+ "few_shots_split": "dev",
669
+ "few_shots_select": "sequential",
670
+ "generation_size": 1,
671
+ "stop_sequence": [
672
+ "\n"
673
+ ],
674
+ "output_regex": null,
675
+ "frozen": false,
676
+ "suite": [
677
+ "leaderboard",
678
+ "mmlu"
679
+ ],
680
+ "original_num_docs": 102,
681
+ "effective_num_docs": 102,
682
+ "trust_dataset": true,
683
+ "must_remove_duplicate_docs": null
684
+ },
685
+ "leaderboard|mmlu:computer_security": {
686
+ "name": "mmlu:computer_security",
687
+ "prompt_function": "mmlu_harness",
688
+ "hf_repo": "lighteval/mmlu",
689
+ "hf_subset": "computer_security",
690
+ "metric": [
691
+ "loglikelihood_acc"
692
+ ],
693
+ "hf_avail_splits": [
694
+ "auxiliary_train",
695
+ "test",
696
+ "validation",
697
+ "dev"
698
+ ],
699
+ "evaluation_splits": [
700
+ "test"
701
+ ],
702
+ "few_shots_split": "dev",
703
+ "few_shots_select": "sequential",
704
+ "generation_size": 1,
705
+ "stop_sequence": [
706
+ "\n"
707
+ ],
708
+ "output_regex": null,
709
+ "frozen": false,
710
+ "suite": [
711
+ "leaderboard",
712
+ "mmlu"
713
+ ],
714
+ "original_num_docs": 100,
715
+ "effective_num_docs": 100,
716
+ "trust_dataset": true,
717
+ "must_remove_duplicate_docs": null
718
+ },
719
+ "leaderboard|mmlu:conceptual_physics": {
720
+ "name": "mmlu:conceptual_physics",
721
+ "prompt_function": "mmlu_harness",
722
+ "hf_repo": "lighteval/mmlu",
723
+ "hf_subset": "conceptual_physics",
724
+ "metric": [
725
+ "loglikelihood_acc"
726
+ ],
727
+ "hf_avail_splits": [
728
+ "auxiliary_train",
729
+ "test",
730
+ "validation",
731
+ "dev"
732
+ ],
733
+ "evaluation_splits": [
734
+ "test"
735
+ ],
736
+ "few_shots_split": "dev",
737
+ "few_shots_select": "sequential",
738
+ "generation_size": 1,
739
+ "stop_sequence": [
740
+ "\n"
741
+ ],
742
+ "output_regex": null,
743
+ "frozen": false,
744
+ "suite": [
745
+ "leaderboard",
746
+ "mmlu"
747
+ ],
748
+ "original_num_docs": 235,
749
+ "effective_num_docs": 235,
750
+ "trust_dataset": true,
751
+ "must_remove_duplicate_docs": null
752
+ },
753
+ "leaderboard|mmlu:econometrics": {
754
+ "name": "mmlu:econometrics",
755
+ "prompt_function": "mmlu_harness",
756
+ "hf_repo": "lighteval/mmlu",
757
+ "hf_subset": "econometrics",
758
+ "metric": [
759
+ "loglikelihood_acc"
760
+ ],
761
+ "hf_avail_splits": [
762
+ "auxiliary_train",
763
+ "test",
764
+ "validation",
765
+ "dev"
766
+ ],
767
+ "evaluation_splits": [
768
+ "test"
769
+ ],
770
+ "few_shots_split": "dev",
771
+ "few_shots_select": "sequential",
772
+ "generation_size": 1,
773
+ "stop_sequence": [
774
+ "\n"
775
+ ],
776
+ "output_regex": null,
777
+ "frozen": false,
778
+ "suite": [
779
+ "leaderboard",
780
+ "mmlu"
781
+ ],
782
+ "original_num_docs": 114,
783
+ "effective_num_docs": 114,
784
+ "trust_dataset": true,
785
+ "must_remove_duplicate_docs": null
786
+ },
787
+ "leaderboard|mmlu:electrical_engineering": {
788
+ "name": "mmlu:electrical_engineering",
789
+ "prompt_function": "mmlu_harness",
790
+ "hf_repo": "lighteval/mmlu",
791
+ "hf_subset": "electrical_engineering",
792
+ "metric": [
793
+ "loglikelihood_acc"
794
+ ],
795
+ "hf_avail_splits": [
796
+ "auxiliary_train",
797
+ "test",
798
+ "validation",
799
+ "dev"
800
+ ],
801
+ "evaluation_splits": [
802
+ "test"
803
+ ],
804
+ "few_shots_split": "dev",
805
+ "few_shots_select": "sequential",
806
+ "generation_size": 1,
807
+ "stop_sequence": [
808
+ "\n"
809
+ ],
810
+ "output_regex": null,
811
+ "frozen": false,
812
+ "suite": [
813
+ "leaderboard",
814
+ "mmlu"
815
+ ],
816
+ "original_num_docs": 145,
817
+ "effective_num_docs": 145,
818
+ "trust_dataset": true,
819
+ "must_remove_duplicate_docs": null
820
+ },
821
+ "leaderboard|mmlu:elementary_mathematics": {
822
+ "name": "mmlu:elementary_mathematics",
823
+ "prompt_function": "mmlu_harness",
824
+ "hf_repo": "lighteval/mmlu",
825
+ "hf_subset": "elementary_mathematics",
826
+ "metric": [
827
+ "loglikelihood_acc"
828
+ ],
829
+ "hf_avail_splits": [
830
+ "auxiliary_train",
831
+ "test",
832
+ "validation",
833
+ "dev"
834
+ ],
835
+ "evaluation_splits": [
836
+ "test"
837
+ ],
838
+ "few_shots_split": "dev",
839
+ "few_shots_select": "sequential",
840
+ "generation_size": 1,
841
+ "stop_sequence": [
842
+ "\n"
843
+ ],
844
+ "output_regex": null,
845
+ "frozen": false,
846
+ "suite": [
847
+ "leaderboard",
848
+ "mmlu"
849
+ ],
850
+ "original_num_docs": 378,
851
+ "effective_num_docs": 378,
852
+ "trust_dataset": true,
853
+ "must_remove_duplicate_docs": null
854
+ },
855
+ "leaderboard|mmlu:formal_logic": {
856
+ "name": "mmlu:formal_logic",
857
+ "prompt_function": "mmlu_harness",
858
+ "hf_repo": "lighteval/mmlu",
859
+ "hf_subset": "formal_logic",
860
+ "metric": [
861
+ "loglikelihood_acc"
862
+ ],
863
+ "hf_avail_splits": [
864
+ "auxiliary_train",
865
+ "test",
866
+ "validation",
867
+ "dev"
868
+ ],
869
+ "evaluation_splits": [
870
+ "test"
871
+ ],
872
+ "few_shots_split": "dev",
873
+ "few_shots_select": "sequential",
874
+ "generation_size": 1,
875
+ "stop_sequence": [
876
+ "\n"
877
+ ],
878
+ "output_regex": null,
879
+ "frozen": false,
880
+ "suite": [
881
+ "leaderboard",
882
+ "mmlu"
883
+ ],
884
+ "original_num_docs": 126,
885
+ "effective_num_docs": 126,
886
+ "trust_dataset": true,
887
+ "must_remove_duplicate_docs": null
888
+ },
889
+ "leaderboard|mmlu:global_facts": {
890
+ "name": "mmlu:global_facts",
891
+ "prompt_function": "mmlu_harness",
892
+ "hf_repo": "lighteval/mmlu",
893
+ "hf_subset": "global_facts",
894
+ "metric": [
895
+ "loglikelihood_acc"
896
+ ],
897
+ "hf_avail_splits": [
898
+ "auxiliary_train",
899
+ "test",
900
+ "validation",
901
+ "dev"
902
+ ],
903
+ "evaluation_splits": [
904
+ "test"
905
+ ],
906
+ "few_shots_split": "dev",
907
+ "few_shots_select": "sequential",
908
+ "generation_size": 1,
909
+ "stop_sequence": [
910
+ "\n"
911
+ ],
912
+ "output_regex": null,
913
+ "frozen": false,
914
+ "suite": [
915
+ "leaderboard",
916
+ "mmlu"
917
+ ],
918
+ "original_num_docs": 100,
919
+ "effective_num_docs": 100,
920
+ "trust_dataset": true,
921
+ "must_remove_duplicate_docs": null
922
+ },
+ "leaderboard|mmlu:high_school_biology": {
+ "name": "mmlu:high_school_biology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_biology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 310,
+ "effective_num_docs": 310,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_chemistry": {
+ "name": "mmlu:high_school_chemistry",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_chemistry",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 203,
+ "effective_num_docs": 203,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_computer_science": {
+ "name": "mmlu:high_school_computer_science",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_computer_science",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_european_history": {
+ "name": "mmlu:high_school_european_history",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_european_history",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 165,
+ "effective_num_docs": 165,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_geography": {
+ "name": "mmlu:high_school_geography",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_geography",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 198,
+ "effective_num_docs": 198,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_government_and_politics": {
+ "name": "mmlu:high_school_government_and_politics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_government_and_politics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 193,
+ "effective_num_docs": 193,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_macroeconomics": {
+ "name": "mmlu:high_school_macroeconomics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_macroeconomics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 390,
+ "effective_num_docs": 390,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_mathematics": {
+ "name": "mmlu:high_school_mathematics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_mathematics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 270,
+ "effective_num_docs": 270,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_microeconomics": {
+ "name": "mmlu:high_school_microeconomics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_microeconomics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 238,
+ "effective_num_docs": 238,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_physics": {
+ "name": "mmlu:high_school_physics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_physics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 151,
+ "effective_num_docs": 151,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_psychology": {
+ "name": "mmlu:high_school_psychology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_psychology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 545,
+ "effective_num_docs": 545,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_statistics": {
+ "name": "mmlu:high_school_statistics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_statistics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 216,
+ "effective_num_docs": 216,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_us_history": {
+ "name": "mmlu:high_school_us_history",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_us_history",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 204,
+ "effective_num_docs": 204,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_world_history": {
+ "name": "mmlu:high_school_world_history",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_world_history",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 237,
+ "effective_num_docs": 237,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:human_aging": {
+ "name": "mmlu:human_aging",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "human_aging",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 223,
+ "effective_num_docs": 223,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:human_sexuality": {
+ "name": "mmlu:human_sexuality",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "human_sexuality",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 131,
+ "effective_num_docs": 131,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:international_law": {
+ "name": "mmlu:international_law",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "international_law",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 121,
+ "effective_num_docs": 121,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:jurisprudence": {
+ "name": "mmlu:jurisprudence",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "jurisprudence",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 108,
+ "effective_num_docs": 108,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:logical_fallacies": {
+ "name": "mmlu:logical_fallacies",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "logical_fallacies",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 163,
+ "effective_num_docs": 163,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:machine_learning": {
+ "name": "mmlu:machine_learning",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "machine_learning",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 112,
+ "effective_num_docs": 112,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:management": {
+ "name": "mmlu:management",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "management",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 103,
+ "effective_num_docs": 103,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:marketing": {
+ "name": "mmlu:marketing",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "marketing",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 234,
+ "effective_num_docs": 234,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:medical_genetics": {
+ "name": "mmlu:medical_genetics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "medical_genetics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:miscellaneous": {
+ "name": "mmlu:miscellaneous",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "miscellaneous",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 783,
+ "effective_num_docs": 783,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:moral_disputes": {
+ "name": "mmlu:moral_disputes",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "moral_disputes",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 346,
+ "effective_num_docs": 346,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:moral_scenarios": {
+ "name": "mmlu:moral_scenarios",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "moral_scenarios",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 895,
+ "effective_num_docs": 895,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:nutrition": {
+ "name": "mmlu:nutrition",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "nutrition",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 306,
+ "effective_num_docs": 306,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:philosophy": {
+ "name": "mmlu:philosophy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "philosophy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 311,
+ "effective_num_docs": 311,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:prehistory": {
+ "name": "mmlu:prehistory",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "prehistory",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 324,
+ "effective_num_docs": 324,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:professional_accounting": {
+ "name": "mmlu:professional_accounting",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_accounting",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 282,
+ "effective_num_docs": 282,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:professional_law": {
+ "name": "mmlu:professional_law",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_law",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 1534,
+ "effective_num_docs": 1534,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:professional_medicine": {
+ "name": "mmlu:professional_medicine",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_medicine",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 272,
+ "effective_num_docs": 272,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:professional_psychology": {
+ "name": "mmlu:professional_psychology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_psychology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 612,
+ "effective_num_docs": 612,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:public_relations": {
+ "name": "mmlu:public_relations",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "public_relations",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 110,
+ "effective_num_docs": 110,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:security_studies": {
+ "name": "mmlu:security_studies",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "security_studies",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 245,
+ "effective_num_docs": 245,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:sociology": {
+ "name": "mmlu:sociology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "sociology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 201,
+ "effective_num_docs": 201,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:us_foreign_policy": {
+ "name": "mmlu:us_foreign_policy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "us_foreign_policy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:virology": {
+ "name": "mmlu:virology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "virology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 166,
+ "effective_num_docs": 166,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:world_religions": {
+ "name": "mmlu:world_religions",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "world_religions",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 171,
+ "effective_num_docs": 171,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ }
2249
+ },
2250
+ "summary_tasks": {
+ "leaderboard|mmlu:abstract_algebra|5": {
+ "hashes": {
+ "hash_examples": "4c76229e00c9c0e9",
+ "hash_full_prompts": "a45d01c3409c889c",
+ "hash_input_tokens": "dcd16c89f0617c0b",
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:anatomy|5": {
+ "hashes": {
+ "hash_examples": "6a1f8104dccbd33b",
+ "hash_full_prompts": "e245c6600e03cc32",
+ "hash_input_tokens": "2ba0f5c55af5f38b",
+ "hash_cont_tokens": "025910e68cf29c3d"
+ },
+ "truncated": 0,
+ "non_truncated": 135,
+ "padded": 540,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:astronomy|5": {
+ "hashes": {
+ "hash_examples": "1302effa3a76ce4c",
+ "hash_full_prompts": "390f9bddf857ad04",
+ "hash_input_tokens": "f4726f2ecad1cbdf",
+ "hash_cont_tokens": "1a66fd04f03e0517"
+ },
+ "truncated": 0,
+ "non_truncated": 152,
+ "padded": 608,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:business_ethics|5": {
+ "hashes": {
+ "hash_examples": "03cb8bce5336419a",
+ "hash_full_prompts": "5504f893bc4f2fa1",
+ "hash_input_tokens": "7a7100b8a79dd55d",
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:clinical_knowledge|5": {
+ "hashes": {
+ "hash_examples": "ffbb9c7b2be257f9",
+ "hash_full_prompts": "106ad0bab4b90b78",
+ "hash_input_tokens": "c8d9b8aa03735f94",
+ "hash_cont_tokens": "de872053260a1588"
+ },
+ "truncated": 0,
+ "non_truncated": 265,
+ "padded": 1060,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_biology|5": {
+ "hashes": {
+ "hash_examples": "3ee77f176f38eb8e",
+ "hash_full_prompts": "59f9bdf2695cb226",
+ "hash_input_tokens": "f127971484788afb",
+ "hash_cont_tokens": "9ace296b3e00bba3"
+ },
+ "truncated": 0,
+ "non_truncated": 144,
+ "padded": 576,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_chemistry|5": {
+ "hashes": {
+ "hash_examples": "ce61a69c46d47aeb",
+ "hash_full_prompts": "3cac9b759fcff7a0",
+ "hash_input_tokens": "73c59c88a8b06f1f",
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_computer_science|5": {
+ "hashes": {
+ "hash_examples": "32805b52d7d5daab",
+ "hash_full_prompts": "010b0cca35070130",
+ "hash_input_tokens": "5156fe94a0f086ba",
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_mathematics|5": {
+ "hashes": {
+ "hash_examples": "55da1a0a0bd33722",
+ "hash_full_prompts": "511422eb9eefc773",
+ "hash_input_tokens": "8d86abfbbe2bbcca",
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_medicine|5": {
+ "hashes": {
+ "hash_examples": "c33e143163049176",
+ "hash_full_prompts": "c8cc1a82a51a046e",
+ "hash_input_tokens": "c5e456edb1809398",
+ "hash_cont_tokens": "c80c0b5489bdbc5a"
+ },
+ "truncated": 0,
+ "non_truncated": 173,
+ "padded": 692,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_physics|5": {
+ "hashes": {
+ "hash_examples": "ebdab1cdb7e555df",
+ "hash_full_prompts": "e40721b5059c5818",
+ "hash_input_tokens": "36778d41528c69db",
+ "hash_cont_tokens": "569fcb9ac44734ae"
+ },
+ "truncated": 0,
+ "non_truncated": 102,
+ "padded": 408,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:computer_security|5": {
+ "hashes": {
+ "hash_examples": "a24fd7d08a560921",
+ "hash_full_prompts": "946c9be5964ac44a",
+ "hash_input_tokens": "341d545e9acf72c9",
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:conceptual_physics|5": {
+ "hashes": {
+ "hash_examples": "8300977a79386993",
+ "hash_full_prompts": "506a4f6094cc40c9",
+ "hash_input_tokens": "c4fd0b3bdf930aa0",
+ "hash_cont_tokens": "6e88c64c1a76752a"
+ },
+ "truncated": 0,
+ "non_truncated": 235,
+ "padded": 940,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:econometrics|5": {
+ "hashes": {
+ "hash_examples": "ddde36788a04a46f",
+ "hash_full_prompts": "4ed2703f27f1ed05",
+ "hash_input_tokens": "13057ab0c2e220ae",
+ "hash_cont_tokens": "a315e0e16c922c3c"
+ },
+ "truncated": 0,
+ "non_truncated": 114,
+ "padded": 456,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:electrical_engineering|5": {
+ "hashes": {
+ "hash_examples": "acbc5def98c19b3f",
+ "hash_full_prompts": "d8f4b3e11c23653c",
+ "hash_input_tokens": "33b1bf88ca122864",
+ "hash_cont_tokens": "44c72e6a7422c304"
+ },
+ "truncated": 0,
+ "non_truncated": 145,
+ "padded": 580,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:elementary_mathematics|5": {
+ "hashes": {
+ "hash_examples": "146e61d07497a9bd",
+ "hash_full_prompts": "256d111bd15647ff",
+ "hash_input_tokens": "3358f83a975de5ba",
+ "hash_cont_tokens": "cac0a6c304791bb7"
+ },
+ "truncated": 0,
+ "non_truncated": 378,
+ "padded": 1512,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:formal_logic|5": {
+ "hashes": {
+ "hash_examples": "8635216e1909a03f",
+ "hash_full_prompts": "1171d04f3b1a11f5",
+ "hash_input_tokens": "9606dd965a7a23b9",
+ "hash_cont_tokens": "8801fad3bbc72e57"
+ },
+ "truncated": 0,
+ "non_truncated": 126,
+ "padded": 504,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:global_facts|5": {
+ "hashes": {
+ "hash_examples": "30b315aa6353ee47",
+ "hash_full_prompts": "a7e56dbc074c7529",
+ "hash_input_tokens": "31bbd096e2001c12",
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_biology|5": {
+ "hashes": {
+ "hash_examples": "c9136373af2180de",
+ "hash_full_prompts": "ad6e859ed978e04a",
+ "hash_input_tokens": "b83977f179151f8a",
+ "hash_cont_tokens": "2d57d9e2c5a1fd64"
+ },
+ "truncated": 0,
+ "non_truncated": 310,
+ "padded": 1240,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_chemistry|5": {
+ "hashes": {
+ "hash_examples": "b0661bfa1add6404",
+ "hash_full_prompts": "6eb9c04bcc8a8f2a",
+ "hash_input_tokens": "cc46903458771653",
+ "hash_cont_tokens": "bb0fd92673ddfb31"
+ },
+ "truncated": 0,
+ "non_truncated": 203,
+ "padded": 812,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_computer_science|5": {
+ "hashes": {
+ "hash_examples": "80fc1d623a3d665f",
+ "hash_full_prompts": "8e51bc91c81cf8dd",
+ "hash_input_tokens": "126365ab1bd9432e",
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_european_history|5": {
+ "hashes": {
+ "hash_examples": "854da6e5af0fe1a1",
+ "hash_full_prompts": "664a1f16c9f3195c",
+ "hash_input_tokens": "717134f9a7cfebeb",
+ "hash_cont_tokens": "16e494cddccc4a04"
+ },
+ "truncated": 0,
+ "non_truncated": 165,
+ "padded": 656,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_geography|5": {
+ "hashes": {
+ "hash_examples": "7dc963c7acd19ad8",
+ "hash_full_prompts": "f3acf911f4023c8a",
+ "hash_input_tokens": "0688033c7e5cd9a5",
+ "hash_cont_tokens": "16b7f65a07b3d47b"
+ },
+ "truncated": 0,
+ "non_truncated": 198,
+ "padded": 792,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_government_and_politics|5": {
+ "hashes": {
+ "hash_examples": "1f675dcdebc9758f",
+ "hash_full_prompts": "066254feaa3158ae",
+ "hash_input_tokens": "17b37cad7a865933",
+ "hash_cont_tokens": "476e87fd675136aa"
+ },
+ "truncated": 0,
+ "non_truncated": 193,
+ "padded": 772,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_macroeconomics|5": {
+ "hashes": {
+ "hash_examples": "2fb32cf2d80f0b35",
+ "hash_full_prompts": "19a7fa502aa85c95",
+ "hash_input_tokens": "df79701e12193fc2",
+ "hash_cont_tokens": "b0c7b4c5f7bdf3e7"
+ },
+ "truncated": 0,
+ "non_truncated": 390,
+ "padded": 1560,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_mathematics|5": {
+ "hashes": {
+ "hash_examples": "fd6646fdb5d58a1f",
+ "hash_full_prompts": "4f704e369778b5b0",
+ "hash_input_tokens": "844770e9dbfc9672",
+ "hash_cont_tokens": "1a05d6ff49846fd1"
+ },
+ "truncated": 0,
+ "non_truncated": 270,
+ "padded": 1080,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_microeconomics|5": {
+ "hashes": {
+ "hash_examples": "2118f21f71d87d84",
+ "hash_full_prompts": "4350f9e2240f8010",
+ "hash_input_tokens": "d275282ee3fcaf73",
+ "hash_cont_tokens": "0e7f0645ffffd6cd"
+ },
+ "truncated": 0,
+ "non_truncated": 238,
+ "padded": 949,
+ "non_padded": 3,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_physics|5": {
+ "hashes": {
+ "hash_examples": "dc3ce06378548565",
+ "hash_full_prompts": "5dc0d6831b66188f",
+ "hash_input_tokens": "077a59148f55b11a",
+ "hash_cont_tokens": "41ca6560b8c10183"
+ },
+ "truncated": 0,
+ "non_truncated": 151,
+ "padded": 604,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_psychology|5": {
+ "hashes": {
+ "hash_examples": "c8d1d98a40e11f2f",
+ "hash_full_prompts": "af2b097da6d50365",
+ "hash_input_tokens": "40c945a9e6372f3e",
+ "hash_cont_tokens": "53a17ff85c607844"
+ },
+ "truncated": 0,
+ "non_truncated": 545,
+ "padded": 2178,
+ "non_padded": 2,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_statistics|5": {
+ "hashes": {
+ "hash_examples": "666c8759b98ee4ff",
+ "hash_full_prompts": "c757694421d6d68d",
+ "hash_input_tokens": "4c142698c2fcebcd",
+ "hash_cont_tokens": "bc9063ad140cc941"
+ },
+ "truncated": 0,
+ "non_truncated": 216,
+ "padded": 864,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_us_history|5": {
+ "hashes": {
+ "hash_examples": "95fef1c4b7d3f81e",
+ "hash_full_prompts": "e34a028d0ddeec5e",
+ "hash_input_tokens": "52ab0238e27d7233",
+ "hash_cont_tokens": "5cf777085ba01096"
+ },
+ "truncated": 0,
+ "non_truncated": 204,
+ "padded": 816,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_world_history|5": {
+ "hashes": {
+ "hash_examples": "7e5085b6184b0322",
+ "hash_full_prompts": "1fa3d51392765601",
+ "hash_input_tokens": "48a9455da6296979",
+ "hash_cont_tokens": "152af2d9e4830517"
+ },
+ "truncated": 0,
+ "non_truncated": 237,
+ "padded": 948,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:human_aging|5": {
+ "hashes": {
+ "hash_examples": "c17333e7c7c10797",
+ "hash_full_prompts": "cac900721f9a1a94",
+ "hash_input_tokens": "339adae80562986a",
+ "hash_cont_tokens": "da4d9eaa044021dd"
+ },
+ "truncated": 0,
+ "non_truncated": 223,
+ "padded": 892,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:human_sexuality|5": {
+ "hashes": {
+ "hash_examples": "4edd1e9045df5e3d",
+ "hash_full_prompts": "0d6567bafee0a13c",
+ "hash_input_tokens": "ff1387c2013633fb",
+ "hash_cont_tokens": "1b99e384258a4eeb"
+ },
+ "truncated": 0,
+ "non_truncated": 131,
+ "padded": 524,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:international_law|5": {
+ "hashes": {
+ "hash_examples": "db2fa00d771a062a",
+ "hash_full_prompts": "d018f9116479795e",
+ "hash_input_tokens": "77b37a074e9639bd",
+ "hash_cont_tokens": "cbf02c30cdded208"
+ },
+ "truncated": 0,
+ "non_truncated": 121,
+ "padded": 484,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:jurisprudence|5": {
+ "hashes": {
+ "hash_examples": "e956f86b124076fe",
+ "hash_full_prompts": "1487e89a10ec58b7",
+ "hash_input_tokens": "c813143d75ef5cce",
+ "hash_cont_tokens": "4b248cf879d97a50"
+ },
+ "truncated": 0,
+ "non_truncated": 108,
+ "padded": 424,
+ "non_padded": 8,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:logical_fallacies|5": {
+ "hashes": {
+ "hash_examples": "956e0e6365ab79f1",
+ "hash_full_prompts": "677785b2181f9243",
+ "hash_input_tokens": "64f99c507ec3b899",
+ "hash_cont_tokens": "6d9c35172b158838"
+ },
+ "truncated": 0,
+ "non_truncated": 163,
+ "padded": 632,
+ "non_padded": 20,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:machine_learning|5": {
+ "hashes": {
+ "hash_examples": "397997cc6f4d581e",
+ "hash_full_prompts": "769ee14a2aea49bb",
+ "hash_input_tokens": "273d1ef7b0121919",
+ "hash_cont_tokens": "66c3ec85fee2fc98"
+ },
+ "truncated": 0,
+ "non_truncated": 112,
+ "padded": 448,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:management|5": {
+ "hashes": {
+ "hash_examples": "2bcbe6f6ca63d740",
+ "hash_full_prompts": "cb1ff9dac9582144",
+ "hash_input_tokens": "eb176b9b171d3159",
+ "hash_cont_tokens": "5e2470abd1fb9d10"
+ },
+ "truncated": 0,
+ "non_truncated": 103,
+ "padded": 412,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:marketing|5": {
+ "hashes": {
+ "hash_examples": "8ddb20d964a1b065",
+ "hash_full_prompts": "9fc2114a187ad9a2",
+ "hash_input_tokens": "9af057a3a5b71aa4",
+ "hash_cont_tokens": "27fe68d9630f8999"
+ },
+ "truncated": 0,
+ "non_truncated": 234,
+ "padded": 916,
+ "non_padded": 20,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:medical_genetics|5": {
+ "hashes": {
+ "hash_examples": "182a71f4763d2cea",
+ "hash_full_prompts": "46a616fa51878959",
+ "hash_input_tokens": "78e8fd5b85018f70",
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:miscellaneous|5": {
+ "hashes": {
+ "hash_examples": "4c404fdbb4ca57fc",
+ "hash_full_prompts": "0813e1be36dbaae1",
+ "hash_input_tokens": "6d11bc47e05dd2de",
+ "hash_cont_tokens": "dfa423a160edd337"
+ },
+ "truncated": 0,
+ "non_truncated": 783,
+ "padded": 3128,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:moral_disputes|5": {
+ "hashes": {
+ "hash_examples": "60cbd2baa3fea5c9",
+ "hash_full_prompts": "1d14adebb9b62519",
+ "hash_input_tokens": "67e735b423b90b85",
+ "hash_cont_tokens": "bef966e6669349be"
+ },
+ "truncated": 0,
+ "non_truncated": 346,
+ "padded": 1380,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:moral_scenarios|5": {
+ "hashes": {
+ "hash_examples": "fd8b0431fbdd75ef",
+ "hash_full_prompts": "b80d3d236165e3de",
+ "hash_input_tokens": "2162a0284a55e8dd",
+ "hash_cont_tokens": "a7bfdd944d86bcb5"
+ },
+ "truncated": 0,
+ "non_truncated": 895,
+ "padded": 3575,
+ "non_padded": 5,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:nutrition|5": {
+ "hashes": {
+ "hash_examples": "71e55e2b829b6528",
+ "hash_full_prompts": "2bfb18e5fab8dea7",
+ "hash_input_tokens": "7ef044063c66d876",
+ "hash_cont_tokens": "fcda7736026f2449"
+ },
+ "truncated": 0,
+ "non_truncated": 306,
+ "padded": 1224,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:philosophy|5": {
+ "hashes": {
+ "hash_examples": "a6d489a8d208fa4b",
+ "hash_full_prompts": "e8c0d5b6dae3ccc8",
+ "hash_input_tokens": "e475f585b68fe0ad",
+ "hash_cont_tokens": "0f39b851342e8986"
+ },
+ "truncated": 0,
+ "non_truncated": 311,
+ "padded": 1244,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:prehistory|5": {
+ "hashes": {
+ "hash_examples": "6cc50f032a19acaa",
+ "hash_full_prompts": "4a6a1d3ab1bf28e4",
+ "hash_input_tokens": "44821ed56961b949",
+ "hash_cont_tokens": "b60e45d3e9856b35"
+ },
+ "truncated": 0,
+ "non_truncated": 324,
+ "padded": 1280,
+ "non_padded": 16,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_accounting|5": {
+ "hashes": {
+ "hash_examples": "50f57ab32f5f6cea",
+ "hash_full_prompts": "e60129bd2d82ffc6",
+ "hash_input_tokens": "3fda79e4e3994658",
+ "hash_cont_tokens": "a0c4e121b7293818"
+ },
+ "truncated": 0,
+ "non_truncated": 282,
+ "padded": 1112,
+ "non_padded": 16,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_law|5": {
+ "hashes": {
+ "hash_examples": "a8fdc85c64f4b215",
+ "hash_full_prompts": "0dbb1d9b72dcea03",
+ "hash_input_tokens": "ea30074f69f9f56f",
+ "hash_cont_tokens": "68b662abeba54fbc"
+ },
+ "truncated": 0,
+ "non_truncated": 1534,
+ "padded": 6136,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_medicine|5": {
+ "hashes": {
+ "hash_examples": "c373a28a3050a73a",
+ "hash_full_prompts": "5e040f9ca68b089e",
+ "hash_input_tokens": "9bcee2ee9fbc2765",
+ "hash_cont_tokens": "6caeac5412bb4a09"
+ },
+ "truncated": 0,
+ "non_truncated": 272,
+ "padded": 1088,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_psychology|5": {
+ "hashes": {
+ "hash_examples": "bf5254fe818356af",
+ "hash_full_prompts": "b386ecda8b87150e",
+ "hash_input_tokens": "52ee178fd8502719",
+ "hash_cont_tokens": "79b091252a1095a9"
+ },
+ "truncated": 0,
+ "non_truncated": 612,
+ "padded": 2448,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:public_relations|5": {
+ "hashes": {
+ "hash_examples": "b66d52e28e7d14e0",
+ "hash_full_prompts": "fe43562263e25677",
+ "hash_input_tokens": "19c1feb7b275bb61",
+ "hash_cont_tokens": "987115a77c8704f0"
+ },
+ "truncated": 0,
+ "non_truncated": 110,
+ "padded": 436,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:security_studies|5": {
+ "hashes": {
+ "hash_examples": "514c14feaf000ad9",
+ "hash_full_prompts": "27d4a2ac541ef4b9",
+ "hash_input_tokens": "8be9de223078a919",
+ "hash_cont_tokens": "6c35bc7e96074b27"
+ },
+ "truncated": 0,
+ "non_truncated": 245,
+ "padded": 980,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:sociology|5": {
+ "hashes": {
+ "hash_examples": "f6c9bc9d18c80870",
+ "hash_full_prompts": "c072ea7d1a1524f2",
+ "hash_input_tokens": "7a9f0cfb80730068",
+ "hash_cont_tokens": "32af622f73b2e657"
+ },
+ "truncated": 0,
+ "non_truncated": 201,
+ "padded": 804,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:us_foreign_policy|5": {
+ "hashes": {
+ "hash_examples": "ed7b78629db6678f",
+ "hash_full_prompts": "341a97ca3e4d699d",
+ "hash_input_tokens": "8571fa884af32043",
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:virology|5": {
+ "hashes": {
+ "hash_examples": "bc52ffdc3f9b994a",
+ "hash_full_prompts": "651d471e2eb8b5e9",
+ "hash_input_tokens": "b3022dee6d7e9a7f",
+ "hash_cont_tokens": "beded8c3660dc8f5"
+ },
+ "truncated": 0,
+ "non_truncated": 166,
+ "padded": 664,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:world_religions|5": {
+ "hashes": {
+ "hash_examples": "ecdb4a4f94f62930",
+ "hash_full_prompts": "3773f03542ce44a3",
+ "hash_input_tokens": "1d1a7236494e8427",
+ "hash_cont_tokens": "9b1952a4af3d6a73"
+ },
+ "truncated": 0,
+ "non_truncated": 171,
+ "padded": 684,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ }
+ },
+ "summary_general": {
+ "hashes": {
+ "hash_examples": "341a076d0beb7048",
+ "hash_full_prompts": "a5c8f2b7ff4f5ae2",
+ "hash_input_tokens": "ccd7ce8216899ee4",
+ "hash_cont_tokens": "25e9f343d6b95644"
+ },
+ "truncated": 0,
+ "non_truncated": 14042,
+ "padded": 56062,
+ "non_padded": 106,
+ "num_truncated_few_shots": 0
+ }
+ }