lewtun (HF staff) committed on
Commit e5ba0d9 · verified · 1 parent: 945fe8f

Upload eval_results/orpo-explorers/kaist-mistral-orpo-OHP-15k-Stratified-1-beta-0.2-1epoch-capybara-2epoch/main/mmlu/results_2024-04-30T21-30-34.625460.json with huggingface_hub

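A quick sanity check on the numbers in this file: the reported `acc_stderr` values appear consistent with the standard error of the mean of per-question 0/1 scores, using the sample standard deviation (n - 1 in the denominator). This is inferred from the values themselves, not from the lighteval source, so treat the formula as an assumption. The helper name `acc_stderr` is hypothetical; the accuracies and standard errors are copied from the `results` section, and the question counts from `effective_num_docs` in `config_tasks`.

```python
import math


def acc_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary scores whose mean is acc.

    Assumes the sample standard deviation (ddof=1) is used, which is an
    inference from the reported numbers, not a confirmed implementation detail.
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# (acc, effective_num_docs, reported acc_stderr), copied from this results file.
reported = {
    "abstract_algebra": (0.33, 100, 0.04725815626252605),
    "anatomy": (0.6, 135, 0.042320736951515885),
}

for task, (acc, n, stderr) in reported.items():
    # Compare the recomputed value with the reported one, allowing float noise.
    assert math.isclose(acc_stderr(acc, n), stderr, rel_tol=1e-9), task
```

The same check can be repeated for any other subset by pairing its `acc` with the matching `effective_num_docs`.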
eval_results/orpo-explorers/kaist-mistral-orpo-OHP-15k-Stratified-1-beta-0.2-1epoch-capybara-2epoch/main/mmlu/results_2024-04-30T21-30-34.625460.json ADDED
@@ -0,0 +1,3067 @@
1
+ {
2
+ "config_general": {
3
+ "lighteval_sha": "?",
4
+ "num_fewshot_seeds": 1,
5
+ "override_batch_size": 4,
6
+ "max_samples": null,
7
+ "job_id": "",
8
+ "start_time": 1494392.526407804,
9
+ "end_time": 1495052.000290271,
10
+ "total_evaluation_time_secondes": "659.4738824667875",
11
+ "model_name": "orpo-explorers/kaist-mistral-orpo-OHP-15k-Stratified-1-beta-0.2-1epoch-capybara-2epoch",
12
+ "model_sha": "90b892d49af17d5e23c1c53d45f41bbd66894843",
13
+ "model_dtype": "torch.bfloat16",
14
+ "model_size": "13.99 GB",
15
+ "config": null
16
+ },
17
+ "results": {
18
+ "leaderboard|mmlu:abstract_algebra|5": {
19
+ "acc": 0.33,
20
+ "acc_stderr": 0.04725815626252605
21
+ },
22
+ "leaderboard|mmlu:anatomy|5": {
23
+ "acc": 0.6,
24
+ "acc_stderr": 0.042320736951515885
25
+ },
26
+ "leaderboard|mmlu:astronomy|5": {
27
+ "acc": 0.631578947368421,
28
+ "acc_stderr": 0.03925523381052932
29
+ },
30
+ "leaderboard|mmlu:business_ethics|5": {
31
+ "acc": 0.53,
32
+ "acc_stderr": 0.05016135580465919
33
+ },
34
+ "leaderboard|mmlu:clinical_knowledge|5": {
35
+ "acc": 0.7094339622641509,
36
+ "acc_stderr": 0.027943219989337117
37
+ },
38
+ "leaderboard|mmlu:college_biology|5": {
39
+ "acc": 0.7291666666666666,
40
+ "acc_stderr": 0.03716177437566018
41
+ },
42
+ "leaderboard|mmlu:college_chemistry|5": {
43
+ "acc": 0.49,
44
+ "acc_stderr": 0.05024183937956911
45
+ },
46
+ "leaderboard|mmlu:college_computer_science|5": {
47
+ "acc": 0.4,
48
+ "acc_stderr": 0.049236596391733084
49
+ },
50
+ "leaderboard|mmlu:college_mathematics|5": {
51
+ "acc": 0.33,
52
+ "acc_stderr": 0.047258156262526045
53
+ },
54
+ "leaderboard|mmlu:college_medicine|5": {
55
+ "acc": 0.6473988439306358,
56
+ "acc_stderr": 0.036430371689585475
57
+ },
58
+ "leaderboard|mmlu:college_physics|5": {
59
+ "acc": 0.4215686274509804,
60
+ "acc_stderr": 0.04913595201274498
61
+ },
62
+ "leaderboard|mmlu:computer_security|5": {
63
+ "acc": 0.7,
64
+ "acc_stderr": 0.046056618647183814
65
+ },
66
+ "leaderboard|mmlu:conceptual_physics|5": {
67
+ "acc": 0.5063829787234042,
68
+ "acc_stderr": 0.03268335899936337
69
+ },
70
+ "leaderboard|mmlu:econometrics|5": {
71
+ "acc": 0.41228070175438597,
72
+ "acc_stderr": 0.046306532033665956
73
+ },
74
+ "leaderboard|mmlu:electrical_engineering|5": {
75
+ "acc": 0.47586206896551725,
76
+ "acc_stderr": 0.04161808503501528
77
+ },
78
+ "leaderboard|mmlu:elementary_mathematics|5": {
79
+ "acc": 0.35978835978835977,
80
+ "acc_stderr": 0.024718075944129288
81
+ },
82
+ "leaderboard|mmlu:formal_logic|5": {
83
+ "acc": 0.42857142857142855,
84
+ "acc_stderr": 0.04426266681379909
85
+ },
86
+ "leaderboard|mmlu:global_facts|5": {
87
+ "acc": 0.39,
88
+ "acc_stderr": 0.04902071300001974
89
+ },
90
+ "leaderboard|mmlu:high_school_biology|5": {
91
+ "acc": 0.7612903225806451,
92
+ "acc_stderr": 0.024251071262208837
93
+ },
94
+ "leaderboard|mmlu:high_school_chemistry|5": {
95
+ "acc": 0.4729064039408867,
96
+ "acc_stderr": 0.03512819077876106
97
+ },
98
+ "leaderboard|mmlu:high_school_computer_science|5": {
99
+ "acc": 0.62,
100
+ "acc_stderr": 0.048783173121456316
101
+ },
102
+ "leaderboard|mmlu:high_school_european_history|5": {
103
+ "acc": 0.7454545454545455,
104
+ "acc_stderr": 0.03401506715249039
105
+ },
106
+ "leaderboard|mmlu:high_school_geography|5": {
107
+ "acc": 0.7323232323232324,
108
+ "acc_stderr": 0.03154449888270286
109
+ },
110
+ "leaderboard|mmlu:high_school_government_and_politics|5": {
111
+ "acc": 0.8652849740932642,
112
+ "acc_stderr": 0.024639789097709443
113
+ },
114
+ "leaderboard|mmlu:high_school_macroeconomics|5": {
115
+ "acc": 0.6333333333333333,
116
+ "acc_stderr": 0.024433016466052462
117
+ },
118
+ "leaderboard|mmlu:high_school_mathematics|5": {
119
+ "acc": 0.2518518518518518,
120
+ "acc_stderr": 0.026466117538959916
121
+ },
122
+ "leaderboard|mmlu:high_school_microeconomics|5": {
123
+ "acc": 0.6596638655462185,
124
+ "acc_stderr": 0.030778057422931673
125
+ },
126
+ "leaderboard|mmlu:high_school_physics|5": {
127
+ "acc": 0.36423841059602646,
128
+ "acc_stderr": 0.03929111781242742
129
+ },
130
+ "leaderboard|mmlu:high_school_psychology|5": {
131
+ "acc": 0.8091743119266055,
132
+ "acc_stderr": 0.016847676400091098
133
+ },
134
+ "leaderboard|mmlu:high_school_statistics|5": {
135
+ "acc": 0.5277777777777778,
136
+ "acc_stderr": 0.0340470532865388
137
+ },
138
+ "leaderboard|mmlu:high_school_us_history|5": {
139
+ "acc": 0.7549019607843137,
140
+ "acc_stderr": 0.030190282453501947
141
+ },
142
+ "leaderboard|mmlu:high_school_world_history|5": {
143
+ "acc": 0.7510548523206751,
144
+ "acc_stderr": 0.028146970599422644
145
+ },
146
+ "leaderboard|mmlu:human_aging|5": {
147
+ "acc": 0.6636771300448431,
148
+ "acc_stderr": 0.031708824268455005
149
+ },
150
+ "leaderboard|mmlu:human_sexuality|5": {
151
+ "acc": 0.7251908396946565,
152
+ "acc_stderr": 0.039153454088478354
153
+ },
154
+ "leaderboard|mmlu:international_law|5": {
155
+ "acc": 0.7107438016528925,
156
+ "acc_stderr": 0.04139112727635463
157
+ },
158
+ "leaderboard|mmlu:jurisprudence|5": {
159
+ "acc": 0.7407407407407407,
160
+ "acc_stderr": 0.042365112580946315
161
+ },
162
+ "leaderboard|mmlu:logical_fallacies|5": {
163
+ "acc": 0.7361963190184049,
164
+ "acc_stderr": 0.03462419931615624
165
+ },
166
+ "leaderboard|mmlu:machine_learning|5": {
167
+ "acc": 0.4107142857142857,
168
+ "acc_stderr": 0.04669510663875191
169
+ },
170
+ "leaderboard|mmlu:management|5": {
171
+ "acc": 0.7961165048543689,
172
+ "acc_stderr": 0.039891398595317706
173
+ },
174
+ "leaderboard|mmlu:marketing|5": {
175
+ "acc": 0.8418803418803419,
176
+ "acc_stderr": 0.023902325549560403
177
+ },
178
+ "leaderboard|mmlu:medical_genetics|5": {
179
+ "acc": 0.7,
180
+ "acc_stderr": 0.046056618647183814
181
+ },
182
+ "leaderboard|mmlu:miscellaneous|5": {
183
+ "acc": 0.8058748403575989,
184
+ "acc_stderr": 0.014143970276657569
185
+ },
186
+ "leaderboard|mmlu:moral_disputes|5": {
187
+ "acc": 0.6936416184971098,
188
+ "acc_stderr": 0.0248183501294366
189
+ },
190
+ "leaderboard|mmlu:moral_scenarios|5": {
191
+ "acc": 0.3854748603351955,
192
+ "acc_stderr": 0.016277927039638193
193
+ },
194
+ "leaderboard|mmlu:nutrition|5": {
195
+ "acc": 0.6830065359477124,
196
+ "acc_stderr": 0.026643278474508755
197
+ },
198
+ "leaderboard|mmlu:philosophy|5": {
199
+ "acc": 0.6881028938906752,
200
+ "acc_stderr": 0.026311858071854155
201
+ },
202
+ "leaderboard|mmlu:prehistory|5": {
203
+ "acc": 0.691358024691358,
204
+ "acc_stderr": 0.02570264026060374
205
+ },
206
+ "leaderboard|mmlu:professional_accounting|5": {
207
+ "acc": 0.4787234042553192,
208
+ "acc_stderr": 0.029800481645628693
209
+ },
210
+ "leaderboard|mmlu:professional_law|5": {
211
+ "acc": 0.4426336375488918,
212
+ "acc_stderr": 0.01268590653820624
213
+ },
214
+ "leaderboard|mmlu:professional_medicine|5": {
215
+ "acc": 0.6654411764705882,
216
+ "acc_stderr": 0.028661996202335303
217
+ },
218
+ "leaderboard|mmlu:professional_psychology|5": {
219
+ "acc": 0.6241830065359477,
220
+ "acc_stderr": 0.019594021136577443
221
+ },
222
+ "leaderboard|mmlu:public_relations|5": {
223
+ "acc": 0.7,
224
+ "acc_stderr": 0.04389311454644286
225
+ },
226
+ "leaderboard|mmlu:security_studies|5": {
227
+ "acc": 0.7061224489795919,
228
+ "acc_stderr": 0.029162738410249776
229
+ },
230
+ "leaderboard|mmlu:sociology|5": {
231
+ "acc": 0.8109452736318408,
232
+ "acc_stderr": 0.027686913588013014
233
+ },
234
+ "leaderboard|mmlu:us_foreign_policy|5": {
235
+ "acc": 0.83,
236
+ "acc_stderr": 0.0377525168068637
237
+ },
238
+ "leaderboard|mmlu:virology|5": {
239
+ "acc": 0.5180722891566265,
240
+ "acc_stderr": 0.03889951252827216
241
+ },
242
+ "leaderboard|mmlu:world_religions|5": {
243
+ "acc": 0.8128654970760234,
244
+ "acc_stderr": 0.02991312723236804
245
+ },
246
+ "leaderboard|mmlu:_average|5": {
247
+ "acc": 0.612333226298041,
248
+ "acc_stderr": 0.03451522886890664
249
+ },
250
+ "all": {
251
+ "acc": 0.612333226298041,
252
+ "acc_stderr": 0.03451522886890664
253
+ }
254
+ },
255
+ "versions": {
256
+ "leaderboard|mmlu:abstract_algebra|5": 0,
257
+ "leaderboard|mmlu:anatomy|5": 0,
258
+ "leaderboard|mmlu:astronomy|5": 0,
259
+ "leaderboard|mmlu:business_ethics|5": 0,
260
+ "leaderboard|mmlu:clinical_knowledge|5": 0,
261
+ "leaderboard|mmlu:college_biology|5": 0,
262
+ "leaderboard|mmlu:college_chemistry|5": 0,
263
+ "leaderboard|mmlu:college_computer_science|5": 0,
264
+ "leaderboard|mmlu:college_mathematics|5": 0,
265
+ "leaderboard|mmlu:college_medicine|5": 0,
266
+ "leaderboard|mmlu:college_physics|5": 0,
267
+ "leaderboard|mmlu:computer_security|5": 0,
268
+ "leaderboard|mmlu:conceptual_physics|5": 0,
269
+ "leaderboard|mmlu:econometrics|5": 0,
270
+ "leaderboard|mmlu:electrical_engineering|5": 0,
271
+ "leaderboard|mmlu:elementary_mathematics|5": 0,
272
+ "leaderboard|mmlu:formal_logic|5": 0,
273
+ "leaderboard|mmlu:global_facts|5": 0,
274
+ "leaderboard|mmlu:high_school_biology|5": 0,
275
+ "leaderboard|mmlu:high_school_chemistry|5": 0,
276
+ "leaderboard|mmlu:high_school_computer_science|5": 0,
277
+ "leaderboard|mmlu:high_school_european_history|5": 0,
278
+ "leaderboard|mmlu:high_school_geography|5": 0,
279
+ "leaderboard|mmlu:high_school_government_and_politics|5": 0,
280
+ "leaderboard|mmlu:high_school_macroeconomics|5": 0,
281
+ "leaderboard|mmlu:high_school_mathematics|5": 0,
282
+ "leaderboard|mmlu:high_school_microeconomics|5": 0,
283
+ "leaderboard|mmlu:high_school_physics|5": 0,
284
+ "leaderboard|mmlu:high_school_psychology|5": 0,
285
+ "leaderboard|mmlu:high_school_statistics|5": 0,
286
+ "leaderboard|mmlu:high_school_us_history|5": 0,
287
+ "leaderboard|mmlu:high_school_world_history|5": 0,
288
+ "leaderboard|mmlu:human_aging|5": 0,
289
+ "leaderboard|mmlu:human_sexuality|5": 0,
290
+ "leaderboard|mmlu:international_law|5": 0,
291
+ "leaderboard|mmlu:jurisprudence|5": 0,
292
+ "leaderboard|mmlu:logical_fallacies|5": 0,
293
+ "leaderboard|mmlu:machine_learning|5": 0,
294
+ "leaderboard|mmlu:management|5": 0,
295
+ "leaderboard|mmlu:marketing|5": 0,
296
+ "leaderboard|mmlu:medical_genetics|5": 0,
297
+ "leaderboard|mmlu:miscellaneous|5": 0,
298
+ "leaderboard|mmlu:moral_disputes|5": 0,
299
+ "leaderboard|mmlu:moral_scenarios|5": 0,
300
+ "leaderboard|mmlu:nutrition|5": 0,
301
+ "leaderboard|mmlu:philosophy|5": 0,
302
+ "leaderboard|mmlu:prehistory|5": 0,
303
+ "leaderboard|mmlu:professional_accounting|5": 0,
304
+ "leaderboard|mmlu:professional_law|5": 0,
305
+ "leaderboard|mmlu:professional_medicine|5": 0,
306
+ "leaderboard|mmlu:professional_psychology|5": 0,
307
+ "leaderboard|mmlu:public_relations|5": 0,
308
+ "leaderboard|mmlu:security_studies|5": 0,
309
+ "leaderboard|mmlu:sociology|5": 0,
310
+ "leaderboard|mmlu:us_foreign_policy|5": 0,
311
+ "leaderboard|mmlu:virology|5": 0,
312
+ "leaderboard|mmlu:world_religions|5": 0
313
+ },
314
+ "config_tasks": {
315
+ "leaderboard|mmlu:abstract_algebra": {
316
+ "name": "mmlu:abstract_algebra",
317
+ "prompt_function": "mmlu_harness",
318
+ "hf_repo": "lighteval/mmlu",
319
+ "hf_subset": "abstract_algebra",
320
+ "metric": [
321
+ "loglikelihood_acc"
322
+ ],
323
+ "hf_avail_splits": [
324
+ "auxiliary_train",
325
+ "test",
326
+ "validation",
327
+ "dev"
328
+ ],
329
+ "evaluation_splits": [
330
+ "test"
331
+ ],
332
+ "few_shots_split": "dev",
333
+ "few_shots_select": "sequential",
334
+ "generation_size": 1,
335
+ "stop_sequence": [
336
+ "\n"
337
+ ],
338
+ "output_regex": null,
339
+ "frozen": false,
340
+ "suite": [
341
+ "leaderboard",
342
+ "mmlu"
343
+ ],
344
+ "original_num_docs": 100,
345
+ "effective_num_docs": 100,
346
+ "trust_dataset": true,
347
+ "must_remove_duplicate_docs": null
348
+ },
349
+ "leaderboard|mmlu:anatomy": {
350
+ "name": "mmlu:anatomy",
351
+ "prompt_function": "mmlu_harness",
352
+ "hf_repo": "lighteval/mmlu",
353
+ "hf_subset": "anatomy",
354
+ "metric": [
355
+ "loglikelihood_acc"
356
+ ],
357
+ "hf_avail_splits": [
358
+ "auxiliary_train",
359
+ "test",
360
+ "validation",
361
+ "dev"
362
+ ],
363
+ "evaluation_splits": [
364
+ "test"
365
+ ],
366
+ "few_shots_split": "dev",
367
+ "few_shots_select": "sequential",
368
+ "generation_size": 1,
369
+ "stop_sequence": [
370
+ "\n"
371
+ ],
372
+ "output_regex": null,
373
+ "frozen": false,
374
+ "suite": [
375
+ "leaderboard",
376
+ "mmlu"
377
+ ],
378
+ "original_num_docs": 135,
379
+ "effective_num_docs": 135,
380
+ "trust_dataset": true,
381
+ "must_remove_duplicate_docs": null
382
+ },
383
+ "leaderboard|mmlu:astronomy": {
384
+ "name": "mmlu:astronomy",
385
+ "prompt_function": "mmlu_harness",
386
+ "hf_repo": "lighteval/mmlu",
387
+ "hf_subset": "astronomy",
388
+ "metric": [
389
+ "loglikelihood_acc"
390
+ ],
391
+ "hf_avail_splits": [
392
+ "auxiliary_train",
393
+ "test",
394
+ "validation",
395
+ "dev"
396
+ ],
397
+ "evaluation_splits": [
398
+ "test"
399
+ ],
400
+ "few_shots_split": "dev",
401
+ "few_shots_select": "sequential",
402
+ "generation_size": 1,
403
+ "stop_sequence": [
404
+ "\n"
405
+ ],
406
+ "output_regex": null,
407
+ "frozen": false,
408
+ "suite": [
409
+ "leaderboard",
410
+ "mmlu"
411
+ ],
412
+ "original_num_docs": 152,
413
+ "effective_num_docs": 152,
414
+ "trust_dataset": true,
415
+ "must_remove_duplicate_docs": null
416
+ },
417
+ "leaderboard|mmlu:business_ethics": {
418
+ "name": "mmlu:business_ethics",
419
+ "prompt_function": "mmlu_harness",
420
+ "hf_repo": "lighteval/mmlu",
421
+ "hf_subset": "business_ethics",
422
+ "metric": [
423
+ "loglikelihood_acc"
424
+ ],
425
+ "hf_avail_splits": [
426
+ "auxiliary_train",
427
+ "test",
428
+ "validation",
429
+ "dev"
430
+ ],
431
+ "evaluation_splits": [
432
+ "test"
433
+ ],
434
+ "few_shots_split": "dev",
435
+ "few_shots_select": "sequential",
436
+ "generation_size": 1,
437
+ "stop_sequence": [
438
+ "\n"
439
+ ],
440
+ "output_regex": null,
441
+ "frozen": false,
442
+ "suite": [
443
+ "leaderboard",
444
+ "mmlu"
445
+ ],
446
+ "original_num_docs": 100,
447
+ "effective_num_docs": 100,
448
+ "trust_dataset": true,
449
+ "must_remove_duplicate_docs": null
450
+ },
451
+ "leaderboard|mmlu:clinical_knowledge": {
452
+ "name": "mmlu:clinical_knowledge",
453
+ "prompt_function": "mmlu_harness",
454
+ "hf_repo": "lighteval/mmlu",
455
+ "hf_subset": "clinical_knowledge",
456
+ "metric": [
457
+ "loglikelihood_acc"
458
+ ],
459
+ "hf_avail_splits": [
460
+ "auxiliary_train",
461
+ "test",
462
+ "validation",
463
+ "dev"
464
+ ],
465
+ "evaluation_splits": [
466
+ "test"
467
+ ],
468
+ "few_shots_split": "dev",
469
+ "few_shots_select": "sequential",
470
+ "generation_size": 1,
471
+ "stop_sequence": [
472
+ "\n"
473
+ ],
474
+ "output_regex": null,
475
+ "frozen": false,
476
+ "suite": [
477
+ "leaderboard",
478
+ "mmlu"
479
+ ],
480
+ "original_num_docs": 265,
481
+ "effective_num_docs": 265,
482
+ "trust_dataset": true,
483
+ "must_remove_duplicate_docs": null
484
+ },
485
+ "leaderboard|mmlu:college_biology": {
486
+ "name": "mmlu:college_biology",
487
+ "prompt_function": "mmlu_harness",
488
+ "hf_repo": "lighteval/mmlu",
489
+ "hf_subset": "college_biology",
490
+ "metric": [
491
+ "loglikelihood_acc"
492
+ ],
493
+ "hf_avail_splits": [
494
+ "auxiliary_train",
495
+ "test",
496
+ "validation",
497
+ "dev"
498
+ ],
499
+ "evaluation_splits": [
500
+ "test"
501
+ ],
502
+ "few_shots_split": "dev",
503
+ "few_shots_select": "sequential",
504
+ "generation_size": 1,
505
+ "stop_sequence": [
506
+ "\n"
507
+ ],
508
+ "output_regex": null,
509
+ "frozen": false,
510
+ "suite": [
511
+ "leaderboard",
512
+ "mmlu"
513
+ ],
514
+ "original_num_docs": 144,
515
+ "effective_num_docs": 144,
516
+ "trust_dataset": true,
517
+ "must_remove_duplicate_docs": null
518
+ },
519
+ "leaderboard|mmlu:college_chemistry": {
520
+ "name": "mmlu:college_chemistry",
521
+ "prompt_function": "mmlu_harness",
522
+ "hf_repo": "lighteval/mmlu",
523
+ "hf_subset": "college_chemistry",
524
+ "metric": [
525
+ "loglikelihood_acc"
526
+ ],
527
+ "hf_avail_splits": [
528
+ "auxiliary_train",
529
+ "test",
530
+ "validation",
531
+ "dev"
532
+ ],
533
+ "evaluation_splits": [
534
+ "test"
535
+ ],
536
+ "few_shots_split": "dev",
537
+ "few_shots_select": "sequential",
538
+ "generation_size": 1,
539
+ "stop_sequence": [
540
+ "\n"
541
+ ],
542
+ "output_regex": null,
543
+ "frozen": false,
544
+ "suite": [
545
+ "leaderboard",
546
+ "mmlu"
547
+ ],
548
+ "original_num_docs": 100,
549
+ "effective_num_docs": 100,
550
+ "trust_dataset": true,
551
+ "must_remove_duplicate_docs": null
552
+ },
553
+ "leaderboard|mmlu:college_computer_science": {
554
+ "name": "mmlu:college_computer_science",
555
+ "prompt_function": "mmlu_harness",
556
+ "hf_repo": "lighteval/mmlu",
557
+ "hf_subset": "college_computer_science",
558
+ "metric": [
559
+ "loglikelihood_acc"
560
+ ],
561
+ "hf_avail_splits": [
562
+ "auxiliary_train",
563
+ "test",
564
+ "validation",
565
+ "dev"
566
+ ],
567
+ "evaluation_splits": [
568
+ "test"
569
+ ],
570
+ "few_shots_split": "dev",
571
+ "few_shots_select": "sequential",
572
+ "generation_size": 1,
573
+ "stop_sequence": [
574
+ "\n"
575
+ ],
576
+ "output_regex": null,
577
+ "frozen": false,
578
+ "suite": [
579
+ "leaderboard",
580
+ "mmlu"
581
+ ],
582
+ "original_num_docs": 100,
583
+ "effective_num_docs": 100,
584
+ "trust_dataset": true,
585
+ "must_remove_duplicate_docs": null
586
+ },
587
+ "leaderboard|mmlu:college_mathematics": {
588
+ "name": "mmlu:college_mathematics",
589
+ "prompt_function": "mmlu_harness",
590
+ "hf_repo": "lighteval/mmlu",
591
+ "hf_subset": "college_mathematics",
592
+ "metric": [
593
+ "loglikelihood_acc"
594
+ ],
595
+ "hf_avail_splits": [
596
+ "auxiliary_train",
597
+ "test",
598
+ "validation",
599
+ "dev"
600
+ ],
601
+ "evaluation_splits": [
602
+ "test"
603
+ ],
604
+ "few_shots_split": "dev",
605
+ "few_shots_select": "sequential",
606
+ "generation_size": 1,
607
+ "stop_sequence": [
608
+ "\n"
609
+ ],
610
+ "output_regex": null,
611
+ "frozen": false,
612
+ "suite": [
613
+ "leaderboard",
614
+ "mmlu"
615
+ ],
616
+ "original_num_docs": 100,
617
+ "effective_num_docs": 100,
618
+ "trust_dataset": true,
619
+ "must_remove_duplicate_docs": null
620
+ },
621
+ "leaderboard|mmlu:college_medicine": {
622
+ "name": "mmlu:college_medicine",
623
+ "prompt_function": "mmlu_harness",
624
+ "hf_repo": "lighteval/mmlu",
625
+ "hf_subset": "college_medicine",
626
+ "metric": [
627
+ "loglikelihood_acc"
628
+ ],
629
+ "hf_avail_splits": [
630
+ "auxiliary_train",
631
+ "test",
632
+ "validation",
633
+ "dev"
634
+ ],
635
+ "evaluation_splits": [
636
+ "test"
637
+ ],
638
+ "few_shots_split": "dev",
639
+ "few_shots_select": "sequential",
640
+ "generation_size": 1,
641
+ "stop_sequence": [
642
+ "\n"
643
+ ],
644
+ "output_regex": null,
645
+ "frozen": false,
646
+ "suite": [
647
+ "leaderboard",
648
+ "mmlu"
649
+ ],
650
+ "original_num_docs": 173,
651
+ "effective_num_docs": 173,
652
+ "trust_dataset": true,
653
+ "must_remove_duplicate_docs": null
654
+ },
655
+ "leaderboard|mmlu:college_physics": {
656
+ "name": "mmlu:college_physics",
657
+ "prompt_function": "mmlu_harness",
658
+ "hf_repo": "lighteval/mmlu",
659
+ "hf_subset": "college_physics",
660
+ "metric": [
661
+ "loglikelihood_acc"
662
+ ],
663
+ "hf_avail_splits": [
664
+ "auxiliary_train",
665
+ "test",
666
+ "validation",
667
+ "dev"
668
+ ],
669
+ "evaluation_splits": [
670
+ "test"
671
+ ],
672
+ "few_shots_split": "dev",
673
+ "few_shots_select": "sequential",
674
+ "generation_size": 1,
675
+ "stop_sequence": [
676
+ "\n"
677
+ ],
678
+ "output_regex": null,
679
+ "frozen": false,
680
+ "suite": [
681
+ "leaderboard",
682
+ "mmlu"
683
+ ],
684
+ "original_num_docs": 102,
685
+ "effective_num_docs": 102,
686
+ "trust_dataset": true,
687
+ "must_remove_duplicate_docs": null
688
+ },
689
+ "leaderboard|mmlu:computer_security": {
690
+ "name": "mmlu:computer_security",
691
+ "prompt_function": "mmlu_harness",
692
+ "hf_repo": "lighteval/mmlu",
693
+ "hf_subset": "computer_security",
694
+ "metric": [
695
+ "loglikelihood_acc"
696
+ ],
697
+ "hf_avail_splits": [
698
+ "auxiliary_train",
699
+ "test",
700
+ "validation",
701
+ "dev"
702
+ ],
703
+ "evaluation_splits": [
704
+ "test"
705
+ ],
706
+ "few_shots_split": "dev",
707
+ "few_shots_select": "sequential",
708
+ "generation_size": 1,
709
+ "stop_sequence": [
710
+ "\n"
711
+ ],
712
+ "output_regex": null,
713
+ "frozen": false,
714
+ "suite": [
715
+ "leaderboard",
716
+ "mmlu"
717
+ ],
718
+ "original_num_docs": 100,
719
+ "effective_num_docs": 100,
720
+ "trust_dataset": true,
721
+ "must_remove_duplicate_docs": null
722
+ },
723
+ "leaderboard|mmlu:conceptual_physics": {
724
+ "name": "mmlu:conceptual_physics",
725
+ "prompt_function": "mmlu_harness",
726
+ "hf_repo": "lighteval/mmlu",
727
+ "hf_subset": "conceptual_physics",
728
+ "metric": [
729
+ "loglikelihood_acc"
730
+ ],
731
+ "hf_avail_splits": [
732
+ "auxiliary_train",
733
+ "test",
734
+ "validation",
735
+ "dev"
736
+ ],
737
+ "evaluation_splits": [
738
+ "test"
739
+ ],
740
+ "few_shots_split": "dev",
741
+ "few_shots_select": "sequential",
742
+ "generation_size": 1,
743
+ "stop_sequence": [
744
+ "\n"
745
+ ],
746
+ "output_regex": null,
747
+ "frozen": false,
748
+ "suite": [
749
+ "leaderboard",
750
+ "mmlu"
751
+ ],
752
+ "original_num_docs": 235,
753
+ "effective_num_docs": 235,
754
+ "trust_dataset": true,
755
+ "must_remove_duplicate_docs": null
756
+ },
757
+ "leaderboard|mmlu:econometrics": {
758
+ "name": "mmlu:econometrics",
759
+ "prompt_function": "mmlu_harness",
760
+ "hf_repo": "lighteval/mmlu",
761
+ "hf_subset": "econometrics",
762
+ "metric": [
763
+ "loglikelihood_acc"
764
+ ],
765
+ "hf_avail_splits": [
766
+ "auxiliary_train",
767
+ "test",
768
+ "validation",
769
+ "dev"
770
+ ],
771
+ "evaluation_splits": [
772
+ "test"
773
+ ],
774
+ "few_shots_split": "dev",
775
+ "few_shots_select": "sequential",
776
+ "generation_size": 1,
777
+ "stop_sequence": [
778
+ "\n"
779
+ ],
780
+ "output_regex": null,
781
+ "frozen": false,
782
+ "suite": [
783
+ "leaderboard",
784
+ "mmlu"
785
+ ],
786
+ "original_num_docs": 114,
787
+ "effective_num_docs": 114,
788
+ "trust_dataset": true,
789
+ "must_remove_duplicate_docs": null
790
+ },
791
+ "leaderboard|mmlu:electrical_engineering": {
792
+ "name": "mmlu:electrical_engineering",
793
+ "prompt_function": "mmlu_harness",
794
+ "hf_repo": "lighteval/mmlu",
795
+ "hf_subset": "electrical_engineering",
796
+ "metric": [
797
+ "loglikelihood_acc"
798
+ ],
799
+ "hf_avail_splits": [
800
+ "auxiliary_train",
801
+ "test",
802
+ "validation",
803
+ "dev"
804
+ ],
805
+ "evaluation_splits": [
806
+ "test"
807
+ ],
808
+ "few_shots_split": "dev",
809
+ "few_shots_select": "sequential",
810
+ "generation_size": 1,
811
+ "stop_sequence": [
812
+ "\n"
813
+ ],
814
+ "output_regex": null,
815
+ "frozen": false,
816
+ "suite": [
817
+ "leaderboard",
818
+ "mmlu"
819
+ ],
820
+ "original_num_docs": 145,
821
+ "effective_num_docs": 145,
822
+ "trust_dataset": true,
823
+ "must_remove_duplicate_docs": null
824
+ },
825
+ "leaderboard|mmlu:elementary_mathematics": {
826
+ "name": "mmlu:elementary_mathematics",
827
+ "prompt_function": "mmlu_harness",
828
+ "hf_repo": "lighteval/mmlu",
829
+ "hf_subset": "elementary_mathematics",
830
+ "metric": [
831
+ "loglikelihood_acc"
832
+ ],
833
+ "hf_avail_splits": [
834
+ "auxiliary_train",
835
+ "test",
836
+ "validation",
837
+ "dev"
838
+ ],
839
+ "evaluation_splits": [
840
+ "test"
841
+ ],
842
+ "few_shots_split": "dev",
843
+ "few_shots_select": "sequential",
844
+ "generation_size": 1,
845
+ "stop_sequence": [
846
+ "\n"
847
+ ],
848
+ "output_regex": null,
849
+ "frozen": false,
850
+ "suite": [
851
+ "leaderboard",
852
+ "mmlu"
853
+ ],
854
+ "original_num_docs": 378,
855
+ "effective_num_docs": 378,
856
+ "trust_dataset": true,
857
+ "must_remove_duplicate_docs": null
858
+ },
859
+ "leaderboard|mmlu:formal_logic": {
860
+ "name": "mmlu:formal_logic",
861
+ "prompt_function": "mmlu_harness",
862
+ "hf_repo": "lighteval/mmlu",
863
+ "hf_subset": "formal_logic",
864
+ "metric": [
865
+ "loglikelihood_acc"
866
+ ],
867
+ "hf_avail_splits": [
868
+ "auxiliary_train",
869
+ "test",
870
+ "validation",
871
+ "dev"
872
+ ],
873
+ "evaluation_splits": [
874
+ "test"
875
+ ],
876
+ "few_shots_split": "dev",
877
+ "few_shots_select": "sequential",
878
+ "generation_size": 1,
879
+ "stop_sequence": [
880
+ "\n"
881
+ ],
882
+ "output_regex": null,
883
+ "frozen": false,
884
+ "suite": [
885
+ "leaderboard",
886
+ "mmlu"
887
+ ],
888
+ "original_num_docs": 126,
889
+ "effective_num_docs": 126,
890
+ "trust_dataset": true,
891
+ "must_remove_duplicate_docs": null
892
+ },
+ "leaderboard|mmlu:global_facts": {
+ "name": "mmlu:global_facts",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "global_facts",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_biology": {
+ "name": "mmlu:high_school_biology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_biology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 310,
+ "effective_num_docs": 310,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_chemistry": {
+ "name": "mmlu:high_school_chemistry",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_chemistry",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 203,
+ "effective_num_docs": 203,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_computer_science": {
+ "name": "mmlu:high_school_computer_science",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_computer_science",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_european_history": {
+ "name": "mmlu:high_school_european_history",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_european_history",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 165,
+ "effective_num_docs": 165,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_geography": {
+ "name": "mmlu:high_school_geography",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_geography",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 198,
+ "effective_num_docs": 198,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_government_and_politics": {
+ "name": "mmlu:high_school_government_and_politics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_government_and_politics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 193,
+ "effective_num_docs": 193,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_macroeconomics": {
+ "name": "mmlu:high_school_macroeconomics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_macroeconomics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 390,
+ "effective_num_docs": 390,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_mathematics": {
+ "name": "mmlu:high_school_mathematics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_mathematics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 270,
+ "effective_num_docs": 270,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_microeconomics": {
+ "name": "mmlu:high_school_microeconomics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_microeconomics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 238,
+ "effective_num_docs": 238,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_physics": {
+ "name": "mmlu:high_school_physics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_physics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 151,
+ "effective_num_docs": 151,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_psychology": {
+ "name": "mmlu:high_school_psychology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_psychology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 545,
+ "effective_num_docs": 545,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_statistics": {
+ "name": "mmlu:high_school_statistics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_statistics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 216,
+ "effective_num_docs": 216,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_us_history": {
+ "name": "mmlu:high_school_us_history",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_us_history",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 204,
+ "effective_num_docs": 204,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:high_school_world_history": {
+ "name": "mmlu:high_school_world_history",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_world_history",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 237,
+ "effective_num_docs": 237,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:human_aging": {
+ "name": "mmlu:human_aging",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "human_aging",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 223,
+ "effective_num_docs": 223,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:human_sexuality": {
+ "name": "mmlu:human_sexuality",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "human_sexuality",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 131,
+ "effective_num_docs": 131,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:international_law": {
+ "name": "mmlu:international_law",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "international_law",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 121,
+ "effective_num_docs": 121,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:jurisprudence": {
+ "name": "mmlu:jurisprudence",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "jurisprudence",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 108,
+ "effective_num_docs": 108,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:logical_fallacies": {
+ "name": "mmlu:logical_fallacies",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "logical_fallacies",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 163,
+ "effective_num_docs": 163,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:machine_learning": {
+ "name": "mmlu:machine_learning",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "machine_learning",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 112,
+ "effective_num_docs": 112,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:management": {
+ "name": "mmlu:management",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "management",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 103,
+ "effective_num_docs": 103,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:marketing": {
+ "name": "mmlu:marketing",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "marketing",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 234,
+ "effective_num_docs": 234,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:medical_genetics": {
+ "name": "mmlu:medical_genetics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "medical_genetics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:miscellaneous": {
+ "name": "mmlu:miscellaneous",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "miscellaneous",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 783,
+ "effective_num_docs": 783,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:moral_disputes": {
+ "name": "mmlu:moral_disputes",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "moral_disputes",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 346,
+ "effective_num_docs": 346,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:moral_scenarios": {
+ "name": "mmlu:moral_scenarios",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "moral_scenarios",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 895,
+ "effective_num_docs": 895,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:nutrition": {
+ "name": "mmlu:nutrition",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "nutrition",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 306,
+ "effective_num_docs": 306,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:philosophy": {
+ "name": "mmlu:philosophy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "philosophy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 311,
+ "effective_num_docs": 311,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:prehistory": {
+ "name": "mmlu:prehistory",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "prehistory",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 324,
+ "effective_num_docs": 324,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:professional_accounting": {
+ "name": "mmlu:professional_accounting",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_accounting",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 282,
+ "effective_num_docs": 282,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:professional_law": {
+ "name": "mmlu:professional_law",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_law",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 1534,
+ "effective_num_docs": 1534,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:professional_medicine": {
+ "name": "mmlu:professional_medicine",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_medicine",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 272,
+ "effective_num_docs": 272,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:professional_psychology": {
+ "name": "mmlu:professional_psychology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_psychology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 612,
+ "effective_num_docs": 612,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:public_relations": {
+ "name": "mmlu:public_relations",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "public_relations",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 110,
+ "effective_num_docs": 110,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:security_studies": {
+ "name": "mmlu:security_studies",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "security_studies",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 245,
+ "effective_num_docs": 245,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:sociology": {
+ "name": "mmlu:sociology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "sociology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 201,
+ "effective_num_docs": 201,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:us_foreign_policy": {
+ "name": "mmlu:us_foreign_policy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "us_foreign_policy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:virology": {
+ "name": "mmlu:virology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "virology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 166,
+ "effective_num_docs": 166,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
2219
+ "leaderboard|mmlu:world_religions": {
2220
+ "name": "mmlu:world_religions",
2221
+ "prompt_function": "mmlu_harness",
2222
+ "hf_repo": "lighteval/mmlu",
2223
+ "hf_subset": "world_religions",
2224
+ "metric": [
2225
+ "loglikelihood_acc"
2226
+ ],
2227
+ "hf_avail_splits": [
2228
+ "auxiliary_train",
2229
+ "test",
2230
+ "validation",
2231
+ "dev"
2232
+ ],
2233
+ "evaluation_splits": [
2234
+ "test"
2235
+ ],
2236
+ "few_shots_split": "dev",
2237
+ "few_shots_select": "sequential",
2238
+ "generation_size": 1,
2239
+ "stop_sequence": [
2240
+ "\n"
2241
+ ],
2242
+ "output_regex": null,
2243
+ "frozen": false,
2244
+ "suite": [
2245
+ "leaderboard",
2246
+ "mmlu"
2247
+ ],
2248
+ "original_num_docs": 171,
2249
+ "effective_num_docs": 171,
2250
+ "trust_dataset": true,
2251
+ "must_remove_duplicate_docs": null
2252
+ }
2253
+ },
2254
+ "summary_tasks": {
2255
+ "leaderboard|mmlu:abstract_algebra|5": {
2256
+ "hashes": {
2257
+ "hash_examples": "4c76229e00c9c0e9",
2258
+ "hash_full_prompts": "c3130662e7cc91d3",
2259
+ "hash_input_tokens": "b617a339eb3b3eb7",
2260
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
2261
+ },
2262
+ "truncated": 0,
2263
+ "non_truncated": 100,
2264
+ "padded": 400,
2265
+ "non_padded": 0,
2266
+ "effective_few_shots": 5.0,
2267
+ "num_truncated_few_shots": 0
2268
+ },
2269
+ "leaderboard|mmlu:anatomy|5": {
2270
+ "hashes": {
2271
+ "hash_examples": "6a1f8104dccbd33b",
2272
+ "hash_full_prompts": "05a97165c871964d",
2273
+ "hash_input_tokens": "14e9962d3b1706ea",
2274
+ "hash_cont_tokens": "025910e68cf29c3d"
2275
+ },
2276
+ "truncated": 0,
2277
+ "non_truncated": 135,
2278
+ "padded": 540,
2279
+ "non_padded": 0,
2280
+ "effective_few_shots": 5.0,
2281
+ "num_truncated_few_shots": 0
2282
+ },
2283
+ "leaderboard|mmlu:astronomy|5": {
2284
+ "hashes": {
2285
+ "hash_examples": "1302effa3a76ce4c",
2286
+ "hash_full_prompts": "68355efd63c4de09",
2287
+ "hash_input_tokens": "44bd837a633de965",
2288
+ "hash_cont_tokens": "1a66fd04f03e0517"
2289
+ },
2290
+ "truncated": 0,
2291
+ "non_truncated": 152,
2292
+ "padded": 608,
2293
+ "non_padded": 0,
2294
+ "effective_few_shots": 5.0,
2295
+ "num_truncated_few_shots": 0
2296
+ },
2297
+ "leaderboard|mmlu:business_ethics|5": {
2298
+ "hashes": {
2299
+ "hash_examples": "03cb8bce5336419a",
2300
+ "hash_full_prompts": "8f440e0924442390",
2301
+ "hash_input_tokens": "16217026443317e4",
2302
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
2303
+ },
2304
+ "truncated": 0,
2305
+ "non_truncated": 100,
2306
+ "padded": 400,
2307
+ "non_padded": 0,
2308
+ "effective_few_shots": 5.0,
2309
+ "num_truncated_few_shots": 0
2310
+ },
2311
+ "leaderboard|mmlu:clinical_knowledge|5": {
2312
+ "hashes": {
2313
+ "hash_examples": "ffbb9c7b2be257f9",
2314
+ "hash_full_prompts": "595feee698057167",
2315
+ "hash_input_tokens": "896539d33768791a",
2316
+ "hash_cont_tokens": "de872053260a1588"
2317
+ },
2318
+ "truncated": 0,
2319
+ "non_truncated": 265,
2320
+ "padded": 1060,
2321
+ "non_padded": 0,
2322
+ "effective_few_shots": 5.0,
2323
+ "num_truncated_few_shots": 0
2324
+ },
2325
+ "leaderboard|mmlu:college_biology|5": {
2326
+ "hashes": {
2327
+ "hash_examples": "3ee77f176f38eb8e",
2328
+ "hash_full_prompts": "dcd354e231c805ee",
2329
+ "hash_input_tokens": "56c8c2aa3e63f094",
2330
+ "hash_cont_tokens": "9ace296b3e00bba3"
2331
+ },
2332
+ "truncated": 0,
2333
+ "non_truncated": 144,
2334
+ "padded": 576,
2335
+ "non_padded": 0,
2336
+ "effective_few_shots": 5.0,
2337
+ "num_truncated_few_shots": 0
2338
+ },
2339
+ "leaderboard|mmlu:college_chemistry|5": {
2340
+ "hashes": {
2341
+ "hash_examples": "ce61a69c46d47aeb",
2342
+ "hash_full_prompts": "a520ca0fd7868631",
2343
+ "hash_input_tokens": "0049443634b997e3",
2344
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
2345
+ },
2346
+ "truncated": 0,
2347
+ "non_truncated": 100,
2348
+ "padded": 400,
2349
+ "non_padded": 0,
2350
+ "effective_few_shots": 5.0,
2351
+ "num_truncated_few_shots": 0
2352
+ },
2353
+ "leaderboard|mmlu:college_computer_science|5": {
2354
+ "hashes": {
2355
+ "hash_examples": "32805b52d7d5daab",
2356
+ "hash_full_prompts": "ae8f53adf4b6a6e3",
2357
+ "hash_input_tokens": "894bbabad16b75a1",
2358
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
2359
+ },
2360
+ "truncated": 0,
2361
+ "non_truncated": 100,
2362
+ "padded": 400,
2363
+ "non_padded": 0,
2364
+ "effective_few_shots": 5.0,
2365
+ "num_truncated_few_shots": 0
2366
+ },
2367
+ "leaderboard|mmlu:college_mathematics|5": {
2368
+ "hashes": {
2369
+ "hash_examples": "55da1a0a0bd33722",
2370
+ "hash_full_prompts": "39cd3169534550f3",
2371
+ "hash_input_tokens": "5bfda6d5c7af507c",
2372
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
2373
+ },
2374
+ "truncated": 0,
2375
+ "non_truncated": 100,
2376
+ "padded": 400,
2377
+ "non_padded": 0,
2378
+ "effective_few_shots": 5.0,
2379
+ "num_truncated_few_shots": 0
2380
+ },
2381
+ "leaderboard|mmlu:college_medicine|5": {
2382
+ "hashes": {
2383
+ "hash_examples": "c33e143163049176",
2384
+ "hash_full_prompts": "bca31c5d5f3a0e4a",
2385
+ "hash_input_tokens": "13452a8f3d9b4b3d",
2386
+ "hash_cont_tokens": "c80c0b5489bdbc5a"
2387
+ },
2388
+ "truncated": 0,
2389
+ "non_truncated": 173,
2390
+ "padded": 692,
2391
+ "non_padded": 0,
2392
+ "effective_few_shots": 5.0,
2393
+ "num_truncated_few_shots": 0
2394
+ },
2395
+ "leaderboard|mmlu:college_physics|5": {
2396
+ "hashes": {
2397
+ "hash_examples": "ebdab1cdb7e555df",
2398
+ "hash_full_prompts": "f819d74029f4a018",
2399
+ "hash_input_tokens": "57c45bd30a378407",
2400
+ "hash_cont_tokens": "569fcb9ac44734ae"
2401
+ },
2402
+ "truncated": 0,
2403
+ "non_truncated": 102,
2404
+ "padded": 408,
2405
+ "non_padded": 0,
2406
+ "effective_few_shots": 5.0,
2407
+ "num_truncated_few_shots": 0
2408
+ },
2409
+ "leaderboard|mmlu:computer_security|5": {
2410
+ "hashes": {
2411
+ "hash_examples": "a24fd7d08a560921",
2412
+ "hash_full_prompts": "d0f4d31508009cd6",
2413
+ "hash_input_tokens": "0af9499b3cb67d95",
2414
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
2415
+ },
2416
+ "truncated": 0,
2417
+ "non_truncated": 100,
2418
+ "padded": 400,
2419
+ "non_padded": 0,
2420
+ "effective_few_shots": 5.0,
2421
+ "num_truncated_few_shots": 0
2422
+ },
2423
+ "leaderboard|mmlu:conceptual_physics|5": {
2424
+ "hashes": {
2425
+ "hash_examples": "8300977a79386993",
2426
+ "hash_full_prompts": "6e2f619c2f0da087",
2427
+ "hash_input_tokens": "00b0c9ac0fc683e8",
2428
+ "hash_cont_tokens": "6e88c64c1a76752a"
2429
+ },
2430
+ "truncated": 0,
2431
+ "non_truncated": 235,
2432
+ "padded": 940,
2433
+ "non_padded": 0,
2434
+ "effective_few_shots": 5.0,
2435
+ "num_truncated_few_shots": 0
2436
+ },
2437
+ "leaderboard|mmlu:econometrics|5": {
2438
+ "hashes": {
2439
+ "hash_examples": "ddde36788a04a46f",
2440
+ "hash_full_prompts": "3f81ad69c49e1691",
2441
+ "hash_input_tokens": "9314d720a35c62b6",
2442
+ "hash_cont_tokens": "a315e0e16c922c3c"
2443
+ },
2444
+ "truncated": 0,
2445
+ "non_truncated": 114,
2446
+ "padded": 456,
2447
+ "non_padded": 0,
2448
+ "effective_few_shots": 5.0,
2449
+ "num_truncated_few_shots": 0
2450
+ },
2451
+ "leaderboard|mmlu:electrical_engineering|5": {
2452
+ "hashes": {
2453
+ "hash_examples": "acbc5def98c19b3f",
2454
+ "hash_full_prompts": "f5ab31c3b1d51682",
2455
+ "hash_input_tokens": "863125c49d60d6a4",
2456
+ "hash_cont_tokens": "44c72e6a7422c304"
2457
+ },
2458
+ "truncated": 0,
2459
+ "non_truncated": 145,
2460
+ "padded": 580,
2461
+ "non_padded": 0,
2462
+ "effective_few_shots": 5.0,
2463
+ "num_truncated_few_shots": 0
2464
+ },
2465
+ "leaderboard|mmlu:elementary_mathematics|5": {
2466
+ "hashes": {
2467
+ "hash_examples": "146e61d07497a9bd",
2468
+ "hash_full_prompts": "3e6f38a631108730",
2469
+ "hash_input_tokens": "ed58bf384a932c74",
2470
+ "hash_cont_tokens": "cac0a6c304791bb7"
2471
+ },
2472
+ "truncated": 0,
2473
+ "non_truncated": 378,
2474
+ "padded": 1512,
2475
+ "non_padded": 0,
2476
+ "effective_few_shots": 5.0,
2477
+ "num_truncated_few_shots": 0
2478
+ },
2479
+ "leaderboard|mmlu:formal_logic|5": {
2480
+ "hashes": {
2481
+ "hash_examples": "8635216e1909a03f",
2482
+ "hash_full_prompts": "2db73981fed3cf02",
2483
+ "hash_input_tokens": "78b4957033a990a3",
2484
+ "hash_cont_tokens": "8801fad3bbc72e57"
2485
+ },
2486
+ "truncated": 0,
2487
+ "non_truncated": 126,
2488
+ "padded": 504,
2489
+ "non_padded": 0,
2490
+ "effective_few_shots": 5.0,
2491
+ "num_truncated_few_shots": 0
2492
+ },
2493
+ "leaderboard|mmlu:global_facts|5": {
2494
+ "hashes": {
2495
+ "hash_examples": "30b315aa6353ee47",
2496
+ "hash_full_prompts": "3b5eef82483c02a6",
2497
+ "hash_input_tokens": "65cf7f73e20e1bc1",
2498
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
2499
+ },
2500
+ "truncated": 0,
2501
+ "non_truncated": 100,
2502
+ "padded": 400,
2503
+ "non_padded": 0,
2504
+ "effective_few_shots": 5.0,
2505
+ "num_truncated_few_shots": 0
2506
+ },
2507
+ "leaderboard|mmlu:high_school_biology|5": {
2508
+ "hashes": {
2509
+ "hash_examples": "c9136373af2180de",
2510
+ "hash_full_prompts": "97a500550ada1104",
2511
+ "hash_input_tokens": "1c299ee1038cf043",
2512
+ "hash_cont_tokens": "2d57d9e2c5a1fd64"
2513
+ },
2514
+ "truncated": 0,
2515
+ "non_truncated": 310,
2516
+ "padded": 1240,
2517
+ "non_padded": 0,
2518
+ "effective_few_shots": 5.0,
2519
+ "num_truncated_few_shots": 0
2520
+ },
2521
+ "leaderboard|mmlu:high_school_chemistry|5": {
2522
+ "hashes": {
2523
+ "hash_examples": "b0661bfa1add6404",
2524
+ "hash_full_prompts": "7d42623066fb1e8e",
2525
+ "hash_input_tokens": "38aa4f175383a891",
2526
+ "hash_cont_tokens": "bb0fd92673ddfb31"
2527
+ },
2528
+ "truncated": 0,
2529
+ "non_truncated": 203,
2530
+ "padded": 812,
2531
+ "non_padded": 0,
2532
+ "effective_few_shots": 5.0,
2533
+ "num_truncated_few_shots": 0
2534
+ },
2535
+ "leaderboard|mmlu:high_school_computer_science|5": {
2536
+ "hashes": {
2537
+ "hash_examples": "80fc1d623a3d665f",
2538
+ "hash_full_prompts": "2af192ae1faf8c63",
2539
+ "hash_input_tokens": "5a1229c044a91023",
2540
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
2541
+ },
2542
+ "truncated": 0,
2543
+ "non_truncated": 100,
2544
+ "padded": 400,
2545
+ "non_padded": 0,
2546
+ "effective_few_shots": 5.0,
2547
+ "num_truncated_few_shots": 0
2548
+ },
2549
+ "leaderboard|mmlu:high_school_european_history|5": {
2550
+ "hashes": {
2551
+ "hash_examples": "854da6e5af0fe1a1",
2552
+ "hash_full_prompts": "189af6182c551e23",
2553
+ "hash_input_tokens": "f0e54538395a12c1",
2554
+ "hash_cont_tokens": "16e494cddccc4a04"
2555
+ },
2556
+ "truncated": 0,
2557
+ "non_truncated": 165,
2558
+ "padded": 656,
2559
+ "non_padded": 4,
2560
+ "effective_few_shots": 5.0,
2561
+ "num_truncated_few_shots": 0
2562
+ },
2563
+ "leaderboard|mmlu:high_school_geography|5": {
2564
+ "hashes": {
2565
+ "hash_examples": "7dc963c7acd19ad8",
2566
+ "hash_full_prompts": "0906f591b7f79a10",
2567
+ "hash_input_tokens": "40aceb5dde64fe64",
2568
+ "hash_cont_tokens": "16b7f65a07b3d47b"
2569
+ },
2570
+ "truncated": 0,
2571
+ "non_truncated": 198,
2572
+ "padded": 792,
2573
+ "non_padded": 0,
2574
+ "effective_few_shots": 5.0,
2575
+ "num_truncated_few_shots": 0
2576
+ },
2577
+ "leaderboard|mmlu:high_school_government_and_politics|5": {
2578
+ "hashes": {
2579
+ "hash_examples": "1f675dcdebc9758f",
2580
+ "hash_full_prompts": "7223a4aebabcdcbd",
2581
+ "hash_input_tokens": "96a4444be05f5ede",
2582
+ "hash_cont_tokens": "476e87fd675136aa"
2583
+ },
2584
+ "truncated": 0,
2585
+ "non_truncated": 193,
2586
+ "padded": 772,
2587
+ "non_padded": 0,
2588
+ "effective_few_shots": 5.0,
2589
+ "num_truncated_few_shots": 0
2590
+ },
2591
+ "leaderboard|mmlu:high_school_macroeconomics|5": {
2592
+ "hashes": {
2593
+ "hash_examples": "2fb32cf2d80f0b35",
2594
+ "hash_full_prompts": "9c32c005a808c453",
2595
+ "hash_input_tokens": "a78ba4100d84ecc5",
2596
+ "hash_cont_tokens": "b0c7b4c5f7bdf3e7"
2597
+ },
2598
+ "truncated": 0,
2599
+ "non_truncated": 390,
2600
+ "padded": 1560,
2601
+ "non_padded": 0,
2602
+ "effective_few_shots": 5.0,
2603
+ "num_truncated_few_shots": 0
2604
+ },
2605
+ "leaderboard|mmlu:high_school_mathematics|5": {
2606
+ "hashes": {
2607
+ "hash_examples": "fd6646fdb5d58a1f",
2608
+ "hash_full_prompts": "61845b4e3d0eafe9",
2609
+ "hash_input_tokens": "72e903543d60e864",
2610
+ "hash_cont_tokens": "1a05d6ff49846fd1"
2611
+ },
2612
+ "truncated": 0,
2613
+ "non_truncated": 270,
2614
+ "padded": 1080,
2615
+ "non_padded": 0,
2616
+ "effective_few_shots": 5.0,
2617
+ "num_truncated_few_shots": 0
2618
+ },
2619
+ "leaderboard|mmlu:high_school_microeconomics|5": {
2620
+ "hashes": {
2621
+ "hash_examples": "2118f21f71d87d84",
2622
+ "hash_full_prompts": "020f7f6e77a6b641",
2623
+ "hash_input_tokens": "8b428c95ab32cdeb",
2624
+ "hash_cont_tokens": "0e7f0645ffffd6cd"
2625
+ },
2626
+ "truncated": 0,
2627
+ "non_truncated": 238,
2628
+ "padded": 949,
2629
+ "non_padded": 3,
2630
+ "effective_few_shots": 5.0,
2631
+ "num_truncated_few_shots": 0
2632
+ },
2633
+ "leaderboard|mmlu:high_school_physics|5": {
2634
+ "hashes": {
2635
+ "hash_examples": "dc3ce06378548565",
2636
+ "hash_full_prompts": "571b28c0f53b90a0",
2637
+ "hash_input_tokens": "0862d9ba4184f5e6",
2638
+ "hash_cont_tokens": "41ca6560b8c10183"
2639
+ },
2640
+ "truncated": 0,
2641
+ "non_truncated": 151,
2642
+ "padded": 604,
2643
+ "non_padded": 0,
2644
+ "effective_few_shots": 5.0,
2645
+ "num_truncated_few_shots": 0
2646
+ },
2647
+ "leaderboard|mmlu:high_school_psychology|5": {
2648
+ "hashes": {
2649
+ "hash_examples": "c8d1d98a40e11f2f",
2650
+ "hash_full_prompts": "896e9a19476b90ed",
2651
+ "hash_input_tokens": "539679e51cf0dadf",
2652
+ "hash_cont_tokens": "53a17ff85c607844"
2653
+ },
2654
+ "truncated": 0,
2655
+ "non_truncated": 545,
2656
+ "padded": 2178,
2657
+ "non_padded": 2,
2658
+ "effective_few_shots": 5.0,
2659
+ "num_truncated_few_shots": 0
2660
+ },
2661
+ "leaderboard|mmlu:high_school_statistics|5": {
2662
+ "hashes": {
2663
+ "hash_examples": "666c8759b98ee4ff",
2664
+ "hash_full_prompts": "9ca986b471235e07",
2665
+ "hash_input_tokens": "d2df2e9ec9cc5ff9",
2666
+ "hash_cont_tokens": "bc9063ad140cc941"
2667
+ },
2668
+ "truncated": 0,
2669
+ "non_truncated": 216,
2670
+ "padded": 864,
2671
+ "non_padded": 0,
2672
+ "effective_few_shots": 5.0,
2673
+ "num_truncated_few_shots": 0
2674
+ },
2675
+ "leaderboard|mmlu:high_school_us_history|5": {
2676
+ "hashes": {
2677
+ "hash_examples": "95fef1c4b7d3f81e",
2678
+ "hash_full_prompts": "b4616b587c96945d",
2679
+ "hash_input_tokens": "1b9a891fe1e28335",
2680
+ "hash_cont_tokens": "5cf777085ba01096"
2681
+ },
2682
+ "truncated": 0,
2683
+ "non_truncated": 204,
2684
+ "padded": 816,
2685
+ "non_padded": 0,
2686
+ "effective_few_shots": 5.0,
2687
+ "num_truncated_few_shots": 0
2688
+ },
2689
+ "leaderboard|mmlu:high_school_world_history|5": {
2690
+ "hashes": {
2691
+ "hash_examples": "7e5085b6184b0322",
2692
+ "hash_full_prompts": "e790690fb05fa0d1",
2693
+ "hash_input_tokens": "60fc90341eab6ac2",
2694
+ "hash_cont_tokens": "152af2d9e4830517"
2695
+ },
2696
+ "truncated": 0,
2697
+ "non_truncated": 237,
2698
+ "padded": 948,
2699
+ "non_padded": 0,
2700
+ "effective_few_shots": 5.0,
2701
+ "num_truncated_few_shots": 0
2702
+ },
2703
+ "leaderboard|mmlu:human_aging|5": {
2704
+ "hashes": {
2705
+ "hash_examples": "c17333e7c7c10797",
2706
+ "hash_full_prompts": "327f9f213650f977",
2707
+ "hash_input_tokens": "3527cd9b1efd6b7c",
2708
+ "hash_cont_tokens": "da4d9eaa044021dd"
2709
+ },
2710
+ "truncated": 0,
2711
+ "non_truncated": 223,
2712
+ "padded": 892,
2713
+ "non_padded": 0,
2714
+ "effective_few_shots": 5.0,
2715
+ "num_truncated_few_shots": 0
2716
+ },
2717
+ "leaderboard|mmlu:human_sexuality|5": {
2718
+ "hashes": {
2719
+ "hash_examples": "4edd1e9045df5e3d",
2720
+ "hash_full_prompts": "0b6a52b3d3863745",
2721
+ "hash_input_tokens": "7a97714c98ec3df0",
2722
+ "hash_cont_tokens": "1b99e384258a4eeb"
2723
+ },
2724
+ "truncated": 0,
2725
+ "non_truncated": 131,
2726
+ "padded": 524,
2727
+ "non_padded": 0,
2728
+ "effective_few_shots": 5.0,
2729
+ "num_truncated_few_shots": 0
2730
+ },
2731
+ "leaderboard|mmlu:international_law|5": {
2732
+ "hashes": {
2733
+ "hash_examples": "db2fa00d771a062a",
2734
+ "hash_full_prompts": "429b8d84640cdf75",
2735
+ "hash_input_tokens": "7e572d7ea1a3e509",
2736
+ "hash_cont_tokens": "cbf02c30cdded208"
2737
+ },
2738
+ "truncated": 0,
2739
+ "non_truncated": 121,
2740
+ "padded": 484,
2741
+ "non_padded": 0,
2742
+ "effective_few_shots": 5.0,
2743
+ "num_truncated_few_shots": 0
2744
+ },
2745
+ "leaderboard|mmlu:jurisprudence|5": {
2746
+ "hashes": {
2747
+ "hash_examples": "e956f86b124076fe",
2748
+ "hash_full_prompts": "571f9505d9f6fa3d",
2749
+ "hash_input_tokens": "e771bba2041d48e1",
2750
+ "hash_cont_tokens": "4b248cf879d97a50"
2751
+ },
2752
+ "truncated": 0,
2753
+ "non_truncated": 108,
2754
+ "padded": 424,
2755
+ "non_padded": 8,
2756
+ "effective_few_shots": 5.0,
2757
+ "num_truncated_few_shots": 0
2758
+ },
2759
+ "leaderboard|mmlu:logical_fallacies|5": {
2760
+ "hashes": {
2761
+ "hash_examples": "956e0e6365ab79f1",
2762
+ "hash_full_prompts": "abf6d18a0245c552",
2763
+ "hash_input_tokens": "7016f4de62d61e8f",
2764
+ "hash_cont_tokens": "6d9c35172b158838"
2765
+ },
2766
+ "truncated": 0,
2767
+ "non_truncated": 163,
2768
+ "padded": 632,
2769
+ "non_padded": 20,
2770
+ "effective_few_shots": 5.0,
2771
+ "num_truncated_few_shots": 0
2772
+ },
2773
+ "leaderboard|mmlu:machine_learning|5": {
2774
+ "hashes": {
2775
+ "hash_examples": "397997cc6f4d581e",
2776
+ "hash_full_prompts": "8b9115560a815fab",
2777
+ "hash_input_tokens": "a718bd4f9fb8eab0",
2778
+ "hash_cont_tokens": "66c3ec85fee2fc98"
2779
+ },
2780
+ "truncated": 0,
2781
+ "non_truncated": 112,
2782
+ "padded": 448,
2783
+ "non_padded": 0,
2784
+ "effective_few_shots": 5.0,
2785
+ "num_truncated_few_shots": 0
2786
+ },
2787
+ "leaderboard|mmlu:management|5": {
2788
+ "hashes": {
2789
+ "hash_examples": "2bcbe6f6ca63d740",
2790
+ "hash_full_prompts": "f18191cecdc130be",
2791
+ "hash_input_tokens": "dd6a99048a822e5a",
2792
+ "hash_cont_tokens": "5e2470abd1fb9d10"
2793
+ },
2794
+ "truncated": 0,
2795
+ "non_truncated": 103,
2796
+ "padded": 412,
2797
+ "non_padded": 0,
2798
+ "effective_few_shots": 5.0,
2799
+ "num_truncated_few_shots": 0
2800
+ },
2801
+ "leaderboard|mmlu:marketing|5": {
2802
+ "hashes": {
2803
+ "hash_examples": "8ddb20d964a1b065",
2804
+ "hash_full_prompts": "ad9ff50246bf7d49",
2805
+ "hash_input_tokens": "fb59075fb468b035",
2806
+ "hash_cont_tokens": "27fe68d9630f8999"
2807
+ },
2808
+ "truncated": 0,
2809
+ "non_truncated": 234,
2810
+ "padded": 916,
2811
+ "non_padded": 20,
2812
+ "effective_few_shots": 5.0,
2813
+ "num_truncated_few_shots": 0
2814
+ },
2815
+ "leaderboard|mmlu:medical_genetics|5": {
2816
+ "hashes": {
2817
+ "hash_examples": "182a71f4763d2cea",
2818
+ "hash_full_prompts": "e95c568978da29c1",
2819
+ "hash_input_tokens": "6ec76fde9dca6553",
2820
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
2821
+ },
2822
+ "truncated": 0,
2823
+ "non_truncated": 100,
2824
+ "padded": 400,
2825
+ "non_padded": 0,
2826
+ "effective_few_shots": 5.0,
2827
+ "num_truncated_few_shots": 0
2828
+ },
2829
+ "leaderboard|mmlu:miscellaneous|5": {
2830
+ "hashes": {
2831
+ "hash_examples": "4c404fdbb4ca57fc",
2832
+ "hash_full_prompts": "468305dc71aa217c",
2833
+ "hash_input_tokens": "9ab5ce7430aeeff7",
2834
+ "hash_cont_tokens": "dfa423a160edd337"
2835
+ },
2836
+ "truncated": 0,
2837
+ "non_truncated": 783,
2838
+ "padded": 3128,
2839
+ "non_padded": 4,
2840
+ "effective_few_shots": 5.0,
2841
+ "num_truncated_few_shots": 0
2842
+ },
2843
+ "leaderboard|mmlu:moral_disputes|5": {
2844
+ "hashes": {
2845
+ "hash_examples": "60cbd2baa3fea5c9",
2846
+ "hash_full_prompts": "7a24f9c6f83420f2",
2847
+ "hash_input_tokens": "17712020d9c38d0f",
2848
+ "hash_cont_tokens": "bef966e6669349be"
2849
+ },
2850
+ "truncated": 0,
2851
+ "non_truncated": 346,
2852
+ "padded": 1380,
2853
+ "non_padded": 4,
2854
+ "effective_few_shots": 5.0,
2855
+ "num_truncated_few_shots": 0
2856
+ },
2857
+ "leaderboard|mmlu:moral_scenarios|5": {
2858
+ "hashes": {
2859
+ "hash_examples": "fd8b0431fbdd75ef",
2860
+ "hash_full_prompts": "8723c262038898c8",
2861
+ "hash_input_tokens": "a4a16b58339a1b08",
2862
+ "hash_cont_tokens": "a7bfdd944d86bcb5"
2863
+ },
2864
+ "truncated": 0,
2865
+ "non_truncated": 895,
2866
+ "padded": 3575,
2867
+ "non_padded": 5,
2868
+ "effective_few_shots": 5.0,
2869
+ "num_truncated_few_shots": 0
2870
+ },
2871
+ "leaderboard|mmlu:nutrition|5": {
2872
+ "hashes": {
2873
+ "hash_examples": "71e55e2b829b6528",
2874
+ "hash_full_prompts": "cc3034694d476c82",
2875
+ "hash_input_tokens": "4589c74e55901b66",
2876
+ "hash_cont_tokens": "fcda7736026f2449"
2877
+ },
2878
+ "truncated": 0,
2879
+ "non_truncated": 306,
2880
+ "padded": 1224,
2881
+ "non_padded": 0,
2882
+ "effective_few_shots": 5.0,
2883
+ "num_truncated_few_shots": 0
2884
+ },
2885
+ "leaderboard|mmlu:philosophy|5": {
2886
+ "hashes": {
2887
+ "hash_examples": "a6d489a8d208fa4b",
2888
+ "hash_full_prompts": "d92988a447a6ce08",
2889
+ "hash_input_tokens": "fa85837aaec1aef6",
2890
+ "hash_cont_tokens": "0f39b851342e8986"
2891
+ },
2892
+ "truncated": 0,
2893
+ "non_truncated": 311,
2894
+ "padded": 1244,
2895
+ "non_padded": 0,
2896
+ "effective_few_shots": 5.0,
2897
+ "num_truncated_few_shots": 0
2898
+ },
2899
+ "leaderboard|mmlu:prehistory|5": {
2900
+ "hashes": {
2901
+ "hash_examples": "6cc50f032a19acaa",
2902
+ "hash_full_prompts": "0d0d33c8f9bed861",
2903
+ "hash_input_tokens": "735ed41425466729",
2904
+ "hash_cont_tokens": "b60e45d3e9856b35"
2905
+ },
2906
+ "truncated": 0,
2907
+ "non_truncated": 324,
2908
+ "padded": 1280,
2909
+ "non_padded": 16,
2910
+ "effective_few_shots": 5.0,
2911
+ "num_truncated_few_shots": 0
2912
+ },
2913
+ "leaderboard|mmlu:professional_accounting|5": {
2914
+ "hashes": {
2915
+ "hash_examples": "50f57ab32f5f6cea",
2916
+ "hash_full_prompts": "9c809e7b8ca8ec1f",
2917
+ "hash_input_tokens": "b0c851d675e5355b",
2918
+ "hash_cont_tokens": "a0c4e121b7293818"
2919
+ },
2920
+ "truncated": 0,
2921
+ "non_truncated": 282,
2922
+ "padded": 1112,
2923
+ "non_padded": 16,
2924
+ "effective_few_shots": 5.0,
2925
+ "num_truncated_few_shots": 0
2926
+ },
2927
+ "leaderboard|mmlu:professional_law|5": {
2928
+ "hashes": {
2929
+ "hash_examples": "a8fdc85c64f4b215",
2930
+ "hash_full_prompts": "246b3e8a9054a5de",
2931
+ "hash_input_tokens": "c27b16ef17f69218",
2932
+ "hash_cont_tokens": "68b662abeba54fbc"
2933
+ },
2934
+ "truncated": 0,
2935
+ "non_truncated": 1534,
2936
+ "padded": 6136,
2937
+ "non_padded": 0,
2938
+ "effective_few_shots": 5.0,
2939
+ "num_truncated_few_shots": 0
2940
+ },
2941
+ "leaderboard|mmlu:professional_medicine|5": {
2942
+ "hashes": {
2943
+ "hash_examples": "c373a28a3050a73a",
2944
+ "hash_full_prompts": "f66dd653b5c5022b",
2945
+ "hash_input_tokens": "955343929a6793cb",
2946
+ "hash_cont_tokens": "6caeac5412bb4a09"
2947
+ },
2948
+ "truncated": 0,
2949
+ "non_truncated": 272,
2950
+ "padded": 1088,
2951
+ "non_padded": 0,
2952
+ "effective_few_shots": 5.0,
2953
+ "num_truncated_few_shots": 0
2954
+ },
2955
+ "leaderboard|mmlu:professional_psychology|5": {
2956
+ "hashes": {
2957
+ "hash_examples": "bf5254fe818356af",
2958
+ "hash_full_prompts": "03228f18e58fb42c",
2959
+ "hash_input_tokens": "a18463f8187e4322",
2960
+ "hash_cont_tokens": "79b091252a1095a9"
2961
+ },
2962
+ "truncated": 0,
2963
+ "non_truncated": 612,
2964
+ "padded": 2448,
2965
+ "non_padded": 0,
2966
+ "effective_few_shots": 5.0,
2967
+ "num_truncated_few_shots": 0
2968
+ },
2969
+ "leaderboard|mmlu:public_relations|5": {
2970
+ "hashes": {
2971
+ "hash_examples": "b66d52e28e7d14e0",
2972
+ "hash_full_prompts": "2717ec2f9cc3ea3f",
2973
+ "hash_input_tokens": "3118fb19254356b8",
2974
+ "hash_cont_tokens": "987115a77c8704f0"
2975
+ },
2976
+ "truncated": 0,
2977
+ "non_truncated": 110,
2978
+ "padded": 436,
2979
+ "non_padded": 4,
2980
+ "effective_few_shots": 5.0,
2981
+ "num_truncated_few_shots": 0
2982
+ },
2983
+ "leaderboard|mmlu:security_studies|5": {
2984
+ "hashes": {
2985
+ "hash_examples": "514c14feaf000ad9",
2986
+ "hash_full_prompts": "fd10221b4be3bf11",
2987
+ "hash_input_tokens": "619ae48b231f13d1",
2988
+ "hash_cont_tokens": "6c35bc7e96074b27"
2989
+ },
2990
+ "truncated": 0,
2991
+ "non_truncated": 245,
2992
+ "padded": 980,
2993
+ "non_padded": 0,
2994
+ "effective_few_shots": 5.0,
2995
+ "num_truncated_few_shots": 0
2996
+ },
2997
+ "leaderboard|mmlu:sociology|5": {
2998
+ "hashes": {
2999
+ "hash_examples": "f6c9bc9d18c80870",
3000
+ "hash_full_prompts": "16bc50365bda7e74",
3001
+ "hash_input_tokens": "e77c9db987dfeede",
3002
+ "hash_cont_tokens": "32af622f73b2e657"
3003
+ },
3004
+ "truncated": 0,
3005
+ "non_truncated": 201,
3006
+ "padded": 804,
3007
+ "non_padded": 0,
3008
+ "effective_few_shots": 5.0,
3009
+ "num_truncated_few_shots": 0
3010
+ },
3011
+ "leaderboard|mmlu:us_foreign_policy|5": {
3012
+ "hashes": {
3013
+ "hash_examples": "ed7b78629db6678f",
3014
+ "hash_full_prompts": "249ca3f4999e41ad",
3015
+ "hash_input_tokens": "0fa36661f20b1b58",
3016
+ "hash_cont_tokens": "9e1c9ca2c51de57e"
3017
+ },
3018
+ "truncated": 0,
3019
+ "non_truncated": 100,
3020
+ "padded": 400,
3021
+ "non_padded": 0,
3022
+ "effective_few_shots": 5.0,
3023
+ "num_truncated_few_shots": 0
3024
+ },
3025
+ "leaderboard|mmlu:virology|5": {
3026
+ "hashes": {
3027
+ "hash_examples": "bc52ffdc3f9b994a",
3028
+ "hash_full_prompts": "09939d976cecacd7",
3029
+ "hash_input_tokens": "b8237a5fe3c03938",
3030
+ "hash_cont_tokens": "beded8c3660dc8f5"
3031
+ },
3032
+ "truncated": 0,
3033
+ "non_truncated": 166,
3034
+ "padded": 664,
3035
+ "non_padded": 0,
3036
+ "effective_few_shots": 5.0,
3037
+ "num_truncated_few_shots": 0
3038
+ },
3039
+ "leaderboard|mmlu:world_religions|5": {
3040
+ "hashes": {
3041
+ "hash_examples": "ecdb4a4f94f62930",
3042
+ "hash_full_prompts": "addabd4dc9734c08",
3043
+ "hash_input_tokens": "23943b2941071751",
3044
+ "hash_cont_tokens": "9b1952a4af3d6a73"
3045
+ },
3046
+ "truncated": 0,
3047
+ "non_truncated": 171,
3048
+ "padded": 684,
3049
+ "non_padded": 0,
3050
+ "effective_few_shots": 5.0,
3051
+ "num_truncated_few_shots": 0
3052
+ }
3053
+ },
3054
+ "summary_general": {
3055
+ "hashes": {
3056
+ "hash_examples": "341a076d0beb7048",
3057
+ "hash_full_prompts": "11973fef11ba4c9d",
3058
+ "hash_input_tokens": "0e9d676b8e37ef05",
3059
+ "hash_cont_tokens": "25e9f343d6b95644"
3060
+ },
3061
+ "truncated": 0,
3062
+ "non_truncated": 14042,
3063
+ "padded": 56062,
3064
+ "non_padded": 106,
3065
+ "num_truncated_few_shots": 0
3066
+ }
3067
+ }
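A quick sanity check on the counts above, offered as a minimal sketch rather than part of the evaluation tooling. In every `summary_tasks` entry shown, and in `summary_general`, `padded + non_padded` equals four times `truncated + non_truncated`. This is consistent with one scored request per answer option, but that reading is an assumption inferred from the numbers and is not stated in the file. The helper name `check_request_counts` and the constant `CHOICES_PER_QUESTION` are hypothetical; the field names and sample values are copied from the results above.

```python
# Hypothetical helper: the field names come from the results file,
# everything else is illustrative.
CHOICES_PER_QUESTION = 4  # assumption: four answer options per question


def check_request_counts(summary, choices=CHOICES_PER_QUESTION):
    """Return the names of entries where padded + non_padded
    differs from choices * (truncated + non_truncated)."""
    mismatched = []
    for name, stats in summary.items():
        docs = stats["truncated"] + stats["non_truncated"]
        requests = stats["padded"] + stats["non_padded"]
        if requests != choices * docs:
            mismatched.append(name)
    return mismatched


# Values copied from the summary_tasks and summary_general blocks above.
sample = {
    "leaderboard|mmlu:abstract_algebra|5": {
        "truncated": 0, "non_truncated": 100, "padded": 400, "non_padded": 0,
    },
    "leaderboard|mmlu:jurisprudence|5": {
        "truncated": 0, "non_truncated": 108, "padded": 424, "non_padded": 8,
    },
    "leaderboard|mmlu:world_religions|5": {
        "truncated": 0, "non_truncated": 171, "padded": 684, "non_padded": 0,
    },
    "summary_general": {
        "truncated": 0, "non_truncated": 14042, "padded": 56062, "non_padded": 106,
    },
}

print(check_request_counts(sample))  # prints []
```

To run the same check on the full file, load it with `json.load` and pass the `summary_tasks` dictionary to the helper. A non-empty result would point at entries worth inspecting by hand.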