lewtun (HF staff) committed
Commit 8962799 · verified · 1 Parent(s): d5031e7

Upload eval_results/orpo-explorers/hf-llama3-70b-orpo-CapybaraPreferences-bs-512/main/mmlu/results_2024-05-10T15-52-26.542088.json with huggingface_hub
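
Each `acc_stderr` in the uploaded results file appears consistent with the standard error of a mean of 0/1 scores computed with the sample standard deviation (ddof = 1). The sketch below checks that against one entry. `binomial_stderr` is a hypothetical helper name, not part of lighteval, and the formula is an inference from the reported numbers rather than something the commit states.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores whose mean is acc (ddof=1)."""
    if n < 2:
        raise ValueError("need at least two documents")
    # Sample variance of 0/1 scores is n/(n-1) * acc*(1-acc); dividing by n
    # and taking the square root gives sqrt(acc*(1-acc)/(n-1)).
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# acc and effective_num_docs for mmlu:abstract_algebra, copied from the file
print(round(binomial_stderr(0.46, 100), 6))  # 0.050091
```

The same check with `effective_num_docs` for other subsets (for example 135 for `mmlu:anatomy`) gives values close to the reported `acc_stderr`, which is why small subsets show wide error bars.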

eval_results/orpo-explorers/hf-llama3-70b-orpo-CapybaraPreferences-bs-512/main/mmlu/results_2024-05-10T15-52-26.542088.json ADDED
@@ -0,0 +1,3181 @@
+ {
+   "config_general": {
+     "lighteval_sha": "?",
+     "num_fewshot_seeds": 1,
+     "override_batch_size": 4,
+     "max_samples": null,
+     "job_id": "",
+     "start_time": 874624.771892548,
+     "end_time": 885291.537749163,
+     "total_evaluation_time_secondes": "10666.765856615035",
+     "model_name": "orpo-explorers/hf-llama3-70b-orpo-CapybaraPreferences-bs-512",
+     "model_sha": "db613279af79e9565a214c6af94c415ccf35f23a",
+     "model_dtype": "torch.bfloat16",
+     "model_size": "131.73 GB",
+     "config": null
+   },
+   "results": {
+     "leaderboard|mmlu:abstract_algebra|5": {
+       "acc": 0.46,
+       "acc_stderr": 0.05009082659620332
+     },
+     "leaderboard|mmlu:anatomy|5": {
+       "acc": 0.7777777777777778,
+       "acc_stderr": 0.03591444084196969
+     },
+     "leaderboard|mmlu:astronomy|5": {
+       "acc": 0.9276315789473685,
+       "acc_stderr": 0.02108501126188411
+     },
+     "leaderboard|mmlu:business_ethics|5": {
+       "acc": 0.82,
+       "acc_stderr": 0.03861229196653694
+     },
+     "leaderboard|mmlu:clinical_knowledge|5": {
+       "acc": 0.8264150943396227,
+       "acc_stderr": 0.023310583026006255
+     },
+     "leaderboard|mmlu:college_biology|5": {
+       "acc": 0.9236111111111112,
+       "acc_stderr": 0.022212203938345918
+     },
+     "leaderboard|mmlu:college_chemistry|5": {
+       "acc": 0.51,
+       "acc_stderr": 0.05024183937956912
+     },
+     "leaderboard|mmlu:college_computer_science|5": {
+       "acc": 0.73,
+       "acc_stderr": 0.044619604333847394
+     },
+     "leaderboard|mmlu:college_mathematics|5": {
+       "acc": 0.52,
+       "acc_stderr": 0.050211673156867795
+     },
+     "leaderboard|mmlu:college_medicine|5": {
+       "acc": 0.8034682080924855,
+       "acc_stderr": 0.03029957466478814
+     },
+     "leaderboard|mmlu:college_physics|5": {
+       "acc": 0.5196078431372549,
+       "acc_stderr": 0.04971358884367406
+     },
+     "leaderboard|mmlu:computer_security|5": {
+       "acc": 0.85,
+       "acc_stderr": 0.035887028128263714
+     },
+     "leaderboard|mmlu:conceptual_physics|5": {
+       "acc": 0.8212765957446808,
+       "acc_stderr": 0.02504537327205097
+     },
+     "leaderboard|mmlu:econometrics|5": {
+       "acc": 0.7017543859649122,
+       "acc_stderr": 0.04303684033537316
+     },
+     "leaderboard|mmlu:electrical_engineering|5": {
+       "acc": 0.7655172413793103,
+       "acc_stderr": 0.035306258743465914
+     },
+     "leaderboard|mmlu:elementary_mathematics|5": {
+       "acc": 0.6031746031746031,
+       "acc_stderr": 0.02519710107424649
+     },
+     "leaderboard|mmlu:formal_logic|5": {
+       "acc": 0.626984126984127,
+       "acc_stderr": 0.04325506042017086
+     },
+     "leaderboard|mmlu:global_facts|5": {
+       "acc": 0.48,
+       "acc_stderr": 0.05021167315686779
+     },
+     "leaderboard|mmlu:high_school_biology|5": {
+       "acc": 0.9064516129032258,
+       "acc_stderr": 0.01656575466827097
+     },
+     "leaderboard|mmlu:high_school_chemistry|5": {
+       "acc": 0.7536945812807881,
+       "acc_stderr": 0.03031509928561773
+     },
+     "leaderboard|mmlu:high_school_computer_science|5": {
+       "acc": 0.89,
+       "acc_stderr": 0.03144660377352203
+     },
+     "leaderboard|mmlu:high_school_european_history|5": {
+       "acc": 0.8666666666666667,
+       "acc_stderr": 0.026544435312706467
+     },
+     "leaderboard|mmlu:high_school_geography|5": {
+       "acc": 0.9343434343434344,
+       "acc_stderr": 0.017646526677233335
+     },
+     "leaderboard|mmlu:high_school_government_and_politics|5": {
+       "acc": 0.9844559585492227,
+       "acc_stderr": 0.00892749271508434
+     },
+     "leaderboard|mmlu:high_school_macroeconomics|5": {
+       "acc": 0.8153846153846154,
+       "acc_stderr": 0.0196716324131003
+     },
+     "leaderboard|mmlu:high_school_mathematics|5": {
+       "acc": 0.44814814814814813,
+       "acc_stderr": 0.03032116719631628
+     },
+     "leaderboard|mmlu:high_school_microeconomics|5": {
+       "acc": 0.8949579831932774,
+       "acc_stderr": 0.019916300758805225
+     },
+     "leaderboard|mmlu:high_school_physics|5": {
+       "acc": 0.6026490066225165,
+       "acc_stderr": 0.03995524007681689
+     },
+     "leaderboard|mmlu:high_school_psychology|5": {
+       "acc": 0.926605504587156,
+       "acc_stderr": 0.011180976446357573
+     },
+     "leaderboard|mmlu:high_school_statistics|5": {
+       "acc": 0.7037037037037037,
+       "acc_stderr": 0.03114144782353604
+     },
+     "leaderboard|mmlu:high_school_us_history|5": {
+       "acc": 0.9411764705882353,
+       "acc_stderr": 0.016514409561025806
+     },
+     "leaderboard|mmlu:high_school_world_history|5": {
+       "acc": 0.9451476793248945,
+       "acc_stderr": 0.014821471997344066
+     },
+     "leaderboard|mmlu:human_aging|5": {
+       "acc": 0.8071748878923767,
+       "acc_stderr": 0.026478240960489365
+     },
+     "leaderboard|mmlu:human_sexuality|5": {
+       "acc": 0.8625954198473282,
+       "acc_stderr": 0.030194823996804468
+     },
+     "leaderboard|mmlu:international_law|5": {
+       "acc": 0.9008264462809917,
+       "acc_stderr": 0.027285246312758957
+     },
+     "leaderboard|mmlu:jurisprudence|5": {
+       "acc": 0.8888888888888888,
+       "acc_stderr": 0.030381596756651662
+     },
+     "leaderboard|mmlu:logical_fallacies|5": {
+       "acc": 0.8588957055214724,
+       "acc_stderr": 0.027351605518389752
+     },
+     "leaderboard|mmlu:machine_learning|5": {
+       "acc": 0.625,
+       "acc_stderr": 0.04595091388086298
+     },
+     "leaderboard|mmlu:management|5": {
+       "acc": 0.912621359223301,
+       "acc_stderr": 0.027960689125970644
+     },
+     "leaderboard|mmlu:marketing|5": {
+       "acc": 0.9145299145299145,
+       "acc_stderr": 0.018315891685625835
+     },
+     "leaderboard|mmlu:medical_genetics|5": {
+       "acc": 0.91,
+       "acc_stderr": 0.028762349126466115
+     },
+     "leaderboard|mmlu:miscellaneous|5": {
+       "acc": 0.9054916985951469,
+       "acc_stderr": 0.010461015338193068
+     },
+     "leaderboard|mmlu:moral_disputes|5": {
+       "acc": 0.8583815028901735,
+       "acc_stderr": 0.01877113868405901
+     },
+     "leaderboard|mmlu:moral_scenarios|5": {
+       "acc": 0.6245810055865921,
+       "acc_stderr": 0.016195104248463526
+     },
+     "leaderboard|mmlu:nutrition|5": {
+       "acc": 0.8758169934640523,
+       "acc_stderr": 0.01888373290962623
+     },
+     "leaderboard|mmlu:philosophy|5": {
+       "acc": 0.8585209003215434,
+       "acc_stderr": 0.019794326658090555
+     },
+     "leaderboard|mmlu:prehistory|5": {
+       "acc": 0.9104938271604939,
+       "acc_stderr": 0.01588414107393756
+     },
+     "leaderboard|mmlu:professional_accounting|5": {
+       "acc": 0.6382978723404256,
+       "acc_stderr": 0.028663820147199492
+     },
+     "leaderboard|mmlu:professional_law|5": {
+       "acc": 0.6316818774445893,
+       "acc_stderr": 0.012319403369564642
+     },
+     "leaderboard|mmlu:professional_medicine|5": {
+       "acc": 0.8713235294117647,
+       "acc_stderr": 0.020340173153899015
+     },
+     "leaderboard|mmlu:professional_psychology|5": {
+       "acc": 0.8545751633986928,
+       "acc_stderr": 0.014261782879481034
+     },
+     "leaderboard|mmlu:public_relations|5": {
+       "acc": 0.7363636363636363,
+       "acc_stderr": 0.04220224692971987
+     },
+     "leaderboard|mmlu:security_studies|5": {
+       "acc": 0.8408163265306122,
+       "acc_stderr": 0.023420972069166348
+     },
+     "leaderboard|mmlu:sociology|5": {
+       "acc": 0.9353233830845771,
+       "acc_stderr": 0.017391600291491064
+     },
+     "leaderboard|mmlu:us_foreign_policy|5": {
+       "acc": 0.92,
+       "acc_stderr": 0.0272659924344291
+     },
+     "leaderboard|mmlu:virology|5": {
+       "acc": 0.5662650602409639,
+       "acc_stderr": 0.03858158940685515
+     },
+     "leaderboard|mmlu:world_religions|5": {
+       "acc": 0.8830409356725146,
+       "acc_stderr": 0.02464806896136616
+     },
+     "leaderboard|mmlu:_average|5": {
+       "acc": 0.7872300046778806,
+       "acc_stderr": 0.028087473645007556
+     },
+     "all": {
+       "acc": 0.7872300046778806,
+       "acc_stderr": 0.028087473645007556
+     }
+   },
+   "versions": {
+     "leaderboard|mmlu:abstract_algebra|5": 0,
+     "leaderboard|mmlu:anatomy|5": 0,
+     "leaderboard|mmlu:astronomy|5": 0,
+     "leaderboard|mmlu:business_ethics|5": 0,
+     "leaderboard|mmlu:clinical_knowledge|5": 0,
+     "leaderboard|mmlu:college_biology|5": 0,
+     "leaderboard|mmlu:college_chemistry|5": 0,
+     "leaderboard|mmlu:college_computer_science|5": 0,
+     "leaderboard|mmlu:college_mathematics|5": 0,
+     "leaderboard|mmlu:college_medicine|5": 0,
+     "leaderboard|mmlu:college_physics|5": 0,
+     "leaderboard|mmlu:computer_security|5": 0,
+     "leaderboard|mmlu:conceptual_physics|5": 0,
+     "leaderboard|mmlu:econometrics|5": 0,
+     "leaderboard|mmlu:electrical_engineering|5": 0,
+     "leaderboard|mmlu:elementary_mathematics|5": 0,
+     "leaderboard|mmlu:formal_logic|5": 0,
+     "leaderboard|mmlu:global_facts|5": 0,
+     "leaderboard|mmlu:high_school_biology|5": 0,
+     "leaderboard|mmlu:high_school_chemistry|5": 0,
+     "leaderboard|mmlu:high_school_computer_science|5": 0,
+     "leaderboard|mmlu:high_school_european_history|5": 0,
+     "leaderboard|mmlu:high_school_geography|5": 0,
+     "leaderboard|mmlu:high_school_government_and_politics|5": 0,
+     "leaderboard|mmlu:high_school_macroeconomics|5": 0,
+     "leaderboard|mmlu:high_school_mathematics|5": 0,
+     "leaderboard|mmlu:high_school_microeconomics|5": 0,
+     "leaderboard|mmlu:high_school_physics|5": 0,
+     "leaderboard|mmlu:high_school_psychology|5": 0,
+     "leaderboard|mmlu:high_school_statistics|5": 0,
+     "leaderboard|mmlu:high_school_us_history|5": 0,
+     "leaderboard|mmlu:high_school_world_history|5": 0,
+     "leaderboard|mmlu:human_aging|5": 0,
+     "leaderboard|mmlu:human_sexuality|5": 0,
+     "leaderboard|mmlu:international_law|5": 0,
+     "leaderboard|mmlu:jurisprudence|5": 0,
+     "leaderboard|mmlu:logical_fallacies|5": 0,
+     "leaderboard|mmlu:machine_learning|5": 0,
+     "leaderboard|mmlu:management|5": 0,
+     "leaderboard|mmlu:marketing|5": 0,
+     "leaderboard|mmlu:medical_genetics|5": 0,
+     "leaderboard|mmlu:miscellaneous|5": 0,
+     "leaderboard|mmlu:moral_disputes|5": 0,
+     "leaderboard|mmlu:moral_scenarios|5": 0,
+     "leaderboard|mmlu:nutrition|5": 0,
+     "leaderboard|mmlu:philosophy|5": 0,
+     "leaderboard|mmlu:prehistory|5": 0,
+     "leaderboard|mmlu:professional_accounting|5": 0,
+     "leaderboard|mmlu:professional_law|5": 0,
+     "leaderboard|mmlu:professional_medicine|5": 0,
+     "leaderboard|mmlu:professional_psychology|5": 0,
+     "leaderboard|mmlu:public_relations|5": 0,
+     "leaderboard|mmlu:security_studies|5": 0,
+     "leaderboard|mmlu:sociology|5": 0,
+     "leaderboard|mmlu:us_foreign_policy|5": 0,
+     "leaderboard|mmlu:virology|5": 0,
+     "leaderboard|mmlu:world_religions|5": 0
+   },
+   "config_tasks": {
+     "leaderboard|mmlu:abstract_algebra": {
+       "name": "mmlu:abstract_algebra",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "abstract_algebra",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:anatomy": {
+       "name": "mmlu:anatomy",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "anatomy",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 135,
+       "effective_num_docs": 135,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:astronomy": {
+       "name": "mmlu:astronomy",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "astronomy",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 152,
+       "effective_num_docs": 152,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:business_ethics": {
+       "name": "mmlu:business_ethics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "business_ethics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:clinical_knowledge": {
+       "name": "mmlu:clinical_knowledge",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "clinical_knowledge",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 265,
+       "effective_num_docs": 265,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:college_biology": {
+       "name": "mmlu:college_biology",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_biology",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 144,
+       "effective_num_docs": 144,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:college_chemistry": {
+       "name": "mmlu:college_chemistry",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_chemistry",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:college_computer_science": {
+       "name": "mmlu:college_computer_science",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_computer_science",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:college_mathematics": {
+       "name": "mmlu:college_mathematics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_mathematics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:college_medicine": {
+       "name": "mmlu:college_medicine",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_medicine",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 173,
+       "effective_num_docs": 173,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:college_physics": {
+       "name": "mmlu:college_physics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "college_physics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 102,
+       "effective_num_docs": 102,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:computer_security": {
+       "name": "mmlu:computer_security",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "computer_security",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 100,
+       "effective_num_docs": 100,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:conceptual_physics": {
+       "name": "mmlu:conceptual_physics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "conceptual_physics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 235,
+       "effective_num_docs": 235,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:econometrics": {
+       "name": "mmlu:econometrics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "econometrics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 114,
+       "effective_num_docs": 114,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:electrical_engineering": {
+       "name": "mmlu:electrical_engineering",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "electrical_engineering",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 145,
+       "effective_num_docs": 145,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:elementary_mathematics": {
+       "name": "mmlu:elementary_mathematics",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "elementary_mathematics",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 378,
+       "effective_num_docs": 378,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+     "leaderboard|mmlu:formal_logic": {
+       "name": "mmlu:formal_logic",
+       "prompt_function": "mmlu_harness",
+       "hf_repo": "lighteval/mmlu",
+       "hf_subset": "formal_logic",
+       "metric": [
+         "loglikelihood_acc"
+       ],
+       "hf_avail_splits": [
+         "auxiliary_train",
+         "test",
+         "validation",
+         "dev"
+       ],
+       "evaluation_splits": [
+         "test"
+       ],
+       "few_shots_split": "dev",
+       "few_shots_select": "sequential",
+       "generation_size": 1,
+       "stop_sequence": [
+         "\n"
+       ],
+       "output_regex": null,
+       "num_samples": null,
+       "frozen": false,
+       "suite": [
+         "leaderboard",
+         "mmlu"
+       ],
+       "original_num_docs": 126,
+       "effective_num_docs": 126,
+       "trust_dataset": true,
+       "must_remove_duplicate_docs": null,
+       "version": 0
+     },
+ "leaderboard|mmlu:global_facts": {
+ "name": "mmlu:global_facts",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "global_facts",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_biology": {
+ "name": "mmlu:high_school_biology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_biology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 310,
+ "effective_num_docs": 310,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_chemistry": {
+ "name": "mmlu:high_school_chemistry",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_chemistry",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 203,
+ "effective_num_docs": 203,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_computer_science": {
+ "name": "mmlu:high_school_computer_science",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_computer_science",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_european_history": {
+ "name": "mmlu:high_school_european_history",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_european_history",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 165,
+ "effective_num_docs": 165,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_geography": {
+ "name": "mmlu:high_school_geography",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_geography",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 198,
+ "effective_num_docs": 198,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_government_and_politics": {
+ "name": "mmlu:high_school_government_and_politics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_government_and_politics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 193,
+ "effective_num_docs": 193,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_macroeconomics": {
+ "name": "mmlu:high_school_macroeconomics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_macroeconomics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 390,
+ "effective_num_docs": 390,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_mathematics": {
+ "name": "mmlu:high_school_mathematics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_mathematics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 270,
+ "effective_num_docs": 270,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_microeconomics": {
+ "name": "mmlu:high_school_microeconomics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_microeconomics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 238,
+ "effective_num_docs": 238,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_physics": {
+ "name": "mmlu:high_school_physics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_physics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 151,
+ "effective_num_docs": 151,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_psychology": {
+ "name": "mmlu:high_school_psychology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_psychology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 545,
+ "effective_num_docs": 545,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_statistics": {
+ "name": "mmlu:high_school_statistics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_statistics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 216,
+ "effective_num_docs": 216,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_us_history": {
+ "name": "mmlu:high_school_us_history",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_us_history",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 204,
+ "effective_num_docs": 204,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:high_school_world_history": {
+ "name": "mmlu:high_school_world_history",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_world_history",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 237,
+ "effective_num_docs": 237,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:human_aging": {
+ "name": "mmlu:human_aging",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "human_aging",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 223,
+ "effective_num_docs": 223,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:human_sexuality": {
+ "name": "mmlu:human_sexuality",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "human_sexuality",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 131,
+ "effective_num_docs": 131,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:international_law": {
+ "name": "mmlu:international_law",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "international_law",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 121,
+ "effective_num_docs": 121,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:jurisprudence": {
+ "name": "mmlu:jurisprudence",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "jurisprudence",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 108,
+ "effective_num_docs": 108,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:logical_fallacies": {
+ "name": "mmlu:logical_fallacies",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "logical_fallacies",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 163,
+ "effective_num_docs": 163,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:machine_learning": {
+ "name": "mmlu:machine_learning",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "machine_learning",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 112,
+ "effective_num_docs": 112,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:management": {
+ "name": "mmlu:management",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "management",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 103,
+ "effective_num_docs": 103,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:marketing": {
+ "name": "mmlu:marketing",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "marketing",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 234,
+ "effective_num_docs": 234,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:medical_genetics": {
+ "name": "mmlu:medical_genetics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "medical_genetics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:miscellaneous": {
+ "name": "mmlu:miscellaneous",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "miscellaneous",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 783,
+ "effective_num_docs": 783,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:moral_disputes": {
+ "name": "mmlu:moral_disputes",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "moral_disputes",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 346,
+ "effective_num_docs": 346,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:moral_scenarios": {
+ "name": "mmlu:moral_scenarios",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "moral_scenarios",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 895,
+ "effective_num_docs": 895,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:nutrition": {
+ "name": "mmlu:nutrition",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "nutrition",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 306,
+ "effective_num_docs": 306,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:philosophy": {
+ "name": "mmlu:philosophy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "philosophy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 311,
+ "effective_num_docs": 311,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:prehistory": {
+ "name": "mmlu:prehistory",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "prehistory",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 324,
+ "effective_num_docs": 324,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:professional_accounting": {
+ "name": "mmlu:professional_accounting",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_accounting",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 282,
+ "effective_num_docs": 282,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:professional_law": {
+ "name": "mmlu:professional_law",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_law",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 1534,
+ "effective_num_docs": 1534,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:professional_medicine": {
+ "name": "mmlu:professional_medicine",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_medicine",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 272,
+ "effective_num_docs": 272,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:professional_psychology": {
+ "name": "mmlu:professional_psychology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "professional_psychology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 612,
+ "effective_num_docs": 612,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:public_relations": {
+ "name": "mmlu:public_relations",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "public_relations",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 110,
+ "effective_num_docs": 110,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:security_studies": {
+ "name": "mmlu:security_studies",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "security_studies",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 245,
+ "effective_num_docs": 245,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:sociology": {
+ "name": "mmlu:sociology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "sociology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 201,
+ "effective_num_docs": 201,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:us_foreign_policy": {
+ "name": "mmlu:us_foreign_policy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "us_foreign_policy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:virology": {
+ "name": "mmlu:virology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "virology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 166,
+ "effective_num_docs": 166,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ },
+ "leaderboard|mmlu:world_religions": {
+ "name": "mmlu:world_religions",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "world_religions",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "num_samples": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 171,
+ "effective_num_docs": 171,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null,
+ "version": 0
+ }
+ },
+ "summary_tasks": {
+ "leaderboard|mmlu:abstract_algebra|5": {
+ "hashes": {
+ "hash_examples": "4c76229e00c9c0e9",
+ "hash_full_prompts": "be1d976e6d5c128f",
+ "hash_input_tokens": "cb6ea25e3d135f17",
+ "hash_cont_tokens": "b1e74e2fab182909"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:anatomy|5": {
+ "hashes": {
+ "hash_examples": "6a1f8104dccbd33b",
+ "hash_full_prompts": "1c7da9587c3d0a6f",
+ "hash_input_tokens": "07b3b55a0c130436",
+ "hash_cont_tokens": "a14b5b1906dc16a3"
+ },
+ "truncated": 0,
+ "non_truncated": 135,
+ "padded": 540,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:astronomy|5": {
+ "hashes": {
+ "hash_examples": "1302effa3a76ce4c",
+ "hash_full_prompts": "a0ae0020f158d231",
+ "hash_input_tokens": "7862c0d6ce06aa8d",
+ "hash_cont_tokens": "235273fd0bc50bcd"
+ },
+ "truncated": 0,
+ "non_truncated": 152,
+ "padded": 608,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:business_ethics|5": {
+ "hashes": {
+ "hash_examples": "03cb8bce5336419a",
+ "hash_full_prompts": "8ec175d3cf10fa5f",
+ "hash_input_tokens": "3aaf5a6cbf0d70fe",
+ "hash_cont_tokens": "b1e74e2fab182909"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:clinical_knowledge|5": {
+ "hashes": {
+ "hash_examples": "ffbb9c7b2be257f9",
+ "hash_full_prompts": "065ecbcacc97fe38",
+ "hash_input_tokens": "4ffa980fa5170868",
+ "hash_cont_tokens": "c27aff2906fc75aa"
+ },
+ "truncated": 0,
+ "non_truncated": 265,
+ "padded": 1060,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_biology|5": {
+ "hashes": {
+ "hash_examples": "3ee77f176f38eb8e",
+ "hash_full_prompts": "b0b5a4b2c59b2e12",
+ "hash_input_tokens": "d36c7e22c178faa8",
+ "hash_cont_tokens": "28f68b5aab4efb1c"
+ },
+ "truncated": 0,
+ "non_truncated": 144,
+ "padded": 576,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_chemistry|5": {
+ "hashes": {
+ "hash_examples": "ce61a69c46d47aeb",
+ "hash_full_prompts": "f8f19b11abdde15b",
+ "hash_input_tokens": "efa0a4e595ae9a69",
+ "hash_cont_tokens": "b1e74e2fab182909"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_computer_science|5": {
+ "hashes": {
+ "hash_examples": "32805b52d7d5daab",
+ "hash_full_prompts": "1e62a12c19661965",
+ "hash_input_tokens": "3ef2e80eae923d3a",
+ "hash_cont_tokens": "b1e74e2fab182909"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_mathematics|5": {
+ "hashes": {
+ "hash_examples": "55da1a0a0bd33722",
+ "hash_full_prompts": "05042157014b04db",
+ "hash_input_tokens": "dfe5df2213f06caa",
+ "hash_cont_tokens": "b1e74e2fab182909"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_medicine|5": {
+ "hashes": {
+ "hash_examples": "c33e143163049176",
+ "hash_full_prompts": "f2b0d469ebd172ed",
+ "hash_input_tokens": "713d394fbcdb0b68",
+ "hash_cont_tokens": "a7bc5e74098b6e5f"
+ },
+ "truncated": 0,
+ "non_truncated": 173,
+ "padded": 692,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:college_physics|5": {
+ "hashes": {
+ "hash_examples": "ebdab1cdb7e555df",
+ "hash_full_prompts": "d944c0d7596f962b",
+ "hash_input_tokens": "ec137e8bf3f41722",
+ "hash_cont_tokens": "e50fa3937d31d8fb"
+ },
+ "truncated": 0,
+ "non_truncated": 102,
+ "padded": 408,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:computer_security|5": {
+ "hashes": {
+ "hash_examples": "a24fd7d08a560921",
+ "hash_full_prompts": "ebf096907d9a05a3",
+ "hash_input_tokens": "7263c5f54df20bd4",
+ "hash_cont_tokens": "b1e74e2fab182909"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:conceptual_physics|5": {
+ "hashes": {
+ "hash_examples": "8300977a79386993",
+ "hash_full_prompts": "ca95b98e981be9a3",
+ "hash_input_tokens": "9f25c452193c6f6d",
+ "hash_cont_tokens": "a9551e5af217ca25"
+ },
+ "truncated": 0,
+ "non_truncated": 235,
+ "padded": 940,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:econometrics|5": {
+ "hashes": {
+ "hash_examples": "ddde36788a04a46f",
+ "hash_full_prompts": "a39ce1f0f63cc17e",
+ "hash_input_tokens": "b7c509ab4a371966",
+ "hash_cont_tokens": "1616cbbcc0299188"
+ },
+ "truncated": 0,
+ "non_truncated": 114,
+ "padded": 456,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:electrical_engineering|5": {
+ "hashes": {
+ "hash_examples": "acbc5def98c19b3f",
+ "hash_full_prompts": "20affdb7036531a8",
+ "hash_input_tokens": "de47659889826285",
+ "hash_cont_tokens": "13d52dc7c10431df"
+ },
+ "truncated": 0,
+ "non_truncated": 145,
+ "padded": 580,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:elementary_mathematics|5": {
+ "hashes": {
+ "hash_examples": "146e61d07497a9bd",
+ "hash_full_prompts": "af38febe4083c2e1",
+ "hash_input_tokens": "f1f35a825740953c",
+ "hash_cont_tokens": "f7e8022519425282"
+ },
+ "truncated": 0,
+ "non_truncated": 378,
+ "padded": 1506,
+ "non_padded": 6,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:formal_logic|5": {
+ "hashes": {
+ "hash_examples": "8635216e1909a03f",
+ "hash_full_prompts": "34180417d28a3f13",
+ "hash_input_tokens": "f39f72b5553fe014",
+ "hash_cont_tokens": "bec51e4e496b5986"
+ },
+ "truncated": 0,
+ "non_truncated": 126,
+ "padded": 504,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:global_facts|5": {
+ "hashes": {
+ "hash_examples": "30b315aa6353ee47",
+ "hash_full_prompts": "52019a6beb4a621d",
+ "hash_input_tokens": "238d07cd64f33c25",
+ "hash_cont_tokens": "b1e74e2fab182909"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_biology|5": {
+ "hashes": {
+ "hash_examples": "c9136373af2180de",
+ "hash_full_prompts": "837e077bea943513",
+ "hash_input_tokens": "a81a6d1f49555626",
+ "hash_cont_tokens": "7c5f05353074320e"
+ },
+ "truncated": 0,
+ "non_truncated": 310,
+ "padded": 1236,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_chemistry|5": {
+ "hashes": {
+ "hash_examples": "b0661bfa1add6404",
+ "hash_full_prompts": "c2440245d3bbb010",
+ "hash_input_tokens": "0af96600f4924e59",
+ "hash_cont_tokens": "a062b42dc4e451a1"
+ },
+ "truncated": 0,
+ "non_truncated": 203,
+ "padded": 808,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_computer_science|5": {
+ "hashes": {
+ "hash_examples": "80fc1d623a3d665f",
+ "hash_full_prompts": "c967b6d80154bace",
+ "hash_input_tokens": "289dbe8c5cf0bd3d",
+ "hash_cont_tokens": "b1e74e2fab182909"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_european_history|5": {
+ "hashes": {
+ "hash_examples": "854da6e5af0fe1a1",
+ "hash_full_prompts": "ac93a021a27fa592",
+ "hash_input_tokens": "fe15cd70476c1759",
+ "hash_cont_tokens": "b7342549497ce598"
+ },
+ "truncated": 0,
+ "non_truncated": 165,
+ "padded": 656,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_geography|5": {
+ "hashes": {
+ "hash_examples": "7dc963c7acd19ad8",
+ "hash_full_prompts": "00bcca1547e57a8a",
+ "hash_input_tokens": "76232f6e0cfbaee7",
+ "hash_cont_tokens": "ba635a50235d17d6"
+ },
+ "truncated": 0,
+ "non_truncated": 198,
+ "padded": 785,
+ "non_padded": 7,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_government_and_politics|5": {
+ "hashes": {
+ "hash_examples": "1f675dcdebc9758f",
+ "hash_full_prompts": "e2160476bdff4519",
+ "hash_input_tokens": "e05b685b42c97ceb",
+ "hash_cont_tokens": "861078cb569a9a2d"
+ },
+ "truncated": 0,
+ "non_truncated": 193,
+ "padded": 772,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_macroeconomics|5": {
+ "hashes": {
+ "hash_examples": "2fb32cf2d80f0b35",
+ "hash_full_prompts": "f4f2159df23a4347",
+ "hash_input_tokens": "d4fe6d4234a70524",
+ "hash_cont_tokens": "1bd5d8a9878df20b"
+ },
+ "truncated": 0,
+ "non_truncated": 390,
+ "padded": 1552,
+ "non_padded": 8,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_mathematics|5": {
+ "hashes": {
+ "hash_examples": "fd6646fdb5d58a1f",
+ "hash_full_prompts": "3291431d2bd97481",
+ "hash_input_tokens": "02aac2b890712e9d",
+ "hash_cont_tokens": "d641c253ea3fb50b"
+ },
+ "truncated": 0,
+ "non_truncated": 270,
+ "padded": 1048,
+ "non_padded": 32,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_microeconomics|5": {
+ "hashes": {
+ "hash_examples": "2118f21f71d87d84",
+ "hash_full_prompts": "d1855383c248010f",
+ "hash_input_tokens": "fa540619b01f06ee",
+ "hash_cont_tokens": "ba80bf94e62b9d1d"
+ },
+ "truncated": 0,
+ "non_truncated": 238,
+ "padded": 924,
+ "non_padded": 28,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_physics|5": {
+ "hashes": {
+ "hash_examples": "dc3ce06378548565",
+ "hash_full_prompts": "f36ec9d80c8b0f0c",
+ "hash_input_tokens": "633f962bfab1bac3",
+ "hash_cont_tokens": "38f92c2d4b51791c"
+ },
+ "truncated": 0,
+ "non_truncated": 151,
+ "padded": 604,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_psychology|5": {
+ "hashes": {
+ "hash_examples": "c8d1d98a40e11f2f",
+ "hash_full_prompts": "dde45e16fb767c58",
+ "hash_input_tokens": "669f0abeeca5b7fc",
+ "hash_cont_tokens": "c73b94409db7bea8"
+ },
+ "truncated": 0,
+ "non_truncated": 545,
+ "padded": 2176,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_statistics|5": {
+ "hashes": {
+ "hash_examples": "666c8759b98ee4ff",
+ "hash_full_prompts": "c93affcf5336cf5e",
+ "hash_input_tokens": "e295c8102d282e5d",
+ "hash_cont_tokens": "550de2236ddcd3d7"
+ },
+ "truncated": 0,
+ "non_truncated": 216,
+ "padded": 864,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_us_history|5": {
+ "hashes": {
+ "hash_examples": "95fef1c4b7d3f81e",
+ "hash_full_prompts": "2e2a9e8162867f6e",
+ "hash_input_tokens": "e6dfbd3be06806cc",
+ "hash_cont_tokens": "fa0ad891ef2b914f"
+ },
+ "truncated": 0,
+ "non_truncated": 204,
+ "padded": 816,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:high_school_world_history|5": {
+ "hashes": {
+ "hash_examples": "7e5085b6184b0322",
+ "hash_full_prompts": "94ec503aeac5584a",
+ "hash_input_tokens": "0d636de7a339500c",
+ "hash_cont_tokens": "a762b3a2973ca3b3"
+ },
+ "truncated": 0,
+ "non_truncated": 237,
+ "padded": 948,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:human_aging|5": {
+ "hashes": {
+ "hash_examples": "c17333e7c7c10797",
+ "hash_full_prompts": "d682a28b02a3a8da",
+ "hash_input_tokens": "c2558da830bcdfa2",
+ "hash_cont_tokens": "cc785052ada0f4d2"
+ },
+ "truncated": 0,
+ "non_truncated": 223,
+ "padded": 892,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:human_sexuality|5": {
+ "hashes": {
+ "hash_examples": "4edd1e9045df5e3d",
+ "hash_full_prompts": "4d88d67635deb401",
+ "hash_input_tokens": "837e881958642ed9",
+ "hash_cont_tokens": "ba1fca3d357e2778"
+ },
+ "truncated": 0,
+ "non_truncated": 131,
+ "padded": 524,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:international_law|5": {
+ "hashes": {
+ "hash_examples": "db2fa00d771a062a",
+ "hash_full_prompts": "b0ff9397f10d9235",
+ "hash_input_tokens": "dc7fec54b4431648",
+ "hash_cont_tokens": "cc18c6558eedc4bc"
+ },
+ "truncated": 0,
+ "non_truncated": 121,
+ "padded": 484,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:jurisprudence|5": {
+ "hashes": {
+ "hash_examples": "e956f86b124076fe",
+ "hash_full_prompts": "e23a26d767e0dd7b",
+ "hash_input_tokens": "602b0b5b9a4790f6",
+ "hash_cont_tokens": "8931513df4f32f4a"
+ },
+ "truncated": 0,
+ "non_truncated": 108,
+ "padded": 420,
+ "non_padded": 12,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:logical_fallacies|5": {
+ "hashes": {
+ "hash_examples": "956e0e6365ab79f1",
+ "hash_full_prompts": "cd8348b5d9267855",
+ "hash_input_tokens": "9bb92944b592abaf",
+ "hash_cont_tokens": "1cdf879b3cebe91e"
+ },
+ "truncated": 0,
+ "non_truncated": 163,
+ "padded": 648,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:machine_learning|5": {
+ "hashes": {
+ "hash_examples": "397997cc6f4d581e",
+ "hash_full_prompts": "5a5fcc4f47d608f0",
+ "hash_input_tokens": "58bcbf0e2b14be5b",
+ "hash_cont_tokens": "7545fb7f81f641be"
+ },
+ "truncated": 0,
+ "non_truncated": 112,
+ "padded": 448,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:management|5": {
+ "hashes": {
+ "hash_examples": "2bcbe6f6ca63d740",
+ "hash_full_prompts": "b4250ce9c0dc270c",
+ "hash_input_tokens": "8fe683759ced01d7",
+ "hash_cont_tokens": "dac3108173edd07e"
+ },
+ "truncated": 0,
+ "non_truncated": 103,
+ "padded": 412,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:marketing|5": {
+ "hashes": {
+ "hash_examples": "8ddb20d964a1b065",
+ "hash_full_prompts": "069a2cd6bba2fd5d",
+ "hash_input_tokens": "482a24cb8bc8baba",
+ "hash_cont_tokens": "86873731b8b2342d"
+ },
+ "truncated": 0,
+ "non_truncated": 234,
+ "padded": 892,
+ "non_padded": 44,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:medical_genetics|5": {
+ "hashes": {
+ "hash_examples": "182a71f4763d2cea",
+ "hash_full_prompts": "60fa8284012873a5",
+ "hash_input_tokens": "5f1642b026bbc5b4",
+ "hash_cont_tokens": "b1e74e2fab182909"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:miscellaneous|5": {
+ "hashes": {
+ "hash_examples": "4c404fdbb4ca57fc",
+ "hash_full_prompts": "ba283db2543204fc",
+ "hash_input_tokens": "66ddccbe8bd5eab6",
+ "hash_cont_tokens": "ff17a87c03e638c1"
+ },
+ "truncated": 0,
+ "non_truncated": 783,
+ "padded": 3132,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:moral_disputes|5": {
+ "hashes": {
+ "hash_examples": "60cbd2baa3fea5c9",
+ "hash_full_prompts": "70b81077595e1ec3",
+ "hash_input_tokens": "b7959738fbca7bab",
+ "hash_cont_tokens": "1d40b5bbe8afbaed"
+ },
+ "truncated": 0,
+ "non_truncated": 346,
+ "padded": 1380,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:moral_scenarios|5": {
+ "hashes": {
+ "hash_examples": "fd8b0431fbdd75ef",
+ "hash_full_prompts": "6ba82d87dbc35d49",
+ "hash_input_tokens": "c5d3f35b0c482b0c",
+ "hash_cont_tokens": "1d48b7d571b76d89"
+ },
+ "truncated": 0,
+ "non_truncated": 895,
+ "padded": 3447,
+ "non_padded": 133,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:nutrition|5": {
+ "hashes": {
+ "hash_examples": "71e55e2b829b6528",
+ "hash_full_prompts": "2624daa57b3f97dd",
+ "hash_input_tokens": "a8b0a3e4afd48413",
+ "hash_cont_tokens": "664d16d1431ecbc7"
+ },
+ "truncated": 0,
+ "non_truncated": 306,
+ "padded": 1224,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:philosophy|5": {
+ "hashes": {
+ "hash_examples": "a6d489a8d208fa4b",
+ "hash_full_prompts": "295b3fb7dd4f86b5",
+ "hash_input_tokens": "15d5ee73c0fc5831",
+ "hash_cont_tokens": "92ca5851410cb91d"
+ },
+ "truncated": 0,
+ "non_truncated": 311,
+ "padded": 1244,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:prehistory|5": {
+ "hashes": {
+ "hash_examples": "6cc50f032a19acaa",
+ "hash_full_prompts": "53b7d0d6b018b188",
+ "hash_input_tokens": "b6a335990adc1033",
+ "hash_cont_tokens": "bba4bbb234487df6"
+ },
+ "truncated": 0,
+ "non_truncated": 324,
+ "padded": 1256,
+ "non_padded": 40,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_accounting|5": {
+ "hashes": {
+ "hash_examples": "50f57ab32f5f6cea",
+ "hash_full_prompts": "be79aed856d17e6b",
+ "hash_input_tokens": "6d823e7f4fa524dd",
+ "hash_cont_tokens": "f4a54bb8d07b6cf9"
+ },
+ "truncated": 0,
+ "non_truncated": 282,
+ "padded": 1108,
+ "non_padded": 20,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_law|5": {
+ "hashes": {
+ "hash_examples": "a8fdc85c64f4b215",
+ "hash_full_prompts": "e3cfe51a29b412b3",
+ "hash_input_tokens": "97ac12132430f10e",
+ "hash_cont_tokens": "f5012b40482f1956"
+ },
+ "truncated": 0,
+ "non_truncated": 1534,
+ "padded": 6136,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_medicine|5": {
+ "hashes": {
+ "hash_examples": "c373a28a3050a73a",
+ "hash_full_prompts": "756a5cc4bfbc8d94",
+ "hash_input_tokens": "6a2dfdb20458f2ff",
+ "hash_cont_tokens": "2a0af8cca646c87c"
+ },
+ "truncated": 0,
+ "non_truncated": 272,
+ "padded": 1088,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:professional_psychology|5": {
+ "hashes": {
+ "hash_examples": "bf5254fe818356af",
+ "hash_full_prompts": "d63b5400dc44ea47",
+ "hash_input_tokens": "f92df04f5ba817ed",
+ "hash_cont_tokens": "1be95eae5e663495"
+ },
+ "truncated": 0,
+ "non_truncated": 612,
+ "padded": 2448,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:public_relations|5": {
+ "hashes": {
+ "hash_examples": "b66d52e28e7d14e0",
+ "hash_full_prompts": "6266533bf8e7c7e7",
+ "hash_input_tokens": "ec3a2222cdba8ee6",
+ "hash_cont_tokens": "d885165284a3d1dc"
+ },
+ "truncated": 0,
+ "non_truncated": 110,
+ "padded": 432,
+ "non_padded": 8,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:security_studies|5": {
+ "hashes": {
+ "hash_examples": "514c14feaf000ad9",
+ "hash_full_prompts": "59cd46479169a2f1",
+ "hash_input_tokens": "3c1155d2e699cc36",
+ "hash_cont_tokens": "4b188bcf8e4c63dc"
+ },
+ "truncated": 0,
+ "non_truncated": 245,
+ "padded": 980,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:sociology|5": {
+ "hashes": {
+ "hash_examples": "f6c9bc9d18c80870",
+ "hash_full_prompts": "f154de679c2b8b52",
+ "hash_input_tokens": "b0d318e249ae90a4",
+ "hash_cont_tokens": "25ae64adfded17db"
+ },
+ "truncated": 0,
+ "non_truncated": 201,
+ "padded": 804,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:us_foreign_policy|5": {
+ "hashes": {
+ "hash_examples": "ed7b78629db6678f",
+ "hash_full_prompts": "a4f8e139cbc77589",
+ "hash_input_tokens": "24e2e88be9400b2a",
+ "hash_cont_tokens": "b1e74e2fab182909"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:virology|5": {
+ "hashes": {
+ "hash_examples": "bc52ffdc3f9b994a",
+ "hash_full_prompts": "1eb86dddbda6aaf2",
+ "hash_input_tokens": "1bf98aaa3b5a1fa9",
+ "hash_cont_tokens": "b9a3303d5aa72742"
+ },
+ "truncated": 0,
+ "non_truncated": 166,
+ "padded": 664,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:world_religions|5": {
+ "hashes": {
+ "hash_examples": "ecdb4a4f94f62930",
+ "hash_full_prompts": "ab245d4bb2be4d68",
+ "hash_input_tokens": "2dd2033fe8b8d185",
+ "hash_cont_tokens": "bbd486c0f082eb01"
+ },
+ "truncated": 0,
+ "non_truncated": 171,
+ "padded": 684,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ }
+ },
+ "summary_general": {
+ "hashes": {
+ "hash_examples": "341a076d0beb7048",
+ "hash_full_prompts": "1af54109608ae73f",
+ "hash_input_tokens": "58370c3f1934882c",
+ "hash_cont_tokens": "cabc83a7863b3d1b"
+ },
+ "truncated": 0,
+ "non_truncated": 14042,
+ "padded": 55806,
+ "non_padded": 362,
+ "num_truncated_few_shots": 0
+ }
+ }
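
A reading aid for the summary blocks above, offered as a sketch rather than as part of the uploaded file: in every entry shown, `padded + non_padded` equals four times `non_truncated`, which would be consistent with one scored request per answer choice on four-option questions. That interpretation is an inference from the numbers, not something the file states. The helper names `check_task_counts` and `total_counts` and the constant `CHOICES_PER_QUESTION` are hypothetical; the two sample entries copy only the count fields from `summary_tasks`.

```python
# Assumption: each question contributes four requests (one per answer choice).
CHOICES_PER_QUESTION = 4


def check_task_counts(summary_tasks: dict) -> list:
    """Return the names of tasks whose request counts break the assumed invariant."""
    bad = []
    for name, stats in summary_tasks.items():
        docs = stats["truncated"] + stats["non_truncated"]
        requests = stats["padded"] + stats["non_padded"]
        if requests != CHOICES_PER_QUESTION * docs:
            bad.append(name)
    return bad


def total_counts(summary_tasks: dict) -> dict:
    """Sum the count fields over all tasks, for comparison with summary_general."""
    keys = ("truncated", "non_truncated", "padded", "non_padded")
    return {k: sum(stats[k] for stats in summary_tasks.values()) for k in keys}


# Two entries excerpted from summary_tasks (count fields only).
sample = {
    "leaderboard|mmlu:abstract_algebra|5": {
        "truncated": 0, "non_truncated": 100, "padded": 400, "non_padded": 0,
    },
    "leaderboard|mmlu:elementary_mathematics|5": {
        "truncated": 0, "non_truncated": 378, "padded": 1506, "non_padded": 6,
    },
}

print(check_task_counts(sample))
print(total_counts(sample))
```

The same relation holds for the `summary_general` block: 55806 + 362 = 56168 = 4 × 14042. Running `total_counts` over the full `summary_tasks` object and comparing it with `summary_general` would be a natural follow-up check.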