---
base_model:
- lars1234/Mistral-Small-24B-Instruct-2501-writer
---
vllm (pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-writer,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.924|±  |0.0168|
|     |       |strict-match    |     5|exact_match|↑  |0.920|±  |0.0172|

vllm (pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-writer,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.918|±  |0.0123|
|     |       |strict-match    |     5|exact_match|↑  |0.908|±  |0.0129|
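
The `Stderr` values in the two gsm8k tables above are consistent with the standard error of a mean over `limit` binary (0/1) outcomes. A minimal sketch, assuming the sample (n - 1) variance estimator; the function name `binomial_stderr` is hypothetical and not part of any library:

```python
import math

def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean of n 0/1 outcomes, using the sample (n - 1) variance."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))

# Values taken from the flexible-extract rows of the two tables above.
print(round(binomial_stderr(0.924, 250), 4))  # 0.0168 (limit 250)
print(round(binomial_stderr(0.918, 500), 4))  # 0.0123 (limit 500)
```

Doubling `limit` from 250 to 500 shrinks the standard error by roughly a factor of 1/sqrt(2), which is why the limit-500 rows carry tighter error bars.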

vllm (pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-writer,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7977|±  |0.0131|
| - humanities     |      2|none  |      |acc   |↑  |0.8256|±  |0.0263|
| - other          |      2|none  |      |acc   |↑  |0.8205|±  |0.0265|
| - social sciences|      2|none  |      |acc   |↑  |0.8556|±  |0.0256|
| - stem           |      2|none  |      |acc   |↑  |0.7263|±  |0.0249|
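
Each header line above a table records the evaluation settings used for that run. The sketch below shows how such a line might be mapped back to an approximate `lm_eval` command, assuming the standard lm-evaluation-harness CLI flags (`--model`, `--model_args`, `--tasks`, `--num_fewshot`, `--limit`, `--batch_size`). The helper `header_to_command` is hypothetical, and because the header does not store the task name, it is passed in separately.

```python
import re
import shlex

def header_to_command(header: str, tasks: str) -> str:
    """Turn an lm-eval results header line into an approximate `lm_eval` command."""
    match = re.match(r"(\w+) \((.*?)\), gen_kwargs: \((.*?)\), (.*)", header)
    if match is None:
        raise ValueError("unrecognized header line")
    model, model_args, _gen_kwargs, rest = match.groups()
    # Remaining fields look like "limit: 15.0, num_fewshot: None, batch_size: 1".
    settings = dict(item.split(": ", 1) for item in rest.split(", "))
    args = ["lm_eval", "--model", model, "--model_args", model_args, "--tasks", tasks]
    if settings["num_fewshot"] != "None":
        args += ["--num_fewshot", settings["num_fewshot"]]
    # The header prints the limit as a float (e.g. 250.0); the CLI takes an integer count.
    args += ["--limit", str(int(float(settings["limit"])))]
    args += ["--batch_size", settings["batch_size"]]
    return " ".join(shlex.quote(a) for a in args)

header = (
    "vllm (pretrained=/root/autodl-tmp/70-512-df10,add_bos_token=true,"
    "max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), "
    "limit: 250.0, num_fewshot: 5, batch_size: auto"
)
print(header_to_command(header, "gsm8k"))
```

The local `pretrained=` path would need to be replaced with wherever the checkpoint actually lives on the machine running the evaluation.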


vllm (pretrained=/root/autodl-tmp/70-512-df10-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.904|±  |0.0187|
|     |       |strict-match    |     5|exact_match|↑  |0.904|±  |0.0187|

vllm (pretrained=/root/autodl-tmp/70-512-df10-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.904|±  |0.0132|
|     |       |strict-match    |     5|exact_match|↑  |0.900|±  |0.0134|

vllm (pretrained=/root/autodl-tmp/70-512-df10-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7836|±  |0.0132|
| - humanities     |      2|none  |      |acc   |↑  |0.8103|±  |0.0261|
| - other          |      2|none  |      |acc   |↑  |0.8000|±  |0.0269|
| - social sciences|      2|none  |      |acc   |↑  |0.8556|±  |0.0248|
| - stem           |      2|none  |      |acc   |↑  |0.7088|±  |0.0256|


vllm (pretrained=/root/autodl-tmp/70-512-df8,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.896|±  |0.0193|
|     |       |strict-match    |     5|exact_match|↑  |0.896|±  |0.0193|

vllm (pretrained=/root/autodl-tmp/70-512-df8,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.894|±  |0.0138|
|     |       |strict-match    |     5|exact_match|↑  |0.886|±  |0.0142|

vllm (pretrained=/root/autodl-tmp/70-512-df8,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7801|±  |0.0134|
| - humanities     |      2|none  |      |acc   |↑  |0.8154|±  |0.0261|
| - other          |      2|none  |      |acc   |↑  |0.7795|±  |0.0280|
| - social sciences|      2|none  |      |acc   |↑  |0.8444|±  |0.0263|
| - stem           |      2|none  |      |acc   |↑  |0.7158|±  |0.0254|


vllm (pretrained=/root/autodl-tmp/70-512-df10,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.920|±  |0.0172|
|     |       |strict-match    |     5|exact_match|↑  |0.920|±  |0.0172|

vllm (pretrained=/root/autodl-tmp/70-512-df10,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.910|±  |0.0128|
|     |       |strict-match    |     5|exact_match|↑  |0.906|±  |0.0131|

vllm (pretrained=/root/autodl-tmp/70-512-df10,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7836|±  |0.0133|
| - humanities     |      2|none  |      |acc   |↑  |0.8103|±  |0.0267|
| - other          |      2|none  |      |acc   |↑  |0.7949|±  |0.0271|
| - social sciences|      2|none  |      |acc   |↑  |0.8556|±  |0.0252|
| - stem           |      2|none  |      |acc   |↑  |0.7123|±  |0.0257|


vllm (pretrained=/root/autodl-tmp/70-512-df11,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.912|±  |0.0180|
|     |       |strict-match    |     5|exact_match|↑  |0.908|±  |0.0183|

vllm (pretrained=/root/autodl-tmp/70-512-df11,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.902|±  |0.0133|
|     |       |strict-match    |     5|exact_match|↑  |0.896|±  |0.0137|

vllm (pretrained=/root/autodl-tmp/70-512-df11,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7953|±  |0.0129|
| - humanities     |      2|none  |      |acc   |↑  |0.8308|±  |0.0251|
| - other          |      2|none  |      |acc   |↑  |0.7846|±  |0.0275|
| - social sciences|      2|none  |      |acc   |↑  |0.8722|±  |0.0241|
| - stem           |      2|none  |      |acc   |↑  |0.7298|±  |0.0247|


vllm (pretrained=/root/autodl-tmp/86-512-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.888|±  |0.0200|
|     |       |strict-match    |     5|exact_match|↑  |0.884|±  |0.0203|

vllm (pretrained=/root/autodl-tmp/86-512-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.880|±  |0.0145|
|     |       |strict-match    |     5|exact_match|↑  |0.872|±  |0.0150|

vllm (pretrained=/root/autodl-tmp/86-512-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7743|±  |0.0134|
| - humanities     |      2|none  |      |acc   |↑  |0.7897|±  |0.0280|
| - other          |      2|none  |      |acc   |↑  |0.7744|±  |0.0280|
| - social sciences|      2|none  |      |acc   |↑  |0.8833|±  |0.0233|
| - stem           |      2|none  |      |acc   |↑  |0.6947|±  |0.0258|
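
To judge whether the gap between two checkpoints exceeds sampling noise, the reported values and standard errors can be combined into a rough z-score. This is a sketch under a simplifying assumption: it treats the two runs as independent, whereas they actually share the same questions, so a paired test on per-question results would be more precise. The function name `z_score` is hypothetical.

```python
import math

def z_score(value_a: float, stderr_a: float, value_b: float, stderr_b: float) -> float:
    """Difference between two scores in units of their combined standard error."""
    return (value_a - value_b) / math.sqrt(stderr_a ** 2 + stderr_b ** 2)

# gsm8k flexible-extract at limit 500: base model vs. 86-512-uc
print(round(z_score(0.918, 0.0123, 0.880, 0.0145), 2))

# mmlu overall accuracy: base model vs. 86-512-uc
print(round(z_score(0.7977, 0.0131, 0.7743, 0.0134), 2))
```

A magnitude well below 2 suggests the difference is hard to distinguish from noise at these sample sizes; most of the checkpoint-to-checkpoint gaps in the tables above fall in that range.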