Commit ee2cc51 (verified) by alexmarques · Parent(s): 2886071

Update README.md

Files changed (1):
  1. README.md +28 -26
README.md CHANGED
@@ -32,7 +32,7 @@ base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
 - **Model Developers:** Neural Magic
 
 Quantized version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
- It achieves an average score of 67.57 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 69.32.
+ It achieves an average score of 72.58 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 74.25.
 
 ### Model Optimizations
 
@@ -129,6 +129,8 @@ The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande an
 Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
 This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
 
+ **Note:** Results have been updated after Meta modified the chat template.
+
 ### Accuracy
 
 #### Open LLM Leaderboard evaluation scores
@@ -146,29 +148,29 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
 <tr>
  <td>MMLU (5-shot)
  </td>
- <td>69.43
+ <td>68.32
  </td>
- <td>67.68
+ <td>66.89
  </td>
- <td>97.5%
+ <td>97.9%
  </td>
 </tr>
 <tr>
  <td>MMLU (CoT, 0-shot)
  </td>
- <td>72.56
+ <td>72.83
  </td>
- <td>70.36
+ <td>71.06
  </td>
- <td>97.0%
+ <td>97.6%
  </td>
 </tr>
 <tr>
  <td>ARC Challenge (0-shot)
  </td>
- <td>81.57
+ <td>81.40
  </td>
- <td>79.95
+ <td>80.20
  </td>
  <td>98.0%
  </td>
@@ -178,49 +180,49 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
  </td>
  <td>82.79
  </td>
- <td>79.53
+ <td>82.94
  </td>
- <td>96.1%
+ <td>100.2%
  </td>
 </tr>
 <tr>
  <td>Hellaswag (10-shot)
  </td>
- <td>80.01
+ <td>80.47
  </td>
- <td>78.57
+ <td>78.59
  </td>
- <td>98.2%
+ <td>97.7%
  </td>
 </tr>
 <tr>
  <td>Winogrande (5-shot)
  </td>
- <td>77.90
+ <td>78.06
  </td>
- <td>76.48
+ <td>76.40
  </td>
- <td>98.2%
+ <td>97.9%
  </td>
 </tr>
 <tr>
  <td>TruthfulQA (0-shot, mc2)
  </td>
- <td>54.04
+ <td>54.48
  </td>
  <td>50.46
  </td>
- <td>93.4%
+ <td>92.6%
  </td>
 </tr>
 <tr>
  <td><strong>Average</strong>
  </td>
- <td><strong>74.04</strong>
+ <td><strong>74.25</strong>
  </td>
- <td><strong>71.86</strong>
+ <td><strong>72.58</strong>
  </td>
- <td><strong>97.1%</strong>
+ <td><strong>97.7%</strong>
  </td>
 </tr>
 </table>
@@ -233,7 +235,7 @@ The results were obtained using the following commands:
 ```
 lm_eval \
   --model vllm \
-   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
   --tasks mmlu_llama_3.1_instruct \
   --fewshot_as_multiturn \
   --apply_chat_template \
@@ -245,7 +247,7 @@ lm_eval \
 ```
 lm_eval \
   --model vllm \
-   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
   --tasks mmlu_cot_0shot_llama_3.1_instruct \
   --apply_chat_template \
   --num_fewshot 0 \
@@ -256,7 +258,7 @@ lm_eval \
 ```
 lm_eval \
   --model vllm \
-   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
   --tasks arc_challenge_llama_3.1_instruct \
   --apply_chat_template \
   --num_fewshot 0 \
@@ -267,7 +269,7 @@ lm_eval \
 ```
 lm_eval \
   --model vllm \
-   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
   --tasks gsm8k_cot_llama_3.1_instruct \
   --fewshot_as_multiturn \
   --apply_chat_template \
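
Beyond the evaluation commands above, a minimal deployment sketch for the quantized checkpoint with vLLM; the prompt, sampling values, and `max_model_len` here are illustrative assumptions, not taken from this commit:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"

# Build a chat-formatted prompt with the model's (updated) chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is INT4 weight quantization?"}],  # illustrative prompt
    tokenize=False,
    add_generation_prompt=True,
)

# vLLM reads the quantization config from the checkpoint itself.
llm = LLM(model=model_id, max_model_len=4096)  # context length is an assumption
outputs = llm.generate(prompt, SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256))
print(outputs[0].outputs[0].text)
```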
 