alexmarques committed · Commit c402127 · verified · 1 Parent(s): 200fb5a

Update README.md

Files changed (1): README.md (+40 -17)
README.md CHANGED
@@ -131,7 +131,7 @@ model.save_pretrained("Meta-Llama-3.1-70B-Instruct-quantized.w8a8")
 
 The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
 Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
-This version of the lm-evaluation-harness includes versions of ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).
+This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).
 
 ### Accuracy
 
@@ -142,7 +142,7 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
 </td>
 <td><strong>Meta-Llama-3.1-70B-Instruct </strong>
 </td>
-<td><strong>Meta-Llama-3.1-70B-Instruct-quantized.w8a8 (this model)</strong>
+<td><strong>Meta-Llama-3.1-70B-Instruct-quantized.w8a16 (this model)</strong>
 </td>
 <td><strong>Recovery</strong>
 </td>
@@ -150,9 +150,19 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
 <tr>
 <td>MMLU (5-shot)
 </td>
-<td>82.21
+<td>83.88
 </td>
-<td>81.88
+<td>83.65
+</td>
+<td>99.7%
+</td>
+</tr>
+<tr>
+<td>MMLU (CoT, 0-shot)
+</td>
+<td>85.74
+</td>
+<td>85.41
 </td>
 <td>99.6%
 </td>
@@ -160,11 +170,11 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
 <tr>
 <td>ARC Challenge (0-shot)
 </td>
-<td>95.05
+<td>93.26
 </td>
-<td>94.97
+<td>93.26
 </td>
-<td>99.9%
+<td>100.0%
 </td>
 </tr>
 <tr>
@@ -210,11 +220,11 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
 <tr>
 <td><strong>Average</strong>
 </td>
-<td><strong>83.60</strong>
+<td><strong>83.89</strong>
 </td>
-<td><strong>83.71</strong>
+<td><strong>83.96</strong>
 </td>
-<td><strong>100.1%</strong>
+<td><strong>100.2%</strong>
 </td>
 </tr>
 </table>
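
For reference, the Recovery column is the quantized model's score expressed as a percentage of the unquantized baseline's score. A minimal sketch in Python using the updated MMLU rows above (the helper name is ours, not from the card):

```python
# Recovery = quantized score / baseline score, as a percentage.
def recovery(baseline: float, quantized: float) -> float:
    return 100.0 * quantized / baseline

print(f"{recovery(83.88, 83.65):.1f}%")  # MMLU (5-shot)      -> 99.7%
print(f"{recovery(85.74, 85.41):.1f}%")  # MMLU (CoT, 0-shot) -> 99.6%
```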
@@ -227,17 +237,30 @@ The results were obtained using the following commands:
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
-  --tasks mmlu \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+  --tasks mmlu_llama_3.1_instruct \
+  --fewshot_as_multiturn \
+  --apply_chat_template \
   --num_fewshot 5 \
   --batch_size auto
 ```
 
+#### MMLU-CoT
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+  --tasks mmlu_cot_0shot_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --batch_size auto
+```
+
 #### ARC-Challenge
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
   --tasks arc_challenge_llama_3.1_instruct \
   --apply_chat_template \
   --num_fewshot 0 \
@@ -248,7 +271,7 @@ lm_eval \
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
   --tasks gsm8k_cot_llama_3.1_instruct \
   --fewshot_as_multiturn \
   --apply_chat_template \
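
A hedged note on the new max_model_len/max_gen_toks pairs in these commands: the harness reserves max_gen_toks of the context for generation, so the prompt budget is roughly max_model_len minus max_gen_toks (our reading of lm-evaluation-harness's budgeting, worth verifying against the fork). A quick sanity check with the values above:

```python
# Values copied from the commands in this commit; the prompt-budget
# formula is an assumption about how the harness splits the context.
budgets = {
    "mmlu_llama_3.1_instruct":           (3850, 10),
    "mmlu_cot_0shot_llama_3.1_instruct": (4064, 1024),
    "arc_challenge_llama_3.1_instruct":  (3940, 100),
    "gsm8k_cot_llama_3.1_instruct":      (4096, 1024),
}
for task, (max_model_len, max_gen_toks) in budgets.items():
    print(f"{task}: ~{max_model_len - max_gen_toks} prompt tokens")
```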
@@ -260,7 +283,7 @@ lm_eval \
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
   --tasks hellaswag \
   --num_fewshot 10 \
   --batch_size auto
@@ -270,7 +293,7 @@ lm_eval \
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
   --tasks winogrande \
   --num_fewshot 5 \
   --batch_size auto
@@ -280,7 +303,7 @@ lm_eval \
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
   --tasks truthfulqa \
   --num_fewshot 0 \
   --batch_size auto
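
For context on the --model_args strings: lm-evaluation-harness forwards most of these keys to vLLM's offline LLM engine, while add_bos_token and max_gen_toks are harness-side options rather than vLLM ones (our assumption based on the flag names). A minimal sketch of the equivalent standalone vLLM setup; the prompt and sampling settings are illustrative only:

```python
from vllm import LLM, SamplingParams

# Engine arguments mirroring the --model_args used in the commands above;
# add_bos_token and max_gen_toks are handled by lm-evaluation-harness
# and are not passed through to vLLM (assumption).
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",
    dtype="auto",
    max_model_len=4096,
    tensor_parallel_size=1,  # this commit lowers it from 4 to 1
)

outputs = llm.generate(
    ["What is 8-bit weight quantization?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```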