alexmarques committed · Commit 65b86b1 · verified · Parent(s): cc2f1e3

Update README.md

Files changed (1): README.md (+31 -29)
README.md CHANGED
@@ -32,7 +32,7 @@ base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
  - **Model Developers:** Neural Magic

  Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
- It achieves scores within 1.5% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA.
+ It achieves scores within 3.1% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA.

  ### Model Optimizations
@@ -136,6 +136,8 @@ The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
  Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
  This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).

+ **Note:** Results have been updated after Meta modified the chat template.
+
  ### Accuracy

  #### Open LLM Leaderboard evaluation scores
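Since the revised scores trace back to Meta's chat-template change, one quick way to see the template a local checkout actually applies is to render it directly. A minimal sketch, assuming the standard transformers API and access to the tokenizer repo:

```python
# Sketch: render the chat template to see the prompt format the evaluations use.
# Assumes `transformers` is installed and recent enough to carry the updated template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16"
)

messages = [{"role": "user", "content": "What is 2 + 2?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The rendered prompt already begins with <|begin_of_text|>, which is
# presumably why the updated commands below drop add_bos_token=True:
# keeping it would prepend a second BOS token.
print(prompt)
```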
@@ -153,81 +155,81 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K
  <tr>
  <td>MMLU (5-shot)
  </td>
- <td>83.88
+ <td>83.94
  </td>
- <td>81.07
+ <td>81.37
  </td>
- <td>96.6%
+ <td>96.9%
  </td>
  </tr>
  <tr>
  <td>MMLU (CoT, 0-shot)
  </td>
- <td>85.74
+ <td>86.23
  </td>
- <td>83.29
+ <td>83.86
  </td>
- <td>97.1%
+ <td>97.2%
  </td>
  </tr>
  <tr>
  <td>ARC Challenge (0-shot)
  </td>
- <td>93.26
+ <td>93.34
  </td>
- <td>91.98
+ <td>92.32
  </td>
- <td>98.6%
+ <td>98.9%
  </td>
  </tr>
  <tr>
  <td>GSM-8K (CoT, 8-shot, strict-match)
  </td>
- <td>93.10
+ <td>95.38
  </td>
- <td>92.27
+ <td>93.25
  </td>
- <td>99.1%
+ <td>97.8%
  </td>
  </tr>
  <tr>
  <td>Hellaswag (10-shot)
  </td>
- <td>86.40
+ <td>86.66
  </td>
- <td>86.11
+ <td>86.16
  </td>
- <td>99.7%
+ <td>99.4%
  </td>
  </tr>
  <tr>
  <td>Winogrande (5-shot)
  </td>
- <td>85.00
+ <td>85.32
  </td>
  <td>84.14
  </td>
- <td>99.0%
+ <td>98.6%
  </td>
  </tr>
  <tr>
- <td>TruthfulQA (0-shot, mc2)
+ <td>TruthfulQA (0-shot, mc2)
  </td>
- <td>59.83
+ <td>60.65
  </td>
- <td>58.90
+ <td>58.89
  </td>
- <td>98.5%
+ <td>97.1%
  </td>
  </tr>
  <tr>
  <td><strong>Average</strong>
  </td>
- <td><strong>83.89</strong>
+ <td><strong>84.50</strong>
  </td>
- <td><strong>82.54</strong>
+ <td><strong>82.85</strong>
  </td>
- <td><strong>98.4%</strong>
+ <td><strong>98.0%</strong>
  </td>
  </tr>
  </table>
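For reference, the Recovery column is the quantized score expressed as a percentage of the unquantized score. The sketch below recomputes it from the updated values in the table above, assuming plain rounding to one decimal; an entry may differ from the table by 0.1 where the README rounds differently:

```python
# Recompute the Recovery column: 100 * quantized / unquantized.
# Scores transcribed from the updated table above.
scores = {
    "MMLU (5-shot)":            (83.94, 81.37),
    "MMLU (CoT, 0-shot)":       (86.23, 83.86),
    "ARC Challenge (0-shot)":   (93.34, 92.32),
    "GSM-8K (CoT, 8-shot)":     (95.38, 93.25),
    "Hellaswag (10-shot)":      (86.66, 86.16),
    "Winogrande (5-shot)":      (85.32, 84.14),
    "TruthfulQA (0-shot, mc2)": (60.65, 58.89),
}

for task, (unquantized, quantized) in scores.items():
    recovery = 100 * quantized / unquantized
    print(f"{task:<26} {recovery:.1f}%")  # e.g. MMLU (5-shot) -> 96.9%
```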
@@ -240,7 +242,7 @@ The results were obtained using the following commands:
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
  --tasks mmlu_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
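The same evaluation can also be driven from Python rather than the CLI. This is a sketch assuming a harness version in which the flags above map to keyword arguments of lm_eval.simple_evaluate; num_fewshot and batch_size are assumptions here, since the hunk cuts off before those flags:

```python
# Sketch: Python-API equivalent of the MMLU command above.
# Assumes the lm-evaluation-harness fork is installed (with vLLM support)
# and that simple_evaluate exposes these flags as keyword arguments.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16,"
        "dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1"
    ),
    tasks=["mmlu_llama_3.1_instruct"],
    num_fewshot=5,             # assumption: MMLU is reported 5-shot above
    fewshot_as_multiturn=True,
    apply_chat_template=True,
    batch_size="auto",         # assumption: not shown in the hunk
)
print(results["results"])
```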
@@ -252,7 +254,7 @@ lm_eval \
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks mmlu_cot_0shot_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
@@ -263,7 +265,7 @@ lm_eval \
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
  --tasks arc_challenge_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
@@ -274,7 +276,7 @@ lm_eval \
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks gsm8k_cot_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
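All of the commands above run the checkpoint through vLLM, so the same model can be queried directly for inference. A minimal offline sketch, assuming a vLLM version that provides the LLM.chat helper and hardware with enough memory for a 70B w8a16 model at tensor_parallel_size=1; the sampling settings are illustrative only:

```python
# Sketch: offline inference with vLLM, mirroring the evaluation settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",
    tensor_parallel_size=1,  # matches the evaluation commands above
    max_model_len=4096,
)

messages = [
    {"role": "user", "content": "Summarize weight-only INT8 quantization in one sentence."}
]
outputs = llm.chat(messages, SamplingParams(temperature=0.6, max_tokens=128))
print(outputs[0].outputs[0].text)
```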
 