nm-research committed on
Commit
7137935
·
verified ·
1 Parent(s): e9feb9f

Create README.md

  1. README.md +602 -0
README.md ADDED
@@ -0,0 +1,602 @@
+ ---
+ tags:
+ - int8
+ - vllm
+ - chat
+ - neuralmagic
+ - llmcompressor
+ language:
+ - en
+ - de
+ - fr
+ - it
+ - pt
+ - hi
+ - es
+ - th
+ pipeline_tag: text-generation
+ license: llama3.3
+ base_model: meta-llama/Llama-3.3-70B-Instruct
+ ---
+
+ # Llama-3.3-70B-Instruct-quantized.w8a8
+
+ ## Model Overview
+ - **Model Architecture:** Llama
+ - **Input:** Text
+ - **Output:** Text
+ - **Model Optimizations:**
+   - **Activation quantization:** INT8
+   - **Weight quantization:** INT8
+ - **Intended Use Cases:** Intended for commercial and research use in multiple languages. Like [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), this model is intended for assistant-like chat.
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
+ - **Release Date:** 01/20/2025
+ - **Version:** 1.0
+ - **Model Developers:** Neural Magic
+
+ Quantized version of [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation.
+ Llama-3.3-70B-Instruct-quantized.w8a8 achieves 99.4% recovery for OpenLLM v1 (using Meta's prompting when available) and 100% for both HumanEval and HumanEval+ pass@1.
+
+ ### Model Optimizations
+
+ This model was obtained by quantizing the weights and activations of [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) to the INT8 data type.
+ This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
+ Weight quantization also reduces disk size requirements by approximately 50%.
+
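As a rough back-of-the-envelope check of the 50% figure, the sketch below estimates weight storage at 16 and 8 bits per parameter. The parameter count of 70 billion is an assumption read off the model name, and the helper `weight_gigabytes` is hypothetical. Quantization scales, activations, the KV cache, and any layers left unquantized are ignored, so real savings will be somewhat smaller.

```python
def weight_gigabytes(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 70e9  # assumption: nominal size taken from the model name

bf16_gb = weight_gigabytes(NUM_PARAMS, 16)  # 16-bit baseline
int8_gb = weight_gigabytes(NUM_PARAMS, 8)   # INT8 weights
reduction = 1 - int8_gb / bf16_gb

print(f"16-bit: {bf16_gb:.0f} GB, INT8: {int8_gb:.0f} GB, reduction: {reduction:.0%}")
```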
+ Only the weights and activations of the linear operators within transformer blocks are quantized.
+ Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension.
+ Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.
+
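The two schemes described above can be illustrated with a minimal pure-Python sketch. It uses plain round-to-nearest quantization, not the GPTQ procedure used to create this model, and every name in it (`quantize_symmetric`, `int8_linear`, the toy tensors) is hypothetical. Each weight row (output channel) gets one scale computed once, while each activation row (token) gets one scale computed at call time.

```python
def quantize_symmetric(values, qmax=127):
    """Quantize a list of floats to INT8 with a single symmetric scale (zero-point 0)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int8_linear(a_q, a_scales, w_q, w_scales):
    """Integer dot products, rescaled by (token scale * output-channel scale)."""
    out = []
    for tok, s_a in zip(a_q, a_scales):
        row = []
        for ch, s_w in zip(w_q, w_scales):
            acc = sum(x * w for x, w in zip(tok, ch))  # integer accumulation
            row.append(acc * s_a * s_w)
        out.append(row)
    return out


# Toy weights, shape (out_channels, in_features): static per-channel scales.
weights = [[0.50, -1.00, 0.25],
           [0.02, 0.01, -0.04]]
w_q, w_scales = zip(*(quantize_symmetric(row) for row in weights))

# Toy activations, shape (tokens, in_features): dynamic per-token scales.
activations = [[1.0, 2.0, -4.0],
               [0.1, 0.0, 0.3]]
a_q, a_scales = zip(*(quantize_symmetric(tok) for tok in activations))

approx = int8_linear(a_q, a_scales, w_q, w_scales)
exact = [[sum(x * w for x, w in zip(tok, ch)) for ch in weights] for tok in activations]
max_error = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
print(f"max abs error vs. floating point: {max_error:.4f}")
```

Because the second weight row is much smaller in magnitude than the first, a per-channel scale keeps its resolution instead of forcing it to share the range of the larger row.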
+ ## Deployment
+
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8"
+ number_gpus = 1
+ max_model_len = 8192
+
+ sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ messages = [
+     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+     {"role": "user", "content": "Who are you?"},
+ ]
+
+ # Render the chat messages into a single prompt string with the model's chat template.
+ prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+
+ # Load the quantized model, sharded across `number_gpus` GPUs.
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)
+
+ outputs = llm.generate(prompts, sampling_params)
+
+ # One prompt was submitted, so take the first completion of the first request.
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
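As an illustration of OpenAI-compatible serving, the sketch below builds a chat-completions request with the Python standard library only. It assumes a server was started separately (for example with `vllm serve neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8`) and is listening on vLLM's default `http://localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical, and the sampling values simply mirror the offline example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port
MODEL_ID = "neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8"


def build_chat_request(user_message: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.6,
        "top_p": 0.9,
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message: str) -> str:
    """Send the request and return the reply text; requires a running server."""
    with urllib.request.urlopen(build_chat_request(user_message)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(chat("Who are you?"))` prints the model's reply. The chat template is applied on the server side, so the client sends raw messages and does not format the prompt itself.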
+ ## Creation
+
+ This model was created by using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as presented in the code snippet below.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from datasets import Dataset
+ from llmcompressor.transformers import oneshot
+ from llmcompressor.modifiers.quantization import GPTQModifier
+ import random
+
+ model_id = "meta-llama/Llama-3.3-70B-Instruct"
+
+ num_samples = 1024
+ max_seq_len = 8192
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Build a calibration set of random token IDs, each sequence max_seq_len tokens long.
+ max_token_id = len(tokenizer.get_vocab()) - 1
+ input_ids = [[random.randint(0, max_token_id) for _ in range(max_seq_len)] for _ in range(num_samples)]
+ attention_mask = num_samples * [max_seq_len * [1]]
+ ds = Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})
+
+ # Quantize every Linear layer to W8A8 with GPTQ, leaving the output head untouched.
+ recipe = GPTQModifier(
+     targets="Linear",
+     scheme="W8A8",
+     ignore=["lm_head"],
+     dampening_frac=0.01,
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+ )
+
+ # Apply the recipe in one shot using the calibration data.
+ oneshot(
+     model=model,
+     dataset=ds,
+     recipe=recipe,
+     max_seq_length=max_seq_len,
+     num_calibration_samples=num_samples,
+ )
+
+ model.save_pretrained("Llama-3.3-70B-Instruct-quantized.w8a8")
+ ```
+
+ ## Evaluation
+
+ This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
+ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
+
+ OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
+
+ HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
+
+ ### Accuracy
+
+ <table>
+   <tr>
+     <th>Category</th>
+     <th>Benchmark</th>
+     <th>Llama-3.3-70B-Instruct</th>
+     <th>Llama-3.3-70B-Instruct-quantized.w8a8 (this model)</th>
+     <th>Recovery</th>
+   </tr>
+   <tr>
+     <td rowspan="8" ><strong>OpenLLM v1</strong></td>
+     <td>MMLU (5-shot)</td>
+     <td>81.60</td>
+     <td>81.19</td>
+     <td>99.5%</td>
+   </tr>
+   <tr>
+     <td>MMLU (CoT, 0-shot)</td>
+     <td>86.58</td>
+     <td>85.92</td>
+     <td>99.2%</td>
+   </tr>
+   <tr>
+     <td>ARC Challenge (0-shot)</td>
+     <td>49.23</td>
+     <td>48.04</td>
+     <td>97.6%</td>
+   </tr>
+   <tr>
+     <td>GSM-8K (CoT, 8-shot, strict-match)</td>
+     <td>94.16</td>
+     <td>94.01</td>
+     <td>99.8%</td>
+   </tr>
+   <tr>
+     <td>Hellaswag (10-shot)</td>
+     <td>86.49</td>
+     <td>86.47</td>
+     <td>100.0%</td>
+   </tr>
+   <tr>
+     <td>Winogrande (5-shot)</td>
+     <td>84.77</td>
+     <td>83.74</td>
+     <td>98.8%</td>
+   </tr>
+   <tr>
+     <td>TruthfulQA (0-shot, mc2)</td>
+     <td>62.75</td>
+     <td>63.09</td>
+     <td>100.5%</td>
+   </tr>
+   <tr>
+     <td><strong>Average</strong></td>
+     <td><strong>77.94</strong></td>
+     <td><strong>77.49</strong></td>
+     <td><strong>99.4%</strong></td>
+   </tr>
+   <tr>
+     <td rowspan="7" ><strong>OpenLLM v2</strong></td>
+     <td>MMLU-Pro (5-shot)</td>
+     <td>51.89</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td>IFEval (0-shot)</td>
+     <td>90.89</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td>BBH (3-shot)</td>
+     <td>63.15</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td>Math-lvl-5 (4-shot)</td>
+     <td>0.17</td>
+     <td>xxxx</td>
+     <td>N/A</td>
+   </tr>
+   <tr>
+     <td>GPQA (0-shot)</td>
+     <td>46.10</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td>MuSR (0-shot)</td>
+     <td>44.35</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td><strong>Average</strong></td>
+     <td><strong>49.42</strong></td>
+     <td><strong>xxxx</strong></td>
+     <td><strong>xxxx%</strong></td>
+   </tr>
+   <tr>
+     <td rowspan="2" ><strong>Coding</strong></td>
+     <td>HumanEval pass@1</td>
+     <td>83.20</td>
+     <td>83.30</td>
+     <td>100.1%</td>
+   </tr>
+   <tr>
+     <td>HumanEval+ pass@1</td>
+     <td>78.40</td>
+     <td>78.60</td>
+     <td>100.3%</td>
+   </tr>
+   <tr>
+     <td rowspan="7" ><strong>Multilingual</strong></td>
+     <td>Portuguese MMLU (5-shot)</td>
+     <td>79.76</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td>Spanish MMLU (5-shot)</td>
+     <td>79.33</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td>Italian MMLU (5-shot)</td>
+     <td>79.15</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td>German MMLU (5-shot)</td>
+     <td>77.94</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td>French MMLU (5-shot)</td>
+     <td>75.69</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td>Hindi MMLU (5-shot)</td>
+     <td>73.81</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+   <tr>
+     <td>Thai MMLU (5-shot)</td>
+     <td>71.97</td>
+     <td>xxxx</td>
+     <td>xxxx%</td>
+   </tr>
+ </table>
+
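The Recovery column appears to be the quantized score expressed as a percentage of the baseline score; the README does not state the formula, so that definition is an assumption here. The sketch below recomputes the OpenLLM v1 averages and the coding recoveries from the scores in the table, using a hypothetical helper `recovery`.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * quantized / baseline


# (baseline, quantized) scores copied from the OpenLLM v1 rows of the table.
openllm_v1 = {
    "MMLU (5-shot)": (81.60, 81.19),
    "MMLU (CoT, 0-shot)": (86.58, 85.92),
    "ARC Challenge (0-shot)": (49.23, 48.04),
    "GSM-8K (CoT, 8-shot, strict-match)": (94.16, 94.01),
    "Hellaswag (10-shot)": (86.49, 86.47),
    "Winogrande (5-shot)": (84.77, 83.74),
    "TruthfulQA (0-shot, mc2)": (62.75, 63.09),
}

baseline_avg = sum(b for b, _ in openllm_v1.values()) / len(openllm_v1)
quantized_avg = sum(q for _, q in openllm_v1.values()) / len(openllm_v1)
avg_recovery = recovery(baseline_avg, quantized_avg)

print(f"OpenLLM v1 average: {baseline_avg:.2f} -> {quantized_avg:.2f} "
      f"({avg_recovery:.1f}% recovery)")
print(f"HumanEval pass@1 recovery: {recovery(83.20, 83.30):.1f}%")
print(f"HumanEval+ pass@1 recovery: {recovery(78.40, 78.60):.1f}%")
```

A recovery above 100% means the quantized model scored slightly higher than the baseline on that benchmark, which is within the run-to-run noise one would expect for differences this small.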
+ ### Reproduction
+
+ The results were obtained using the following commands:
+
+ #### MMLU
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+   --tasks mmlu_llama_3.1_instruct \
+   --fewshot_as_multiturn \
+   --apply_chat_template \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
+
+ #### MMLU-CoT
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+   --tasks mmlu_cot_0shot_llama_3.1_instruct \
+   --apply_chat_template \
+   --num_fewshot 0 \
+   --batch_size auto
+ ```
+
+ #### ARC-Challenge
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
+   --tasks arc_challenge_llama_3.1_instruct \
+   --apply_chat_template \
+   --num_fewshot 0 \
+   --batch_size auto
+ ```
+
+ #### GSM-8K
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
+   --tasks gsm8k_cot_llama_3.1_instruct \
+   --fewshot_as_multiturn \
+   --apply_chat_template \
+   --num_fewshot 8 \
+   --batch_size auto
+ ```
+
+ #### Hellaswag
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+   --tasks hellaswag \
+   --num_fewshot 10 \
+   --batch_size auto
+ ```
+
+ #### Winogrande
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+   --tasks winogrande \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
+
+ #### TruthfulQA
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+   --tasks truthfulqa \
+   --num_fewshot 0 \
+   --batch_size auto
+ ```
+
+ #### OpenLLM v2
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
+   --apply_chat_template \
+   --fewshot_as_multiturn \
+   --tasks leaderboard \
+   --batch_size auto
+ ```
+
+ #### MMLU Portuguese
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+   --tasks mmlu_pt_llama_3.1_instruct \
+   --fewshot_as_multiturn \
+   --apply_chat_template \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
+
+ #### MMLU Spanish
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+   --tasks mmlu_es_llama_3.1_instruct \
+   --fewshot_as_multiturn \
+   --apply_chat_template \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
+
+ #### MMLU Italian
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+   --tasks mmlu_it_llama_3.1_instruct \
+   --fewshot_as_multiturn \
+   --apply_chat_template \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
+
+ #### MMLU German
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+   --tasks mmlu_de_llama_3.1_instruct \
+   --fewshot_as_multiturn \
+   --apply_chat_template \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
+
+ #### MMLU French
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+   --tasks mmlu_fr_llama_3.1_instruct \
+   --fewshot_as_multiturn \
+   --apply_chat_template \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
+
+ #### MMLU Hindi
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+   --tasks mmlu_hi_llama_3.1_instruct \
+   --fewshot_as_multiturn \
+   --apply_chat_template \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
+
+ #### MMLU Thai
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+   --tasks mmlu_th_llama_3.1_instruct \
+   --fewshot_as_multiturn \
+   --apply_chat_template \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
+
+ #### HumanEval and HumanEval+
+ ##### Generation
+ ```
+ python3 codegen/generate.py \
+   --model neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8 \
+   --bs 16 \
+   --temperature 0.2 \
+   --n_samples 50 \
+   --root "." \
+   --dataset humaneval
+ ```
+ ##### Sanitization
+ ```
+ python3 evalplus/sanitize.py \
+   humaneval/neuralmagic-ent--Llama-3.3-70B-Instruct-quantized.w8a8_vllm_temp_0.2
+ ```
+ ##### Evaluation
+ ```
+ evalplus.evaluate \
+   --dataset humaneval \
+   --samples humaneval/neuralmagic-ent--Llama-3.3-70B-Instruct-quantized.w8a8_vllm_temp_0.2-sanitized
+ ```