---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.1
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
---

# Meta-Llama-3.1-8B-Instruct-FP8

## Model Overview
- **Model Architecture:** Meta-Llama-3.1
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. As with [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages beyond those listed as supported above.
- **Release Date:** 7/23/2024
- **Version:** 1.0
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

Quantized version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
It achieves an average score of 73.44 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 73.79.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) to the FP8 data type, ready for inference with vLLM built from source.
This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied: a single linear scale maps the FP8 representations of each quantized weight and activation tensor back to the original range.
[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization, with 512 sequences from UltraChat as calibration data.
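
To make the scheme concrete, the sketch below illustrates symmetric per-tensor FP8 (E4M3) quantization in plain PyTorch. It is a toy illustration, not LLM Compressor's implementation; it assumes a PyTorch version that provides the `torch.float8_e4m3fn` dtype, whose largest finite value is 448.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8_per_tensor(w: torch.Tensor):
    # Symmetric per-tensor scheme: one linear scale for the whole tensor,
    # no zero point.
    scale = w.abs().max().float() / FP8_E4M3_MAX
    w_fp8 = (w / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # The same scale maps the FP8 values back to the original range.
    return w_fp8.to(torch.float16) * scale

w = torch.randn(1024, 1024, dtype=torch.float16)
w_fp8, scale = quantize_fp8_per_tensor(w)
print(w.element_size(), "->", w_fp8.element_size())         # 2 -> 1 bytes per parameter
print((dequantize(w_fp8, scale) - w.float()).abs().max())   # small quantization error
```

Note that in the released model the activation scales are static: they are calibrated once on the UltraChat samples rather than recomputed from each live tensor as in this sketch.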

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template as a string and open the assistant turn.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
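
For example, one way to launch an OpenAI-compatible server is `python -m vllm.entrypoints.openai.api_server --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8`, after which the model can be queried with the standard `openai` client. The host, port (vLLM's default is 8000), and placeholder API key below are assumptions to adapt to your deployment:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```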

## Creation

This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code snippet below.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import (
    calculate_offload_device_map,
)

# FP8 W8A8 recipe: static, symmetric per-tensor scales for the weights and
# input activations of every Linear layer, with the lm_head left unquantized.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model_name = model_stub.split("/")[-1]

device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype="auto"
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype="auto", device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

output_dir = f"./{model_name}-FP8"

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 4096

ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

oneshot(
    model=model,
    output_dir=output_dir,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    save_compressed=True,
)
```
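
As a quick sanity check, the quantization scheme recorded in the saved checkpoint can be inspected before serving. This verification step is not part of the original recipe; it assumes the `quantization_config` field that LLM Compressor writes into the checkpoint's `config.json`:

```python
import json

# Path matches output_dir from the snippet above.
with open("./Meta-Llama-3.1-8B-Instruct-FP8/config.json") as f:
    cfg = json.load(f)

# Expect FP8 (num_bits: 8, type: float) for weights and input activations.
print(json.dumps(cfg.get("quantization_config", {}), indent=2))
```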

## Evaluation

The model was evaluated on the MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA benchmarks.
Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
This version of lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).

### Accuracy

#### Open LLM Leaderboard evaluation scores

| Benchmark | Meta-Llama-3.1-8B-Instruct | Meta-Llama-3.1-8B-Instruct-FP8 (this model) | Recovery |
| --- | --- | --- | --- |
| MMLU (5-shot) | 67.95 | 67.97 | 100.0% |
| MMLU-cot (0-shot) | 71.24 | 71.12 | 99.83% |
| ARC Challenge (0-shot) | 82.00 | 81.66 | 99.59% |
| GSM-8K-cot (8-shot, strict-match) | 81.96 | 81.12 | 98.98% |
| Hellaswag (10-shot) | 80.46 | 80.40 | 99.93% |
| Winogrande (5-shot) | 78.45 | 77.90 | 99.30% |
| TruthfulQA (0-shot, mc2) | 54.50 | 53.92 | 98.94% |
| **Average** | **73.79** | **73.44** | **99.52%** |
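
The Recovery column is the quantized model's score expressed as a percentage of the unquantized baseline; for example, for the GSM-8K-cot row:

```python
# Recovery = 100 * quantized / baseline (GSM-8K-cot row from the table above)
baseline, quantized = 81.96, 81.12
print(f"{100 * quantized / baseline:.2f}%")  # 98.98%
```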

### Reproduction

The results were obtained using the following commands:

#### MMLU
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto
```

#### MMLU-cot
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks mmlu_cot_0shot_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

#### ARC-Challenge
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks arc_challenge_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

#### GSM-8K
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks gsm8k_cot_llama_3.1_instruct \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --num_fewshot 8 \
  --batch_size auto
```

#### Hellaswag
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size auto
```

#### Winogrande
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size auto
```

#### TruthfulQA
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --batch_size auto
```