qlebourgeois and VictorBard committed
Commit c9f3de3 · verified · 1 Parent(s): cf6d54a

Upload 8 files (#1)

- Upload 8 files (679e7cf3ba15898762123cf01aa66d60415c27d8)


Co-authored-by: Victor Bard <[email protected]>

README.md CHANGED
@@ -1,3 +1,197 @@
- ---
- license: apache-2.0
- ---
+ ---
+ tags:
+ - fp8
+ - vllm
+ license: apache-2.0
+ ---
+
+ # Mixtral-8x7B-Instruct-v0.1-FP8
+
+ ## Model Overview
+ - **Model Architecture:** Mixtral-8x7B-Instruct-v0.1
+ - **Input:** Text
+ - **Output:** Text
+ - **Model Optimizations:**
+   - **Weight quantization:** FP8
+   - **Activation quantization:** FP8
+ - **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
+ - **Release Date:** 6/8/2024
+ - **Version:** 1.0
+ - **License(s):** [apache-2.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md)
+ - **Model Developers:** Neural Magic
+
+ Quantized version of [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
+ It achieves an average score of 73.19 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 73.48.
+
+ ### Model Optimizations
+
+ This model was obtained by quantizing the weights and activations of [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) to the FP8 data type, ready for inference with vLLM >= 0.5.0.
+ This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%; the rough estimate below illustrates the savings.
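+
+ As a back-of-the-envelope check (a sketch, not part of the original card; the ~46.7B total parameter count for Mixtral-8x7B is an assumption based on Mistral AI's published figures):
+
+ ```python
+ # Approximate checkpoint sizes for 16-bit vs. FP8 weights.
+ num_params = 46.7e9  # assumed total parameter count for Mixtral-8x7B
+ GB = 1e9
+ print(f"BF16 weights: ~{num_params * 2 / GB:.0f} GB")  # 2 bytes/param -> ~93 GB
+ print(f"FP8 weights:  ~{num_params * 1 / GB:.0f} GB")  # 1 byte/param  -> ~47 GB
+ ```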
+
+ Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied: a single linear scale maps the FP8 representation of each quantized weight and activation tensor.
+ [AutoFP8](https://github.com/neuralmagic/AutoFP8) is used for quantization, with 512 calibration sequences drawn from UltraChat. The sketch below illustrates the per-tensor scheme.
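+
+ A minimal illustration of symmetric per-tensor FP8 quantization (a sketch of the scheme, not AutoFP8's internals; assumes PyTorch >= 2.1 for the `float8_e4m3fn` dtype):
+
+ ```python
+ import torch
+
+ def quantize_fp8_per_tensor(x: torch.Tensor):
+     # E4M3 FP8 has a maximum representable magnitude of 448.
+     fp8_max = torch.finfo(torch.float8_e4m3fn).max
+     # One symmetric scale for the entire tensor.
+     scale = x.abs().max().clamp(min=1e-12) / fp8_max
+     x_fp8 = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
+     return x_fp8, scale
+
+ # Dequantize by scaling back up: x_fp8.to(torch.float32) * scale
+ ```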
+
+ ## Deployment
+
+ ### Use with vLLM
+
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
+
+ sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ messages = [
+     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+     {"role": "user", "content": "Who are you?"},
+ ]
+
+ # Render the conversation with the model's chat template into a raw prompt string.
+ prompts = tokenizer.apply_chat_template(messages, tokenize=False)
+
+ llm = LLM(model=model_id)
+
+ outputs = llm.generate(prompts, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details, and the client sketch below for an example.
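+
+ For example, with a server started via `python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8` (a hypothetical local setup; the port and placeholder API key are assumptions), a query could look like:
+
+ ```python
+ from openai import OpenAI
+
+ # Point the client at the local vLLM server; the key is unused but required.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8",
+     messages=[{"role": "user", "content": "Who are you?"}],
+     temperature=0.6,
+     max_tokens=256,
+ )
+ print(response.choices[0].message.content)
+ ```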
+
+ ## Creation
+
+ This model was created by applying [AutoFP8 with calibration samples from UltraChat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py), with the block_sparse_moe.gate layers kept at their original precision, as presented in the code snippet below.
+ Although AutoFP8 was used for this particular model, Neural Magic is transitioning to [llm-compressor](https://github.com/vllm-project/llm-compressor), which supports several quantization schemes and models not supported by AutoFP8; a hypothetical llm-compressor equivalent is sketched after the snippet.
+
+ ```python
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
+
+ pretrained_model_dir = "mistralai/Mixtral-8x7B-Instruct-v0.1"
+ quantized_model_dir = "Mixtral-8x7B-Instruct-v0.1-FP8"
+
+ tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=4096)
+ tokenizer.pad_token = tokenizer.eos_token
+
+ # 512 calibration sequences from UltraChat, rendered with the chat template.
+ ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
+ examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
+ examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
+
+ quantize_config = BaseQuantizeConfig(
+     quant_method="fp8",
+     activation_scheme="static",
+     # Keep lm_head and the MoE router gates at original precision.
+     ignore_patterns=["re:.*lm_head", "re:.*block_sparse_moe.gate"],
+ )
+
+ model = AutoFP8ForCausalLM.from_pretrained(
+     pretrained_model_dir, quantize_config=quantize_config
+ )
+ model.quantize(examples)
+ model.save_quantized(quantized_model_dir)
+ ```
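+
+ A hypothetical equivalent with llm-compressor, based on its published FP8 quickstart (the names, arguments, and `FP8_DYNAMIC` scheme are assumptions to verify against the current llm-compressor documentation; note that `FP8_DYNAMIC` uses dynamic activation scales rather than this model's static scheme):
+
+ ```python
+ from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
+ model = SparseAutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
+
+ # Quantize all Linear layers to FP8, skipping lm_head and the MoE router gates.
+ recipe = QuantizationModifier(
+     targets="Linear",
+     scheme="FP8_DYNAMIC",
+     ignore=["lm_head", "re:.*block_sparse_moe.gate"],
+ )
+ oneshot(model=model, recipe=recipe)
+
+ model.save_pretrained("Mixtral-8x7B-Instruct-v0.1-FP8-Dynamic", save_compressed=True)
+ ```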
+
+ ## Evaluation
+
+ The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8",dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096 \
+   --tasks openllm \
+   --batch_size auto
+ ```
+
+ ### Accuracy
+
+ #### Open LLM Leaderboard evaluation scores
+ <table>
+   <tr>
+     <td><strong>Benchmark</strong></td>
+     <td><strong>Mixtral-8x7B-Instruct-v0.1</strong></td>
+     <td><strong>Mixtral-8x7B-Instruct-v0.1-FP8 (this model)</strong></td>
+     <td><strong>Recovery</strong></td>
+   </tr>
+   <tr>
+     <td>MMLU (5-shot)</td>
+     <td>70.33</td>
+     <td>70.00</td>
+     <td>99.53%</td>
+   </tr>
+   <tr>
+     <td>ARC Challenge (25-shot)</td>
+     <td>71.50</td>
+     <td>71.08</td>
+     <td>99.41%</td>
+   </tr>
+   <tr>
+     <td>GSM-8K (5-shot, strict-match)</td>
+     <td>64.36</td>
+     <td>64.06</td>
+     <td>99.53%</td>
+   </tr>
+   <tr>
+     <td>HellaSwag (10-shot)</td>
+     <td>87.53</td>
+     <td>87.38</td>
+     <td>99.82%</td>
+   </tr>
+   <tr>
+     <td>Winogrande (5-shot)</td>
+     <td>82.40</td>
+     <td>82.40</td>
+     <td>100.0%</td>
+   </tr>
+   <tr>
+     <td>TruthfulQA (0-shot)</td>
+     <td>64.79</td>
+     <td>64.20</td>
+     <td>99.08%</td>
+   </tr>
+   <tr>
+     <td><strong>Average</strong></td>
+     <td><strong>73.48</strong></td>
+     <td><strong>73.19</strong></td>
+     <td><strong>99.61%</strong></td>
+   </tr>
+ </table>
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "_name_or_path": "mistralai/Mixtral-8x7B-Instruct-v0.1",
+   "architectures": [
+     "MixtralForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 14336,
+   "max_position_embeddings": 32768,
+   "model_type": "mixtral",
+   "num_attention_heads": 32,
+   "num_experts_per_tok": 2,
+   "num_hidden_layers": 32,
+   "num_key_value_heads": 8,
+   "num_local_experts": 8,
+   "output_router_logits": false,
+   "quantization_config": {
+     "activation_scheme": "static",
+     "quant_method": "fp8"
+   },
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 1000000.0,
+   "router_aux_loss_coef": 0.02,
+   "router_jitter_noise": 0.0,
+   "sliding_window": null,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.40.0",
+   "use_cache": true,
+   "vocab_size": 32000
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "transformers_version": "4.40.0"
+ }
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "add_prefix_space": null,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "chat_template": "{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content'] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}\n {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}\n {%- endif %}\n {%- if message['role'] == 'user' %}\n {%- if loop.first and system_message is defined %}\n {{- ' [INST] ' + system_message + '\\n\\n' + message['content'] + ' [/INST]' }}\n {%- else %}\n {{- ' [INST] ' + message['content'] + ' [/INST]' }}\n {%- endif %}\n {%- elif message['role'] == 'assistant' %}\n {{- ' ' + message['content'] + eos_token}}\n {%- else %}\n {{- raise_exception('Only user and assistant roles are supported, with the exception of an initial optional system message!') }}\n {%- endif %}\n{%- endfor %}\n",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": false,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": null,
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }