nm-research committed on
Commit
4d3ece9
·
verified ·
1 Parent(s): 1068a02

Update README.md

Files changed (1)
  1. README.md +171 -82
README.md CHANGED
@@ -8,29 +8,33 @@ base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
8
  library_name: transformers
9
  ---
10
 
11
- # DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic
12
 
13
  ## Model Overview
14
- - **Model Architecture:** DeepSeek-R1-Distill-Qwen-14B
15
  - **Input:** Text
16
  - **Output:** Text
17
  - **Model Optimizations:**
18
  - **Weight quantization:** FP8
19
  - **Activation quantization:** FP8
20
- - **Release Date:** 2/6/2025
21
  - **Version:** 1.0
22
  - **Model Developers:** Neural Magic
23
 
24
  Quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B).
25
 
 
26
  ### Model Optimizations
27
 
28
- This model was obtained by quantizing the weights and activations to FP8 data type, ready for inference with vLLM.
29
- This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized.
 
 
 
 
30
 
31
- ## Deployment
32
 
33
- ### Use with vLLM
34
 
35
  This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
36
 
@@ -38,11 +42,12 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
38
  from transformers import AutoTokenizer
39
  from vllm import LLM, SamplingParams
40
 
41
- max_model_len, tp_size = 4096, 1
42
- model_name = "neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic"
 
43
  tokenizer = AutoTokenizer.from_pretrained(model_name)
44
- llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
45
- sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
46
 
47
  messages_list = [
48
  [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
@@ -64,44 +69,40 @@ This model was created with [llm-compressor](https://github.com/vllm-project/llm
64
 
65
 
66
  ```python
67
- import argparse
68
  from transformers import AutoModelForCausalLM, AutoTokenizer
69
  from llmcompressor.modifiers.quantization import QuantizationModifier
70
  from llmcompressor.transformers import oneshot
71
  import os
72
 
73
- def main():
74
- parser = argparse.ArgumentParser(description='Quantize a transformer model to FP8')
75
- parser.add_argument('--model_id', type=str, required=True,
76
- help='The model ID from HuggingFace (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")')
77
- parser.add_argument('--save_path', type=str, default='.',
78
- help='Custom path to save the quantized model. If not provided, will use model_name-FP8-dynamic')
79
- args = parser.parse_args()
80
-
81
- # Load model
82
- model = AutoModelForCausalLM.from_pretrained(
83
- args.model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True,
84
- )
85
- tokenizer = AutoTokenizer.from_pretrained(args.model_id)
86
-
87
- # Configure the quantization algorithm and scheme
88
- recipe = QuantizationModifier(
89
- targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
90
- )
91
-
92
- # Apply quantization
93
- oneshot(model=model, recipe=recipe)
94
-
95
- save_path = os.path.join(args.save_path, args.model_id.split("/")[1] + "-FP8-dynamic")
96
- os.makedirs(save_path, exist_ok=True)
97
-
98
- # Save to disk in compressed-tensors format
99
- model.save_pretrained(save_path)
100
- tokenizer.save_pretrained(save_path)
101
- print(f"Model and tokenizer saved to: {save_path}")
102
-
103
- if __name__ == "__main__":
104
- main()
105
  ```
106
 
107
  ## Evaluation
@@ -112,7 +113,7 @@ OpenLLM Leaderboard V1:
112
  ```
113
  lm_eval \
114
  --model vllm \
115
- --model_args pretrained="neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
116
  --tasks openllm \
117
  --write_out \
118
  --batch_size auto \
@@ -124,7 +125,7 @@ OpenLLM Leaderboard V2:
124
  ```
125
  lm_eval \
126
  --model vllm \
127
- --model_args pretrained="neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
128
  --apply_chat_template \
129
  --fewshot_as_multiturn \
130
  --tasks leaderboard \
@@ -132,43 +133,131 @@ lm_eval \
132
  --batch_size auto \
133
  --output_path output_dir \
134
  --show_config
135
-
136
  ```
137
 
138
  ### Accuracy
139
 
140
- #### OpenLLM Leaderboard V1 evaluation scores
141
-
142
- | Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic |
143
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
144
- | ARC-Challenge (Acc-Norm, 25-shot) | 58.79 | 58.02 |
145
- | GSM8K (Strict-Match, 5-shot) | 87.04 | 87.41 |
146
- | HellaSwag (Acc-Norm, 10-shot) | 81.51 | 81.46 |
147
- | MMLU (Acc, 5-shot) | 74.46 | 74.63 |
148
- | TruthfulQA (MC2, 0-shot) | 54.77 | 54.36 |
149
- | Winogrande (Acc, 5-shot) | 69.38 | 68.98 |
150
- | **Average Score** | **70.99** | **70.81** |
151
- | **Recovery (%)** | **100.00** | **99.75** |
152
-
153
- #### OpenLLM Leaderboard V2 evaluation scores
154
-
155
- | Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic |
156
- |---------------------------------------------------------|:---------------------------------:|:-------------------------------------------:|
157
- | IFEval (Inst-and-Prompt Level Strict Acc, 0-shot) | 43.05 | 43.69 |
158
- | BBH (Acc-Norm, 3-shot) | 47.16 | 47.92 |
159
- | GPQA (Acc-Norm, 0-shot) | 35.07 | 35.05 |
160
- | MUSR (Acc-Norm, 0-shot) | 45.14 | 44.62 |
161
- | MMLU-Pro (Acc, 5-shot) | 34.86 | 35.04 |
162
- | **Average Score** | **41.05** | **41.26** |
163
- | **Recovery (%)** | **100.00** | **100.51** |
164
-
165
- #### Coding evaluation scores
166
-
167
- | Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic |
168
- |---------------------------------------------------------|:---------------------------------:|:-------------------------------------------:|
169
- | HumanEval pass@1 | 78.90 | 77.20 |
170
- | HumanEval pass@10 | 89.80 | 90.40 |
171
- | HumanEval+ pass@1 | 72.60 | 72.40 |
172
- | HumanEval+ pass@10 | 84.90 | 85.90 |
173
- | **Average Score** | **81.55** | **81.47** |
174
- | **Recovery (%)** | **100.00** | **99.90** |
 
8
  library_name: transformers
9
  ---
10
 
11
+ # DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic
12
 
13
  ## Model Overview
14
+ - **Model Architecture:** Qwen2ForCausalLM
15
  - **Input:** Text
16
  - **Output:** Text
17
  - **Model Optimizations:**
18
  - **Weight quantization:** FP8
19
  - **Activation quantization:** FP8
20
+ - **Release Date:** 2/5/2025
21
  - **Version:** 1.0
22
  - **Model Developers:** Neural Magic
23
 
24
  Quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B).
25
 
26
+
27
  ### Model Optimizations
28
 
29
+ This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) to the FP8 data type.
30
+ This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
31
+
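The "approximately 50%" figure follows from simple arithmetic on bits per parameter. The sketch below is a rough estimate only: the parameter count of 14 billion is an assumption taken from the model name, and `weight_memory_gb` is a hypothetical helper, not part of any library. Layers left unquantized (such as `lm_head`) keep their 16-bit size, so the real saving is slightly below half.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 14 billion parameters, as suggested by the model name.
NUM_PARAMS = 14e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # original 16-bit checkpoint
fp8_gb = weight_memory_gb(NUM_PARAMS, 8)    # every weight stored in 8 bits
reduction = 1 - fp8_gb / bf16_gb            # fraction of memory saved

print(f"16-bit: {bf16_gb:.1f} GB, FP8: {fp8_gb:.1f} GB, saved: {reduction:.0%}")
```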
32
+ Only the weights and activations of the linear operators within transformer blocks are quantized.
33
+ Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme.
34
+ [LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
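To illustrate the difference between per-channel and per-token scaling, here is a minimal pure-Python sketch of how symmetric scales could be derived. It is not the LLM Compressor implementation: the helper names are hypothetical, the FP8 variant is assumed to be E4M3 (largest finite magnitude 448), and the final rounding onto the FP8 grid is omitted.

```python
FP8_E4M3_MAX = 448.0  # assumption: E4M3 variant, largest finite magnitude


def symmetric_row_scales(matrix):
    """Return one symmetric scale per row: max(|x|) / FP8_E4M3_MAX."""
    scales = []
    for row in matrix:
        max_abs = max(abs(x) for x in row)
        # Guard against all-zero rows so the later division is well defined.
        scales.append(max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0)
    return scales


def scale_rows(matrix, scales):
    """Divide each row by its scale so its values fit the FP8 range."""
    return [[x / s for x in row] for row, s in zip(matrix, scales)]


# Weights, shape (out_channels, in_features): one scale per output channel,
# computed once when the model is quantized.
weights = [[0.5, -2.0, 1.0], [0.1, 0.05, -0.2]]
weight_scales = symmetric_row_scales(weights)

# Activations, shape (tokens, in_features): one scale per token,
# computed at inference time, which is what "dynamic" refers to.
activations = [[3.0, -1.5, 0.25], [0.0, 8.0, -4.0]]
activation_scales = symmetric_row_scales(activations)

scaled_weights = scale_rows(weights, weight_scales)
scaled_activations = scale_rows(activations, activation_scales)
```

Because the scheme is symmetric, no zero point is stored; each row is described by its scale alone.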
35
 
 
36
 
37
+ ## Use with vLLM
38
 
39
  This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
40
 
 
42
  from transformers import AutoTokenizer
43
  from vllm import LLM, SamplingParams
44
 
45
+ number_gpus = 1
46
+ model_name = "neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic"
47
+
48
  tokenizer = AutoTokenizer.from_pretrained(model_name)
49
+ sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
50
+ llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)
51
 
52
  messages_list = [
53
  [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
 
69
 
70
 
71
  ```python
 
72
  from transformers import AutoModelForCausalLM, AutoTokenizer
73
  from llmcompressor.modifiers.quantization import QuantizationModifier
74
  from llmcompressor.transformers import oneshot
75
  import os
76
 
77
+ # Load model
78
+ model_stub = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
79
+ model_name = model_stub.split("/")[-1]
80
+
81
+ model = AutoModelForCausalLM.from_pretrained(
82
+ model_stub,
83
+ torch_dtype="auto",
84
+ )
85
+
86
+ tokenizer = AutoTokenizer.from_pretrained(model_stub)
87
+
88
+ # Configure the quantization algorithm and scheme
89
+ recipe = QuantizationModifier(
90
+ targets="Linear",
91
+ scheme="FP8_DYNAMIC",
92
+ ignore=["lm_head"],
93
+ )
94
+
95
+ # Apply quantization
96
+ oneshot(
97
+ model=model,
98
+ recipe=recipe,
99
+ )
100
+
101
+ # Save to disk in compressed-tensors format
102
+ save_path = model_name + "-FP8-dynamic"
103
+ model.save_pretrained(save_path)
104
+ tokenizer.save_pretrained(save_path)
105
+ print(f"Model and tokenizer saved to: {save_path}")
 
 
 
106
  ```
107
 
108
  ## Evaluation
 
113
  ```
114
  lm_eval \
115
  --model vllm \
116
+ --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
117
  --tasks openllm \
118
  --write_out \
119
  --batch_size auto \
 
125
  ```
126
  lm_eval \
127
  --model vllm \
128
+ --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
129
  --apply_chat_template \
130
  --fewshot_as_multiturn \
131
  --tasks leaderboard \
 
133
  --batch_size auto \
134
  --output_path output_dir \
135
  --show_config
 
136
  ```
137
 
138
  ### Accuracy
139
 
140
+ <table>
141
+ <thead>
142
+ <tr>
143
+ <th>Category</th>
144
+ <th>Metric</th>
145
+ <th>deepseek-ai/DeepSeek-R1-Distill-Qwen-14B</th>
146
+ <th>neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic</th>
147
+ <th>Recovery</th>
148
+ </tr>
149
+ </thead>
150
+ <tbody>
151
+ <tr>
152
+ <td rowspan="7"><b>OpenLLM V1</b></td>
153
+ <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
154
+ <td>58.79</td>
155
+ <td>58.02</td>
156
+ <td>98.7%</td>
157
+ </tr>
158
+ <tr>
159
+ <td>GSM8K (Strict-Match, 5-shot)</td>
160
+ <td>87.04</td>
161
+ <td>87.41</td>
162
+ <td>100.4%</td>
163
+ </tr>
164
+ <tr>
165
+ <td>HellaSwag (Acc-Norm, 10-shot)</td>
166
+ <td>81.51</td>
167
+ <td>81.46</td>
168
+ <td>100.0%</td>
169
+ </tr>
170
+ <tr>
171
+ <td>MMLU (Acc, 5-shot)</td>
172
+ <td>74.46</td>
173
+ <td>74.63</td>
174
+ <td>100.2%</td>
175
+ </tr>
176
+ <tr>
177
+ <td>TruthfulQA (MC2, 0-shot)</td>
178
+ <td>54.77</td>
179
+ <td>54.36</td>
180
+ <td>99.3%</td>
181
+ </tr>
182
+ <tr>
183
+ <td>Winogrande (Acc, 5-shot)</td>
184
+ <td>69.38</td>
185
+ <td>68.98</td>
186
+ <td>99.4%</td>
187
+ </tr>
188
+ <tr>
189
+ <td><b>Average Score</b></td>
190
+ <td><b>70.99</b></td>
191
+ <td><b>70.81</b></td>
192
+ <td><b>99.8%</b></td>
193
+ </tr>
194
+ <tr>
195
+ <td rowspan="7"><b>OpenLLM V2</b></td>
196
+ <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
197
+ <td>43.05</td>
198
+ <td>43.69</td>
199
+ <td>101.5%</td>
200
+ </tr>
201
+ <tr>
202
+ <td>BBH (Acc-Norm, 3-shot)</td>
203
+ <td>47.16</td>
204
+ <td>47.92</td>
205
+ <td>101.6%</td>
206
+ </tr>
207
+ <tr>
208
+ <td>Math-Hard (Exact-Match, 4-shot)</td>
209
+ <td>0.00</td>
210
+ <td>0.00</td>
211
+ <td>---</td>
212
+ </tr>
213
+ <tr>
214
+ <td>GPQA (Acc-Norm, 0-shot)</td>
215
+ <td>35.07</td>
216
+ <td>35.05</td>
217
+ <td>100.0%</td>
218
+ </tr>
219
+ <tr>
220
+ <td>MUSR (Acc-Norm, 0-shot)</td>
221
+ <td>45.14</td>
222
+ <td>44.62</td>
223
+ <td>98.8%</td>
224
+ </tr>
225
+ <tr>
226
+ <td>MMLU-Pro (Acc, 5-shot)</td>
227
+ <td>34.86</td>
228
+ <td>35.04</td>
229
+ <td>100.5%</td>
230
+ </tr>
231
+ <tr>
232
+ <td><b>Average Score</b></td>
233
+ <td><b>34.21</b></td>
234
+ <td><b>34.39</b></td>
235
+ <td><b>100.5%</b></td>
236
+ </tr>
237
+ <tr>
238
+ <td rowspan="4"><b>Coding</b></td>
239
+ <td>HumanEval (pass@1)</td>
240
+ <td>78.90</td>
241
+ <td>77.20</td>
242
+ <td>97.9%</td>
243
+ </tr>
244
+ <tr>
245
+ <td>HumanEval (pass@10)</td>
246
+ <td>89.80</td>
247
+ <td>90.40</td>
248
+ <td>100.7%</td>
249
+ </tr>
250
+ <tr>
251
+ <td>HumanEval+ (pass@1)</td>
252
+ <td>72.60</td>
253
+ <td>72.40</td>
254
+ <td>99.7%</td>
255
+ </tr>
256
+ <tr>
257
+ <td>HumanEval+ (pass@10)</td>
258
+ <td>84.90</td>
259
+ <td>85.90</td>
260
+ <td>101.2%</td>
261
+ </tr>
262
+ </tbody>
263
+ </table>
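The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. The sketch below recomputes it for a few rows under that assumption; `recovery` is a hypothetical helper. Values recomputed from the rounded scores shown in the table can differ from the published column by 0.1, presumably because the published figures were derived from unrounded scores.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * quantized / baseline, 1)


# (baseline, quantized) score pairs copied from the table above.
scores = {
    "ARC-Challenge": (58.79, 58.02),
    "GSM8K": (87.04, 87.41),
    "MMLU": (74.46, 74.63),
}

for metric, (baseline, quantized) in scores.items():
    print(f"{metric}: {recovery(baseline, quantized)}%")
```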