nm-research committed
Commit debee57 · verified · Parent: 613a7f4

Update README.md

Files changed (1): README.md (+190 -3)
README.md CHANGED
@@ -25,7 +25,7 @@ library_name: transformers
 - **Model Developers:** Neural Magic
 
 Quantized version of [ibm-granite/granite-3.1-2b-instruct](https://huggingface.co/ibm-granite/granite-3.1-2b-instruct).
-It achieves an average score of xxxx on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves xxxx.
+It achieves an average score of 61.68 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 61.98.
 
 ### Model Optimizations
 
@@ -74,7 +74,7 @@ python quantize.py --model_path ibm-granite/granite-3.1-2b-instruct --quant_path
 
 ```python
 from datasets import load_dataset
-from transformers import AutoTokenizer
+from transformers import AutoTokenizer, AutoModelForCausalLM
 from llmcompressor.modifiers.quantization import GPTQModifier
 from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
 from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot, apply
@@ -90,7 +90,7 @@ parser.add_argument('--dampening_frac', type=float, default=0.1)
 parser.add_argument('--observer', type=str, default="minmax")
 args = parser.parse_args()
 
-model = SparseAutoModelForCausalLM.from_pretrained(
+model = AutoModelForCausalLM.from_pretrained(
  args.model_path,
  device_map="auto",
  torch_dtype="auto",
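The hunks above replace the llmcompressor-specific SparseAutoModelForCausalLM loader with the standard transformers AutoModelForCausalLM. For orientation, here is a minimal sketch of how these imports typically combine into a one-shot W8A8 run with llmcompressor; the calibration dataset, recipe values, sequence length, and output path are illustrative assumptions, not the exact settings of this repo's quantize.py:

```python
# Hypothetical end-to-end sketch; dataset and recipe values are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "ibm-granite/granite-3.1-2b-instruct"
SAVE_DIR = "granite-3.1-2b-instruct-quantized.w8a8"  # assumed output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Small chat-formatted calibration set (dataset choice is an assumption).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=2048, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

recipe = [
    # Fold activation outliers into the weights before quantizing.
    SmoothQuantModifier(smoothing_strength=0.8),
    # INT8 weights and activations for all Linear layers except the LM head.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.1),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting compressed-tensors checkpoint can be loaded directly by vLLM.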
@@ -223,3 +223,190 @@ evalplus.evaluate \
 |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
 | HumanEval Pass@1 | 53.40 | 54.9 |
 
+
+## Inference Performance
+
+
+This model achieves up to 1.4x speedup in single-stream deployment and up to 1.1x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario.
+The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.6.6.post1 and [GuideLLM](https://github.com/neuralmagic/guidellm).
+
+### Single-stream performance (measured with vLLM version 0.6.6.post1)
+<table>
+  <tr>
+    <td></td>
+    <td></td>
+    <td></td>
+    <th style="text-align: center;" colspan="7">Latency (s)</th>
+  </tr>
+  <tr>
+    <th>GPU class</th>
+    <th>Model</th>
+    <th>Speedup</th>
+    <th>Code Completion<br>prefill: 256 tokens<br>decode: 1024 tokens</th>
+    <th>Docstring Generation<br>prefill: 768 tokens<br>decode: 128 tokens</th>
+    <th>Code Fixing<br>prefill: 1024 tokens<br>decode: 1024 tokens</th>
+    <th>RAG<br>prefill: 1024 tokens<br>decode: 128 tokens</th>
+    <th>Instruction Following<br>prefill: 256 tokens<br>decode: 128 tokens</th>
+    <th>Multi-turn Chat<br>prefill: 512 tokens<br>decode: 256 tokens</th>
+    <th>Large Summarization<br>prefill: 4096 tokens<br>decode: 512 tokens</th>
+  </tr>
+  <tr>
+    <td style="vertical-align: middle;" rowspan="3">A5000</td>
+    <td>granite-3.1-2b-instruct</td>
+    <td></td>
+    <td>10.9</td>
+    <td>1.4</td>
+    <td>11.0</td>
+    <td>1.5</td>
+    <td>1.4</td>
+    <td>2.8</td>
+    <td>6.1</td>
+  </tr>
+  <tr>
+    <td>granite-3.1-2b-instruct-quantized.w8a8<br>(this model)</td>
+    <td>1.37</td>
+    <td>7.9</td>
+    <td>1.0</td>
+    <td>8.0</td>
+    <td>1.1</td>
+    <td>1.0</td>
+    <td>2.0</td>
+    <td>4.7</td>
+  </tr>
+  <tr>
+    <td>granite-3.1-2b-instruct-quantized.w4a16</td>
+    <td>1.94</td>
+    <td>5.4</td>
+    <td>0.7</td>
+    <td>5.5</td>
+    <td>0.8</td>
+    <td>0.7</td>
+    <td>1.4</td>
+    <td>3.4</td>
+  </tr>
+  <tr>
+    <td style="vertical-align: middle;" rowspan="3">A6000</td>
+    <td>granite-3.1-2b-instruct</td>
+    <td></td>
+    <td>9.8</td>
+    <td>1.3</td>
+    <td>10.0</td>
+    <td>1.3</td>
+    <td>1.3</td>
+    <td>2.6</td>
+    <td>5.4</td>
+  </tr>
+  <tr>
+    <td>granite-3.1-2b-instruct-quantized.w8a8<br>(this model)</td>
+    <td>1.31</td>
+    <td>7.8</td>
+    <td>1.0</td>
+    <td>7.6</td>
+    <td>1.0</td>
+    <td>0.9</td>
+    <td>1.9</td>
+    <td>4.5</td>
+  </tr>
+  <tr>
+    <td>granite-3.1-2b-instruct-quantized.w4a16</td>
+    <td>1.87</td>
+    <td>5.1</td>
+    <td>0.7</td>
+    <td>5.2</td>
+    <td>0.7</td>
+    <td>0.7</td>
+    <td>1.3</td>
+    <td>3.1</td>
+  </tr>
+</table>
+
+
+### Multi-stream asynchronous performance (measured with vLLM version 0.6.6.post1)
+<table>
+  <tr>
+    <td></td>
+    <td></td>
+    <td></td>
+    <th style="text-align: center;" colspan="7">Maximum Throughput (Queries per Second)</th>
+  </tr>
+  <tr>
+    <th>GPU class</th>
+    <th>Model</th>
+    <th>Speedup</th>
+    <th>Code Completion<br>prefill: 256 tokens<br>decode: 1024 tokens</th>
+    <th>Docstring Generation<br>prefill: 768 tokens<br>decode: 128 tokens</th>
+    <th>Code Fixing<br>prefill: 1024 tokens<br>decode: 1024 tokens</th>
+    <th>RAG<br>prefill: 1024 tokens<br>decode: 128 tokens</th>
+    <th>Instruction Following<br>prefill: 256 tokens<br>decode: 128 tokens</th>
+    <th>Multi-turn Chat<br>prefill: 512 tokens<br>decode: 256 tokens</th>
+    <th>Large Summarization<br>prefill: 4096 tokens<br>decode: 512 tokens</th>
+  </tr>
+  <tr>
+    <td style="vertical-align: middle;" rowspan="3">A5000</td>
+    <td>granite-3.1-2b-instruct</td>
+    <td></td>
+    <td>2.9</td>
+    <td>10.2</td>
+    <td>1.8</td>
+    <td>8.2</td>
+    <td>19.3</td>
+    <td>9.1</td>
+    <td>1.3</td>
+  </tr>
+  <tr>
+    <td>granite-3.1-2b-instruct-quantized.w8a8<br>(this model)</td>
+    <td>1.13</td>
+    <td>3.1</td>
+    <td>12.1</td>
+    <td>2.0</td>
+    <td>9.6</td>
+    <td>22.2</td>
+    <td>10.2</td>
+    <td>1.4</td>
+  </tr>
+  <tr>
+    <td>granite-3.1-2b-instruct-quantized.w4a16</td>
+    <td>0.98</td>
+    <td>2.8</td>
+    <td>10.0</td>
+    <td>1.8</td>
+    <td>8.1</td>
+    <td>18.6</td>
+    <td>9.0</td>
+    <td>1.2</td>
+  </tr>
+  <tr>
+    <td style="vertical-align: middle;" rowspan="3">A6000</td>
+    <td>granite-3.1-2b-instruct</td>
+    <td></td>
+    <td>3.7</td>
+    <td>12.4</td>
+    <td>2.4</td>
+    <td>10.3</td>
+    <td>23.6</td>
+    <td>11.0</td>
+    <td>1.6</td>
+  </tr>
+  <tr>
+    <td>granite-3.1-2b-instruct-quantized.w8a8<br>(this model)</td>
+    <td>1.12</td>
+    <td>3.6</td>
+    <td>14.4</td>
+    <td>2.7</td>
+    <td>12.0</td>
+    <td>28.3</td>
+    <td>12.9</td>
+    <td>1.7</td>
+  </tr>
+  <tr>
+    <td>granite-3.1-2b-instruct-quantized.w4a16</td>
+    <td>0.95</td>
+    <td>3.7</td>
+    <td>11.4</td>
+    <td>2.5</td>
+    <td>9.8</td>
+    <td>22.1</td>
+    <td>10.4</td>
+    <td>1.4</td>
+  </tr>
+</table>
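
The Speedup column is consistent with averaging the per-scenario gain over the unquantized baseline (latency ratio for single-stream, throughput ratio for multi-stream); that definition is inferred from the numbers, not stated in the card. A quick arithmetic check against the A5000 rows:

```python
# Inferred check: Speedup ~ mean per-scenario ratio vs. the unquantized baseline.
# A5000 single-stream latencies (s), from the first table:
base_lat = [10.9, 1.4, 11.0, 1.5, 1.4, 2.8, 6.1]
w8a8_lat = [7.9, 1.0, 8.0, 1.1, 1.0, 2.0, 4.7]
print(round(sum(b / q for b, q in zip(base_lat, w8a8_lat)) / 7, 2))  # 1.37

# A5000 multi-stream throughput (QPS), from the second table:
base_qps = [2.9, 10.2, 1.8, 8.2, 19.3, 9.1, 1.3]
w8a8_qps = [3.1, 12.1, 2.0, 9.6, 22.2, 10.2, 1.4]
print(round(sum(q / b for b, q in zip(base_qps, w8a8_qps)) / 7, 2))  # 1.13
```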
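
Since each scenario fixes a prefill/decode token budget, a single-stream data point can be reproduced approximately with vLLM's offline API. A minimal sketch; the repo id below is an assumption based on the model name in the tables (substitute a local path if it differs), and the prompt merely stands in for a ~256-token input:

```python
from vllm import LLM, SamplingParams

# Assumed repo id for this model; adjust if the published name differs.
llm = LLM(model="neuralmagic/granite-3.1-2b-instruct-quantized.w8a8")

# Roughly the "Instruction Following" scenario: short prompt, 128 decoded tokens.
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain the difference between INT8 and FP16 inference."], params)
print(outputs[0].outputs[0].text)
```

The multi-stream numbers were gathered with GuideLLM driving a vLLM server; see the GuideLLM README for the exact invocation.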