Commit f8353f4 by nm-research (verified) · 1 Parent(s): 0d4e980

Update README.md

Files changed (1): README.md (+70 -0)
@@ -225,6 +225,76 @@ evalplus.evaluate \
  </table>

+ ## Inference Performance
+
+
+ This model achieves up to a 1.2x speedup in single-stream deployment on L40 GPUs.
+ The following performance benchmarks were run with [vLLM](https://docs.vllm.ai/en/latest/) version 0.6.6.post1 and [GuideLLM](https://github.com/neuralmagic/guidellm).
+
+ <details>
+ <summary>Benchmarking Command</summary>
+
+ ```
+ guidellm --model neuralmagic/granite-3.1-2b-base-FP8-dynamic --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server
+ ```
+
+ </details>
+
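The `<prompt_tokens>` and `<generated_tokens>` placeholders in the benchmarking command are filled in once per use case. As a minimal sketch, the snippet below expands the `--data` value for each workload listed in the single-stream table. It assumes that "prefill" corresponds to `prompt_tokens` and "decode" to `generated_tokens`; the names `WORKLOADS`, `DATA_TEMPLATE`, and `data_argument` are hypothetical helpers, not part of GuideLLM.

```python
# Workload shapes taken from the table headers: (prefill tokens, decode tokens).
# Assumption: prefill maps to prompt_tokens and decode maps to generated_tokens.
WORKLOADS = {
    "Code Completion": (256, 1024),
    "Docstring Generation": (768, 128),
    "Code Fixing": (1024, 1024),
    "RAG": (1024, 128),
    "Instruction Following": (256, 128),
    "Multi-turn Chat": (512, 256),
    "Large Summarization": (4096, 512),
}

# The --data value exactly as it appears in the benchmarking command.
DATA_TEMPLATE = "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>"


def data_argument(prefill: int, decode: int) -> str:
    """Substitute concrete token counts into the --data placeholder string."""
    return (
        DATA_TEMPLATE
        .replace("<prompt_tokens>", str(prefill))
        .replace("<generated_tokens>", str(decode))
    )


# Print one --data value per use case, ready to paste into the command.
for name, (prefill, decode) in WORKLOADS.items():
    print(f'{name}: --data "{data_argument(prefill, decode)}"')
```

Each printed value replaces the `--data` argument of the command; the remaining flags stay unchanged between runs.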
+ ### Single-stream performance (measured with vLLM version 0.6.6.post1)
+ <table>
+ <tr>
+ <td></td>
+ <td></td>
+ <td></td>
+ <th style="text-align: center;" colspan="7" >Latency (s)</th>
+ </tr>
+ <tr>
+ <th>GPU class</th>
+ <th>Model</th>
+ <th>Speedup</th>
+ <th>Code Completion<br>prefill: 256 tokens<br>decode: 1024 tokens</th>
+ <th>Docstring Generation<br>prefill: 768 tokens<br>decode: 128 tokens</th>
+ <th>Code Fixing<br>prefill: 1024 tokens<br>decode: 1024 tokens</th>
+ <th>RAG<br>prefill: 1024 tokens<br>decode: 128 tokens</th>
+ <th>Instruction Following<br>prefill: 256 tokens<br>decode: 128 tokens</th>
+ <th>Multi-turn Chat<br>prefill: 512 tokens<br>decode: 256 tokens</th>
+ <th>Large Summarization<br>prefill: 4096 tokens<br>decode: 512 tokens</th>
+ </tr>
+ <tr>
+ <td style="vertical-align: middle;" rowspan="3" >L40</td>
+ <td>granite-3.1-2b-base</td>
+ <td></td>
+ <td>9.3</td>
+ <td>1.2</td>
+ <td>9.4</td>
+ <td>1.2</td>
+ <td>1.2</td>
+ <td>2.3</td>
+ <td>5.0</td>
+ </tr>
+ <tr>
+ <td>granite-3.1-2b-base-FP8-dynamic<br>(this model)</td>
+ <td>1.26</td>
+ <td>7.3</td>
+ <td>0.9</td>
+ <td>7.4</td>
+ <td>1.0</td>
+ <td>0.9</td>
+ <td>1.8</td>
+ <td>4.1</td>
+ </tr>
+ <tr>
+ <td>granite-3.1-2b-base-quantized.w4a16</td>
+ <td>1.88</td>
+ <td>4.8</td>
+ <td>0.6</td>
+ <td>4.9</td>
+ <td>0.6</td>
+ <td>0.6</td>
+ <td>1.2</td>
+ <td>2.8</td>
+ </tr>
+ </table>
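To make the relationship between the latency columns and the Speedup column concrete, here is a rough sketch that divides the baseline latency by each quantized model's latency, per use case. The latency values are copied from the table; the variable and function names are hypothetical. The README does not state how the Speedup column was aggregated, so the plain mean used here is an assumption. Because the published latencies are rounded to one decimal, this mean comes out near 1.27 and 1.94 rather than the reported 1.26 and 1.88.

```python
from statistics import mean

# Latencies in seconds on L40, in table column order (Code Completion ... Large Summarization).
BASELINE = [9.3, 1.2, 9.4, 1.2, 1.2, 2.3, 5.0]   # granite-3.1-2b-base
FP8_DYNAMIC = [7.3, 0.9, 7.4, 1.0, 0.9, 1.8, 4.1]  # granite-3.1-2b-base-FP8-dynamic
W4A16 = [4.8, 0.6, 4.9, 0.6, 0.6, 1.2, 2.8]        # granite-3.1-2b-base-quantized.w4a16


def speedups(baseline: list[float], candidate: list[float]) -> list[float]:
    """Per-use-case speedup: baseline latency divided by candidate latency."""
    return [b / c for b, c in zip(baseline, candidate)]


# Assumption: a plain arithmetic mean over use cases; the table's own
# aggregation method is not documented in the README.
for label, latencies in [("FP8-dynamic", FP8_DYNAMIC), ("w4a16", W4A16)]:
    ratios = speedups(BASELINE, latencies)
    print(label, [round(r, 2) for r in ratios], "mean:", round(mean(ratios), 2))
```

The small gap to the reported figures is consistent with the Speedup column having been computed from unrounded measurements, though the source does not confirm this.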