shubhrapandit committed
Commit d279068 · verified · 1 Parent(s): 647404d

Update README.md

Files changed (1): README.md (+33, −25)
README.md CHANGED
@@ -154,6 +154,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  </tr>
  <tr>
  <th>Hardware</th>
+ <th>Number of GPUs</th>
  <th>Model</th>
  <th>Average Cost Reduction</th>
  <th>Latency (s)</th>
@@ -166,7 +167,8 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  </thead>
  <tbody>
  <tr>
- <td>A100x4</td>
+ <th rowspan="3" valign="top">A100</th>
+ <td>4</td>
  <td>Qwen/Qwen2-VL-72B-Instruct</td>
  <td></td>
  <td>6.5</td>
@@ -177,7 +179,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  <td>113</td>
  </tr>
  <tr>
- <td>A100x2</td>
+ <td>2</td>
  <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w8a8</td>
  <td>1.85</td>
  <td>7.2</td>
@@ -188,7 +190,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  <td>211</td>
  </tr>
  <tr>
- <td>A100x1</td>
+ <td>1</td>
  <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16</td>
  <td>3.32</td>
  <td>10.0</td>
@@ -199,7 +201,8 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  <td>419</td>
  </tr>
  <tr>
- <td>H100x4</td>
+ <th rowspan="3" valign="top">H100</td>
+ <td>4</td>
  <td>Qwen/Qwen2-VL-72B-Instruct</td>
  <td></td>
  <td>4.4</td>
@@ -210,7 +213,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  <td>99</td>
  </tr>
  <tr>
- <td>H100x2</td>
+ <td>2</td>
  <td>neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic</td>
  <td>1.79</td>
  <td>4.7</td>
@@ -221,7 +224,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  <td>177</td>
  </tr>
  <tr>
- <td>H100x1</td>
+ <td>1</td>
  <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16</td>
  <td>2.60</td>
  <td>6.4</td>
@@ -233,7 +236,10 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  </tr>
  </tbody>
 </table>
-
+
+**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
+
+**QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
 
 ### Multi-stream asynchronous performance (measured with vLLM version 0.7.2)
 
@@ -261,7 +267,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  </thead>
  <tbody>
  <tr>
- <td>A100x4</td>
+ <th rowspan="3" valign="top">A100x4</th>
  <td>Qwen/Qwen2-VL-72B-Instruct</td>
  <td></td>
  <td>0.3</td>
@@ -272,29 +278,27 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  <td>595</td>
  </tr>
  <tr>
- <td>A100x2</td>
  <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w8a8</td>
  <td>1.84</td>
- <td>0.6</td>
+ <td>1.2</td>
  <td>293</td>
- <td>2.0</td>
+ <td>4.0</td>
  <td>1021</td>
- <td>2.3</td>
+ <td>4.6</td>
  <td>1135</td>
  </tr>
  <tr>
- <td>A100x1</td>
  <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16</td>
  <td>2.73</td>
- <td>0.6</td>
+ <td>2.4</td>
  <td>314</td>
- <td>3.2</td>
+ <td>12.8</td>
  <td>1591</td>
- <td>4.0</td>
+ <td>16.0</td>
  <td>2019</td>
  </tr>
  <tr>
- <td>H100x4</td>
+ <th rowspan="3" valign="top">H100x4</td>
  <td>Qwen/Qwen2-VL-72B-Instruct</td>
  <td></td>
  <td>0.5</td>
@@ -305,27 +309,31 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
  <td>377</td>
  </tr>
  <tr>
- <td>H100x2</td>
  <td>neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic</td>
  <td>1.70</td>
- <td>0.8</td>
+ <td>1.6</td>
  <td>236</td>
- <td>2.2</td>
+ <td>4.4</td>
  <td>623</td>
- <td>2.4</td>
+ <td>4.8</td>
  <td>669</td>
  </tr>
  <tr>
- <td>H100x1</td>
  <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16</td>
  <td>2.35</td>
- <td>1.3</td>
+ <td>5.2</td>
  <td>350</td>
- <td>3.3</td>
+ <td>13.2</td>
  <td>910</td>
- <td>3.6</td>
+ <td>14.4</td>
  <td>994</td>
  </tr>
  </tbody>
 </table>
+
+**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
+
+**QPS: Queries per second.
+
+**QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
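
The QPD footnote added in this commit defines the metric but leaves the conversion from throughput to cost implicit. Below is a minimal sketch of the arithmetic, assuming QPD is queries-per-hour divided by instance cost-per-hour; the function name and the price used are illustrative placeholders, not values from the model card or Lambda Labs:

```python
def queries_per_dollar(qps: float, num_gpus: int, usd_per_gpu_hour: float) -> float:
    """Convert measured throughput (QPS) into queries per dollar (QPD)."""
    queries_per_hour = qps * 3600                 # 3600 seconds per hour
    cost_per_hour = num_gpus * usd_per_gpu_hour   # on-demand instance cost
    return queries_per_hour / cost_per_hour

# Placeholder example: 1.2 QPS on 2 GPUs at a hypothetical $1.80/GPU-hour
print(round(queries_per_dollar(qps=1.2, num_gpus=2, usd_per_gpu_hour=1.80)))  # -> 1200
```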
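For context on how the table rows map to a deployment: the Hardware and Number of GPUs columns correspond to vLLM's tensor-parallel degree. A minimal serving sketch for the single-GPU w4a16 row follows; the prompt, sampling settings, and context length are illustrative, and the actual benchmark harness is not shown in this diff:

```python
from vllm import LLM, SamplingParams

# Single-GPU INT4 row: tensor_parallel_size matches the "Number of GPUs" column.
llm = LLM(
    model="neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16",
    tensor_parallel_size=1,
    max_model_len=4096,  # illustrative; choose per use-case profile
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize what weight-only INT4 quantization changes."], params)
print(outputs[0].outputs[0].text)
```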