Update README.md
README.md CHANGED
@@ -13,6 +13,6 @@ AWQ of the DeepSeek R1 model.
 
 This quant modified some of the model code to fix the overflow issue when using float16.
 
-Tested on vLLM with 8x H100, inference speed 5 tokens
+Tested on vLLM with 8x H100, inference speed 5 tokens per second with batch size 1 and short prompt, 12 tokens per second when using `moe_wna16` kernel.
 
 If you are serving with vLLM, please either add `--dtype float16` or use the new `moe_wna16` kernel by using `--quantization moe_wna16`.
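
The two serving options named in the README text could be written out roughly as follows. This is only a sketch and not part of the commit: `MODEL_PATH` is a hypothetical placeholder, `--tensor-parallel-size 8` is an assumption chosen to match the 8x H100 setup mentioned above, and the `vllm serve` entrypoint assumes a reasonably recent vLLM release. The snippet only prints the commands so they can be reviewed before being run.

```shell
# MODEL_PATH is a hypothetical placeholder; set it to your local copy of the quant.
MODEL_PATH="${MODEL_PATH:-/path/to/DeepSeek-R1-AWQ}"

# Option 1: force float16 (flag taken from the README text).
CMD_FLOAT16="vllm serve ${MODEL_PATH} --tensor-parallel-size 8 --dtype float16"

# Option 2: select the moe_wna16 kernel (flag taken from the README text).
CMD_MOE_WNA16="vllm serve ${MODEL_PATH} --tensor-parallel-size 8 --quantization moe_wna16"

# Print both candidate commands; run whichever one fits your setup.
echo "${CMD_FLOAT16}"
echo "${CMD_MOE_WNA16}"
```

Per the README, only one of the two options is needed; the `moe_wna16` variant is the one reported at 12 tokens per second.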