MLA is not supported with moe_wna16 quantization. Disabling MLA.
I hit the same warning and got an out-of-memory error.
CUDA version: 12.2
GPUs: 8 × A800
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8009 --max-model-len 10000 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.97 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-r1-awq --model /share/DeepSeek-R1-AWQ
I am working on enabling support here https://github.com/vllm-project/vllm/pull/13181
When I used this, I got an error that the Triton MLA kernel does not support fp8, so I had to set --kv-cache-dtype fp16. This didn't increase decoding speed, but it cost more GPU memory, and I got a CUDA OOM once the context length exceeded 6000.
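For context, here is a rough back-of-envelope sketch of how fast an uncompressed KV cache grows with sequence length, which is why an fp16 cache runs out of memory sooner than an fp8 one when MLA is disabled. The layer/head/dim numbers below are placeholder assumptions for illustration, not the actual DeepSeek-R1 config:

```python
# Hedged estimate of KV cache size (placeholder dimensions, not real model values).

def kv_cache_gib(num_layers, num_kv_heads, head_dim, bytes_per_elem, seq_len, batch_size=1):
    """Total bytes for the K and V caches across all layers, converted to GiB."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token * seq_len * batch_size / 1024**3

# Assumed placeholder values for a large model served without MLA compression.
layers, kv_heads, head_dim = 61, 128, 128

for name, nbytes in [("fp16", 2), ("fp8", 1)]:
    gib = kv_cache_gib(layers, kv_heads, head_dim, nbytes, seq_len=6000)
    print(f"{name}: ~{gib:.1f} GiB of KV cache at 6000 tokens")
```

With tensor parallelism the cache is sharded across the 8 GPUs, but doubling the per-element size (fp8 → fp16) still doubles the per-GPU footprint, so the usable context length roughly halves.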