MLA is not supported with moe_wna16 quantization. Disabling MLA.

#7
by AMOSE

MLA is not supported with moe_wna16 quantization. Disabling MLA.

I ran into the same warning and then got an out-of-memory error.
CUDA version: 12.2
GPUs: 8 × A800
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8009 --max-model-len 10000 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.97 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-r1-awq --model /share/DeepSeek-R1-AWQ

Cognitive Computations org

@AMOSE @hhharold MLA is indeed not supported; you can't use MLA with AWQ. If you get OOM errors, reduce the --gpu-memory-utilization flag.
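
For example, a minimal sketch of an adjusted launch command, assuming the same setup as the command above; only --gpu-memory-utilization is lowered, and 0.90 is just an illustrative value to tune for your hardware:

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8009 --max-model-len 10000 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.90 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-r1-awq --model /share/DeepSeek-R1-AWQ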

I am working on enabling support here https://github.com/vllm-project/vllm/pull/13181

When I used this, I got an error that the Triton MLA kernel does not support fp8, so I had to set --kv-cache-dtype to fp16. This didn't increase decoding speed, but it used more GPU memory, and I get a CUDA OOM once the context length exceeds 6000.
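
For reference, a sketch of that fallback launch, with a few assumptions: --kv-cache-dtype auto (the model's own dtype) is used in place of fp16, since fp16 is not, as far as I know, an accepted value for that flag; --calculate-kv-scales is dropped because it only applies to fp8 KV caches; and --max-model-len 6000 is one illustrative way to stay under the memory limit:

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8009 --max-model-len 6000 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.97 --kv-cache-dtype auto --served-model-name deepseek-r1-awq --model /share/DeepSeek-R1-AWQ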
