MLA is not supported with moe_wna16 quantization. Disabling MLA.
I hit the same warning and got an out-of-memory error.
CUDA version: 12.2
GPUs: 8 × A800
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8009 --max-model-len 10000 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.97 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-r1-awq --model /share/DeepSeek-R1-AWQ
I am working on enabling support here https://github.com/vllm-project/vllm/pull/13181
When I used this, I got an error that the Triton MLA kernel does not support fp8, so I had to set --kv-cache-dtype fp16. This didn't increase decoding speed, but it cost more GPU memory, and I got a CUDA OOM once the context length exceeded 6000.
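For context, here is a rough back-of-envelope sketch of how fast an uncompressed KV cache grows with sequence length, which is why an fp16 cache runs out of memory sooner than an fp8 one when MLA is disabled. The layer/head/dim numbers below are placeholder assumptions for illustration, not the actual DeepSeek-R1 config:

```python
# Hedged estimate of KV cache size (placeholder dimensions, not real model values).

def kv_cache_gib(num_layers, num_kv_heads, head_dim, bytes_per_elem, seq_len, batch_size=1):
    """Total bytes for the K and V caches across all layers, converted to GiB."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token * seq_len * batch_size / 1024**3

# Assumed placeholder values for a large model served without MLA compression.
layers, kv_heads, head_dim = 61, 128, 128

for name, nbytes in [("fp16", 2), ("fp8", 1)]:
    gib = kv_cache_gib(layers, kv_heads, head_dim, nbytes, seq_len=6000)
    print(f"{name}: ~{gib:.1f} GiB of KV cache at 6000 tokens")
```

With tensor parallelism the cache is sharded across the 8 GPUs, but doubling the per-element size (fp8 → fp16) still doubles the per-GPU footprint, so the usable context length roughly halves.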