vLLM support on A100

#2
by HuggingLianWang

Can this model be served directly using vLLM on 8xA100 (80GB)?

Cognitive Computations org

Yes, but it will run at around 3.7 tokens per second.

Thank you very much, we will try it.

Succeeded.
Inference speed is about 3.5 tokens/s with batch size 1 on 8xA100 (80GB).

Cognitive Computations org

There's a PR which claims to boost it to 30 tokens per second; haven't tried it, though.

Very good, about 3 tokens/s on 8xA100.

vllm serve cognitivecomputations/DeepSeek-V3-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.99 \
    --swap-space 32 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --dtype float16
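
Once it's up, a quick sanity check against the OpenAI-compatible endpoint (a minimal sketch, assuming the defaults above; since no --served-model-name is set, the model name is the model path):

# Send one chat completion request to the server started above.
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cognitivecomputations/DeepSeek-V3-AWQ",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 64
    }'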

Works, at 5.2 tokens/s on 8xA100.

vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.99 \
    --swap-space 32 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --dtype float16

Works at 5.2 tokens/s on 8xA100 as well.
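
For anyone who wants to reproduce a tokens/s figure like this, a crude single-request measurement (a rough sketch, not a proper benchmark; batch size 1, and ignore_eos is a vLLM extension that keeps the generation length fixed):

# Time a fixed 256-token generation and divide tokens by elapsed seconds.
start=$(date +%s)
curl -s http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "cognitivecomputations/DeepSeek-R1-AWQ",
         "prompt": "Hello",
         "max_tokens": 256,
         "ignore_eos": true}' > /dev/null
end=$(date +%s)
echo "approx tokens/s: $(( 256 / (end - start) ))"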

Can I use it with several 3090s in Docker with CPU offload? Is it possible to start the model in CPU-only mode?

Can't say for sure, but it's highly unlikely...

Cognitive Computations org

Currently it will error out when using CPU offload, and even if it's eventually supported, it will still be extremely slow. @kuliev-vitaly
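
For reference, the offload path in question is vLLM's --cpu-offload-gb flag (GiB of model weights kept in CPU RAM, per GPU). A hypothetical invocation might look like the sketch below; per the above, it currently errors out for this model and would be very slow even if it loaded:

# Hypothetical: 4 GPUs (e.g. 3090s), offloading 64 GiB of weights per GPU to CPU RAM.
vllm serve cognitivecomputations/DeepSeek-V3-AWQ \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --cpu-offload-gb 64 \
    --dtype float16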

On 8xA800, does this command work?

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 12345 \
    --max-model-len 65536 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --quantization moe_wna16 \
    --gpu-memory-utilization 0.97 \
    --kv-cache-dtype fp8_e5m2 \
    --calculate-kv-scales \
    --served-model-name deepseek-chat \
    --model cognitivecomputations/DeepSeek-V3-AWQ

Cognitive Computations org

@traphix Just try it.

I've tried this command, but my server got stuck. There was no error output; it was just completely dead.

All parameters were loaded into VRAM, and then it got stuck.

docker run -d \
    --restart always \
    --name deepseek-chat \
    --hostname deepseek-chat \
    --network host \
    --ipc=host \
    --gpus all \
    -v /data/model-cache/deepseek-ai/DeepSeek-V3-AWQ:/DeepSeek-V3-AWQ \
    vllm/vllm-openai:v0.7.2 \
    --served-model-name deepseek-chat \
    --model /DeepSeek-V3-AWQ \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.97 \
    --quantization moe_wna16 \
    --dtype half \
    --host 0.0.0.0 \
    --port 50521
Cognitive Computations org

@traphix Is there just no output at all? Or at what stage is it stuck?
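
One way to see where it hangs (a debugging sketch, assuming the stock vllm/vllm-openai image respects these env vars): rerun with verbose vLLM and NCCL logging, which will usually surface communication stalls during tensor-parallel startup. Add these two flags to the docker run command above:

    -e VLLM_LOGGING_LEVEL=DEBUG \
    -e NCCL_DEBUG=INFO \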

I will retry and copy logs here soon.

By the way, I notice the config.json param "_name_or_path" is "/root/data/DeepSeek-V3-AWQ".

Should I put the model in the folder "/root/data/DeepSeek-V3-AWQ"?

Cognitive Computations org

@traphix No it doesn't matter.
