How to use the bnb-4bit model?
Are there detailed examples and tutorials available? Thanks!
Here's the rundown of how I got it working locally.
Running Unsloth QwQ-32B with Dynamic Quantization (vLLM)
Prerequisites
- Hardware: At least 2x NVIDIA GPUs with 24GB VRAM each (total 48GB).
- Software: Linux OS, Python 3.9-3.12, NVIDIA CUDA drivers.
Step 1: Install Dependencies
Using uv (Fast Python Env Manager)
- What is uv? A lightweight tool for managing Python environments. Install instructions here.
- Create a virtual environment:
uv venv vllm-env --python 3.12 --seed
source vllm-env/bin/activate
Install vLLM & BitsAndBytes
uv pip install vllm "bitsandbytes>=0.45.0"
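Optional: a quick sanity check that the packages import and both GPUs are visible before moving on (just a habit of mine, not a required step):
# Verify the install and GPU visibility (expects 2 GPUs for the setup below).
import torch
import vllm
import bitsandbytes as bnb

print("vllm:", vllm.__version__)
print("bitsandbytes:", bnb.__version__)
print("visible GPUs:", torch.cuda.device_count())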
Step 2: Run OpenAI-Compatible API Server
Use this command to start the server with dynamic quantization and GPU parallelism:
python -m vllm.entrypoints.openai.api_server \
--model unsloth/QwQ-32B-unsloth-bnb-4bit \
--quantization bitsandbytes \
--load-format bitsandbytes \
--tensor-parallel-size 2 \
--max-model-len 4096
Key Parameters Explained:
- --quantization bitsandbytes: Enables 4-bit quantization to reduce VRAM usage.
- --load-format bitsandbytes: Specifies the quantization format.
- --tensor-parallel-size 2: Distributes the model across your 2 GPUs.
- --max-model-len 4096: Caps the context length so the KV cache fits alongside the weights on 2x 24GB cards (see the offline-inference sketch below for the same settings via the Python API).
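If you'd rather call vLLM from Python than run the API server, the same flags map onto the LLM class. A rough offline-inference sketch follows; I've only tested the server path above, so treat this as an untested equivalent, and the prompt and sampling values are placeholders:
from vllm import LLM, SamplingParams

# Same settings as the server command above, passed to vLLM's offline API.
llm = LLM(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    tensor_parallel_size=2,
    max_model_len=4096,
)

# Placeholder sampling settings; tune for your use case.
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)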
Troubleshooting Tips
Out of Memory (OOM) Errors:
- Ensure you're using both GPUs with --tensor-parallel-size 2.
- Verify your GPUs have ≥24GB VRAM each (total 48GB).
- Reduce --max-model-len if issues persist (lower values shrink the KV cache at the cost of context length).
Distributed Inference Notes:
- For multi-GPU setups, vLLM automatically handles tensor parallelism.
- If using >2 GPUs or multiple nodes, adjust --tensor-parallel-size and follow vLLM's distributed docs.
Check GPU Usage:
nvidia-smi # Ensure GPUs are recognized and not in use.
Quick Usage Example
Once the server runs (default URL: http://localhost:8000), test with curl:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/QwQ-32B-unsloth-bnb-4bit",
"prompt": "Explain quantum computing in simple terms.",
"max_tokens": 100
}'
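If you prefer Python over curl, roughly the same request goes through the openai client (assuming openai>=1.0 is installed; the API key is just a placeholder since the server above isn't started with --api-key):
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint, so the stock client works as-is.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    prompt="Explain quantum computing in simple terms.",
    max_tokens=100,
)
print(resp.choices[0].text)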
Resources
- Unsloth Tutorial: How to Run QwQ-32B Effectively
- vLLM Docs: Quantization Guide | Distributed Serving
Really appreciate it, thank you very much!
Is it possible to run it locally using unsloth FastLanguageModel?
I tried hard, but the decoded tokens get stuck in an infinite loop. Do I need to define a LogitsProcessor myself?
Any advice or tutorials would be appreciated. Thanks!
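For reference, here's roughly what I'm running. It's a minimal sketch, and the sampling settings (temperature ~0.6, top_p ~0.95, plus a repetition penalty) are my own guesses at taming the looping rather than anything official:
from unsloth import FastLanguageModel
from transformers import TextStreamer

# Load the dynamic 4-bit checkpoint with Unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/QwQ-32B-unsloth-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's inference mode

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings are guesses: temperature/top_p plus a repetition penalty
# to discourage the token loop, but it still doesn't fully stop it for me.
_ = model.generate(
    input_ids,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)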
Thanks for the dual GPU tensor parallel example!
I have an example for single GPU over here:
https://huggingface.co/unsloth/QwQ-32B-GGUF/discussions/4#67cdfd1e06912c735446842e
Using about 38GB out of 48GB VRAM on an RTX A6000, getting around 20 tok/sec. Not as fast or small as GGUF, but probably better quality?
Agreed. I got an example running even on a single V100, at only 9.9 tok/sec, but the quality is good. The Flappy Bird test took only 2 shots in my case.
Huh, I was able to 1-shot flappy bird with bartowski IQ4_XS
with 32k context on llama.cpp at over 30 tok/sec on a single 3090TI 24GB VRAM.
Also, @abyssalaxioms, how did you get the BitsAndBytes model to work with Tensor Parallel??? I can only get it working with a single GPU. Do you have NVLink installed or something special?
I have the bug and info here: https://github.com/vllm-project/vllm/issues/14449
I'll start from scratch and try your version, though tbh your version looks kinda AI-generated so I'm not sure lol... Nope, it hangs with both RTX A6000 GPUs at 100% utilization, and when I Ctrl+C I have to manually clean up with ps aux | grep vllm && kill blah blah
...
How did you get tensor parallel to work with bitsandbytes???
@ubergarm
Wow, incredible tokens per second and such an extended context window! 😃
I only used a 16,384-token context (double the standard 8192) in my two-shot tests and prompted the model to review and fix the bug by itself. It all went well.
It appears that the BnB 4-bit model may not be runnable locally via vLLM on a single RTX 3090 Ti 24GB card. Even my RTX 4090 24GB setup failed to run it successfully.
Additionally, the BnB format might not be compatible with Tensor Parallelism; at least in my tests with 4x NVIDIA V100 GPUs, this limitation persists.
For BnB-quantized models, vLLM currently does not support Tensor Parallelism. With the current implementation, multi-GPU serving has to rely on Pipeline Parallelism instead.
Hey bud, can you show a screenshot or logs of your setup working with tensor parallel? If not, I'll assume you simply AI-generated some slop and didn't actually test it, and your example is possibly wrong and misleading without some details.
It may require a 4090 or a newer CUDA architecture, which would rule out the 3090 Ti and RTX A6000, since they are only compute capability 8.6 rather than 9.0+.
# 3090TI FE 24 GB VRAM GPU
$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
8.6
I finally got QwQ running with tensor parallel on both vLLM and SGLang. tl;dr: only certain quants are supported, and you'll likely need to disable P2P unless you're using a hacked NVIDIA driver. As neoragex2002 mentioned, BnB quants are not supported.
Full details here:
https://github.com/vllm-project/vllm/issues/14449#issuecomment-2718245030