How to use the bnb-4bit model?
Are there detailed examples and tutorials available? Thanks!
Here's the rundown of how I got it working locally.
Running Unsloth QwQ-32B with Dynamic Quantization (vLLM)
Prerequisites
- Hardware: At least 2x NVIDIA GPUs with 24GB VRAM each (total 48GB).
- Software: Linux OS, Python 3.9-3.12, NVIDIA CUDA drivers.
Step 1: Install Dependencies
Using uv (Fast Python Env Manager)
- What is uv? A lightweight tool for managing Python environments. Install instructions here.
- Create a virtual environment:
uv venv vllm-env --python 3.12 --seed
source vllm-env/bin/activate
Install vLLM & BitsAndBytes
uv pip install vllm "bitsandbytes>=0.45.0"
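Optional: a quick sanity check that the packages import and both GPUs are visible before moving on (just a habit of mine, not a required step):
# Verify the install and GPU visibility (expects 2 GPUs for the setup below).
import torch
import vllm
import bitsandbytes as bnb

print("vllm:", vllm.__version__)
print("bitsandbytes:", bnb.__version__)
print("visible GPUs:", torch.cuda.device_count())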
Step 2: Run OpenAI-Compatible API Server
Use this command to start the server with dynamic quantization and GPU parallelism:
python -m vllm.entrypoints.openai.api_server \
--model unsloth/QwQ-32B-unsloth-bnb-4bit \
--quantization bitsandbytes \
--load-format bitsandbytes \
--tensor-parallel-size 2 \
--max-model-len 4096
Key Parameters Explained:
- --quantization bitsandbytes: Enables 4-bit quantization to reduce VRAM usage.
- --load-format bitsandbytes: Specifies the quantization format.
- --tensor-parallel-size 2: Distributes the model across your 2 GPUs.
- --max-model-len 4096: Caps the context length so the KV cache fits alongside the weights on 2x 24GB cards (see the offline-inference sketch below for the same settings via the Python API).
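If you'd rather call vLLM from Python than run the API server, the same flags map onto the LLM class. A rough offline-inference sketch follows; I've only tested the server path above, so treat this as an untested equivalent, and the prompt and sampling values are placeholders:
from vllm import LLM, SamplingParams

# Same settings as the server command above, passed to vLLM's offline API.
llm = LLM(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    tensor_parallel_size=2,
    max_model_len=4096,
)

# Placeholder sampling settings; tune for your use case.
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)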
Troubleshooting Tips
Out of Memory (OOM) Errors:
- Ensure you're using both GPUs with --tensor-parallel-size 2.
- Verify your GPUs have ≥24GB VRAM each (total 48GB).
- Reduce --max-model-len if issues persist (lower values shrink the KV cache at the cost of context length).
Distributed Inference Notes:
- For multi-GPU setups, vLLM automatically handles tensor parallelism.
- If using >2 GPUs or multiple nodes, adjust --tensor-parallel-size and follow vLLM's distributed docs.
Check GPU Usage:
nvidia-smi # Ensure GPUs are recognized and not in use.
Quick Usage Example
Once the server runs (default URL: http://localhost:8000), test with curl:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/QwQ-32B-unsloth-bnb-4bit",
"prompt": "Explain quantum computing in simple terms.",
"max_tokens": 100
}'
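If you prefer Python over curl, roughly the same request goes through the openai client (assuming openai>=1.0 is installed; the API key is just a placeholder since the server above isn't started with --api-key):
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint, so the stock client works as-is.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    prompt="Explain quantum computing in simple terms.",
    max_tokens=100,
)
print(resp.choices[0].text)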
Resources
- Unsloth Tutorial: How to Run QwQ-32B Effectively
- vLLM Docs: Quantization Guide | Distributed Serving
Really appreciate it, thank you very much!
Is it possible to run it locally using unsloth FastLanguageModel?
I tried hard, but the decoded tokens get stuck in an infinite loop. Do I need to define a LogitsProcessor myself?
Any advice or tutorials would be appreciated. Thanks!
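For reference, here's roughly what I'm running. It's a minimal sketch, and the sampling settings (temperature ~0.6, top_p ~0.95, plus a repetition penalty) are my own guesses at taming the looping rather than anything official:
from unsloth import FastLanguageModel
from transformers import TextStreamer

# Load the dynamic 4-bit checkpoint with Unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/QwQ-32B-unsloth-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's inference mode

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings are guesses: temperature/top_p plus a repetition penalty
# to discourage the token loop, but it still doesn't fully stop it for me.
_ = model.generate(
    input_ids,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)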
Thanks for the dual GPU tensor parallel example!
I have an example for single GPU over here:
https://huggingface.co/unsloth/QwQ-32B-GGUF/discussions/4#67cdfd1e06912c735446842e
Using about 38GB out of 48GB VRAM on an RTX A6000, getting around 20 tok/sec. Not as fast or small as GGUF, but probably better quality?
Agreed. I got an example running even on a single V100, at only 9.9 tok/sec, but the quality is good. The Flappy Bird test took only 2 shots in my case.
Huh, I was able to 1-shot flappy bird with bartowski IQ4_XS
with 32k context on llama.cpp at over 30 tok/sec on a single 3090TI 24GB VRAM.
Also, @abyssalaxioms, how did you get the BitsAndBytes model to work with Tensor Parallel??? I can only get it working with a single GPU. Do you have NVLink installed or something special?
I have the bug and info here: https://github.com/vllm-project/vllm/issues/14449
I'll start from scratch and try your version, though tbh your version looks kinda AI-generated so I'm not sure lol... Nope, it hangs with both RTX A6000 GPUs at 100% utilization, and when I Ctrl+C I have to manually clean up with ps aux | grep vllm && kill blah blah
...
How did you get tensor parallel to work with bitsandbytes???
@ubergarm
Wow, incredible tokens per second and such an extended context window! 😃
I only used a 16,384-token context (double the standard 8192) in my two-shot tests and prompted the model to review and fix the bug by itself. It all went well.
It appears that the BnB 4-bit model may not be runnable locally via vLLM on a single RTX 3090 Ti 24GB card. Even my RTX 4090 24GB setup failed to run it successfully.
Additionally, the BnB format might not be compatible with Tensor Parallelism; at least in my tests with 4x NVIDIA V100 GPUs, this limitation persists.
For BnB-quantized models, vLLM currently does not support Tensor Parallelism. With the current implementation, multi-GPU serving has to rely on Pipeline Parallelism instead.
Hey bud, can you show a screenshot or logs of your setup working with tensor parallel? If not, I'll assume you simply AI-generated some slop and didn't actually test it, and your example is possibly wrong and misleading without some details.
It may require a 4090 or a newer CUDA architecture, which would rule out the 3090 Ti and RTX A6000, since they are only compute capability 8.6 rather than 9.0+.
# 3090TI FE 24 GB VRAM GPU
$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
8.6
I finally got QwQ running with tensor parallel on both vLLM and SGLang. tl;dr: only certain quants are supported, and you'll likely need to disable P2P unless you're using a hacked NVIDIA driver. As neoragex2002 mentioned, BnB quants are not supported.
Full details here:
https://github.com/vllm-project/vllm/issues/14449#issuecomment-2718245030