How to use the bnb-4bit model?

#4
opened by neoragex2002

Are there detailed examples and tutorials available? Thanks!

Here's the rundown of how I got it working locally for me.

Running Unsloth QwQ-32B with Dynamic Quantization (vLLM)

Prerequisites

  1. Hardware: At least 2x NVIDIA GPUs with 24GB VRAM each (total 48GB).
  2. Software: Linux OS, Python 3.9-3.12, NVIDIA CUDA drivers.

Step 1: Install Dependencies

Using uv (Fast Python Env Manager)

  • What is uv? A lightweight, fast tool for managing Python environments; see the uv documentation for installation instructions.
  • Create a virtual environment:
    uv venv vllm-env --python 3.12 --seed
    source vllm-env/bin/activate
    

Install vLLM & BitsAndBytes

uv pip install vllm "bitsandbytes>=0.45.0"
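
To double-check which versions landed in the environment (both packages expose a __version__ attribute), a quick sanity check from Python:

import bitsandbytes
import vllm

# The unsloth bnb-4bit checkpoints need bitsandbytes >= 0.45.0.
print("vllm:", vllm.__version__)
print("bitsandbytes:", bitsandbytes.__version__)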

Step 2: Run OpenAI-Compatible API Server

Use this command to start the server with dynamic quantization and GPU parallelism:

python -m vllm.entrypoints.openai.api_server \
    --model unsloth/QwQ-32B-unsloth-bnb-4bit \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --tensor-parallel-size 2 \
    --max-model-len 4096

Key Parameters Explained:

  • --quantization bitsandbytes: Enables 4-bit quantization to reduce VRAM usage.
  • --load-format bitsandbytes: Specifies the quantization format.
  • --tensor-parallel-size 2: Distributes the model across your 2 GPUs.
  • --max-model-len 4096: Caps the maximum context length; QwQ-32B supports longer contexts, but a lower cap keeps the KV cache within the available VRAM.
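
Once the server reports it is ready, you can confirm the model registered by querying the /v1/models endpoint of the OpenAI-compatible API. A minimal stdlib-only sketch:

import json
import urllib.request

# List the models served by the local vLLM instance.
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    data = json.load(resp)

# Should print unsloth/QwQ-32B-unsloth-bnb-4bit once loading has finished.
for model in data["data"]:
    print(model["id"])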

Troubleshooting Tips

  1. Out of Memory (OOM) Errors:

    • Ensure you're using both GPUs with --tensor-parallel-size 2.
    • Verify your GPUs have ≥24GB VRAM each (total 48GB).
    • Reduce --max-model-len if issues persist; a shorter context window needs less KV-cache memory.
  2. Distributed Inference Notes:

    • For multi-GPU setups, vLLM handles tensor-parallel sharding automatically once --tensor-parallel-size is set.
    • If using >2 GPUs or multiple nodes, adjust --tensor-parallel-size and follow vLLM's distributed docs.
  3. Check GPU Usage:

    nvidia-smi  # Ensure GPUs are recognized and not in use.
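
If you prefer checking from Python, here is a minimal sketch using PyTorch (pulled in as a vLLM dependency) to confirm both GPUs are visible and report their free memory:

import torch

# Tensor parallelism across 2 GPUs needs both devices visible to CUDA.
assert torch.cuda.device_count() >= 2, "expected at least 2 GPUs"
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # values in bytes
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} - {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")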
    

Quick Usage Example

Once the server runs (default URL: http://localhost:8000), test with curl:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/QwQ-32B-unsloth-bnb-4bit",
    "prompt": "Explain quantum computing in simple terms.",
    "max_tokens": 100
  }'
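
The same request from Python, as a minimal sketch using the openai client (assumes pip install openai; the API key value is a placeholder, since vLLM does not check it by default):

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    prompt="Explain quantum computing in simple terms.",
    max_tokens=100,
)
print(response.choices[0].text)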

Really appreciate it, thank you very much!

Is there any way to run it locally using unsloth's FastLanguageModel?
I tried hard, but the decoded tokens end up in an infinite loop. Do I need to define a LogitsProcessor myself?
Any advice or tutorial would be appreciated. Thanks!
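
For context, a rough, untested sketch of the kind of FastLanguageModel setup in question; the argument and sampling values are assumptions, and looping is often caused by greedy decoding or a missing chat template rather than by a custom LogitsProcessor, so treat this as a starting point only:

from unsloth import FastLanguageModel

# Load the 4-bit checkpoint (assumed arguments; adjust max_seq_length to your VRAM).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/QwQ-32B-unsloth-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling (rather than greedy decoding) tends to stop QwQ from looping; values are assumptions.
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))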

@abyssalaxioms

Thanks for the dual GPU tensor parallel example!

I have an example for single GPU over here:

https://huggingface.co/unsloth/QwQ-32B-GGUF/discussions/4#67cdfd1e06912c735446842e

Using about 38GB out of 48GB VRAM on an RTX A6000, getting around 20 tok/sec. Not as fast or small as GGUF, but probably better quality?

Agreed. I got an example running even on a single V100, at only 9.9 tok/sec, but the quality is good. Flappy Bird took only two shots in my case.

@neoragex2002

Huh, I was able to 1-shot flappy bird with bartowski IQ4_XS with 32k context on llama.cpp at over 30 tok/sec on a single 3090TI 24GB VRAM.

Also, @abyssalaxioms how did you get the BitsAndBytes model to work with tensor parallel??? I can only get it working with a single GPU. Do you have NVLink installed or something special?

I have the bug and info here: https://github.com/vllm-project/vllm/issues/14449

I'll start from scratch and try your version, though tbh it looks kinda AI-generated so I'm not sure lol... nope, it hangs with both RTX A6000 GPUs at 100% utilization, and when I Ctrl+C I have to manually clean up with ps aux | grep vllm && kill blah blah...

How did you get tensor parallel to work with bitsandbytes ???

@ubergarm
Wow, incredible tokens per second and such an extended context window! 😃

I only used a 16,384-token context (double the standard 8192) in my two-shot tests and prompted the model to review and fix the bug by itself; it all went well.

It appears that the BnB 4-bit model cannot be run locally via vLLM on a single RTX 3090 Ti 24GB card. Even my RTX 4090 24GB setup failed to run it successfully.

Additionally, the BnB format does not appear to be compatible with tensor parallelism; at least in my tests with 4x NVIDIA V100 GPUs the limitation persists.

For BnB-quantized models, vLLM currently does not support tensor parallelism, so multi-GPU runs have to rely on pipeline parallelism (vLLM's --pipeline-parallel-size option) instead.

@abyssalaxioms

Hey bud, can you show a screenshot or logs of your setup working with tensor parallel? If not, I'll assume you simply AI-generated some slop and didn't actually test it, and your example may be wrong and misleading without those details.

It may require a 4090 or a newer CUDA architecture, which would rule out the 3090 Ti and RTX A6000 since they are only compute capability 8.6, not 9.0+.

# 3090TI FE 24 GB VRAM GPU
$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
8.6

@abyssalaxioms @neoragex2002

I finally got QwQ running with tensor parallel on both vLLM and SGLang. tl;dr: only certain quants are supported, and you likely need to disable P2P unless you're using a hacked NVIDIA driver. As neoragex2002 mentioned, BnB quants are not supported.

Full details here:

https://github.com/vllm-project/vllm/issues/14449#issuecomment-2718245030
