This is a BitBLAS implementation for the reproduced 1.58-bit model from 1bitLLM/bitnet_b1_58-3B. We replaced the original simulated Int8x3bit quantized inference kernel with the BitBLAS INT8xINT2 kernel. We also evaluated the model's correctness and performance through `eval_correctness.py` and `benchmark_inference_latency.py`.
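As a quick-start sketch, assuming `eval_correctness.py` runs with its default arguments (check the script's options in your checkout), the correctness check can be launched like this:

```bash
# move to the BitNet integration directory
cd /root/to/BitBLAS/integration/BitNet

# compare outputs of the BitBLAS INT8xINT2 kernel against the reference model
python3 eval_correctness.py
```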
Latest News
- 08/09/2024 ✨: We provide a more efficient implementation of BitNet with vLLM, which requires specially converted model checkpoints. To create the checkpoints and learn how to deploy them, please check out Make Checkpoints for vLLM.
Make Checkpoints for vLLM
We provide two scripts to make the checkpoints for vLLM. The first script is `generate_bitnet_model_native_format.sh`, which is used to make a checkpoint with fp16 uncompressed metadata. The main difference from the original checkpoint is the `quant_config.json`, which allows vLLM to load the model and execute it with a quant extension.
```bash
# move to the integration directory
cd /root/to/BitBLAS/integration/BitNet
# make the checkpoint
./maint/generate_bitnet_model_native_format.sh
# the output ckpt will be saved in the `./models/ckpt_bitnet_b1_58-3B` directory
```
The second script is `generate_bitnet_model_bitblas_format.sh`, which is used to make a checkpoint with BitBLAS compressed metadata. This avoids the online dequantize stage during vLLM profiling, which leads to more efficient memory utilization.
```bash
./maint/generate_bitnet_model_bitblas_format.sh ./models/ckpt_bitnet_b1_58-3B ./models/ckpt_bitnet_b1_58-3B_bitblas
# the output ckpt will be saved in the `./models/ckpt_bitnet_b1_58-3B_bitblas` directory
```
Finally, you can use the checkpoint in vLLM with:
```bash
cd vllm_workspace
# inference with the ckpt with fp16 uncompressed metadata
python3 inference_with_native_format.py
# inference with the ckpt with BitBLAS compressed metadata
python3 inference_with_bitblas_format.py
```
BitBLAS Results
Performance
Note: To reproduce the results of BitBLAS, please check out `benchmark_inference_latency.py`. To reproduce the results of the original model, please check out the 1bitLLM/bitnet_b1_58-3B repo.
| Model | Device | batchsize | in_seq | model | bitnet-1.58b-3b-huggingface | bitnet-1.58b-3b-bitblas |
|---|---|---|---|---|---|---|
| bitnet_b1_58-3B | A100 | 1 | 1 | LLAMA-3B | 177.6729107 | 64.17962909 |
| bitnet_b1_58-3B | A100 | 128 | 1 | LLAMA-3B | 188.6145592 | 63.48158518 |
| bitnet_b1_58-3B | A100 | 1 | 2048 | LLAMA-3B | 348.7066031 | 202.6877999 |
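The latency numbers above can be collected with the benchmark script shipped alongside this integration; the sketch below assumes the default configuration covers the batch sizes and sequence lengths in the table (otherwise adjust the script's arguments accordingly):

```bash
# run from the integration directory after preparing the model
cd /root/to/BitBLAS/integration/BitNet

# measure end-to-end inference latency with the BitBLAS INT8xINT2 kernel
python3 benchmark_inference_latency.py
```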
On-the-Fly GPU Memory Footprint
We measured the GPU memory footprint through the `nvidia-smi` command. Please check out `nvidia_measure_memory.sh` to get the real-time GPU memory usage, and then start a `benchmark_model_10k_loops.py` workload to measure the overall GPU memory usage.
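A simple way to combine the two is to run the memory monitor in one terminal and the workload in another; the script locations below are assumptions, so adjust the paths to your checkout:

```bash
# Terminal 1: poll nvidia-smi for real-time GPU memory usage (path is an assumption)
./maint/nvidia_measure_memory.sh

# Terminal 2: run the long inference loop so memory usage reaches a steady state
python3 benchmark_model_10k_loops.py
```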
| Model | Device | batchsize | in_seq | bitnet-1.58b-3b-huggingface | bitnet-1.58b-3b-bitblas |
|---|---|---|---|---|---|
| bitnet_b1_58-3B | A100 | 1 | 1 | 7595 MB | 1729 MB |
| bitnet_b1_58-3B | A100 | 128 | 1 | 7677 MB | 1789 MB |
| bitnet_b1_58-3B | A100 | 1 | 2048 | 8731 MB | 3163 MB |
PPL and Zero-shot Accuracy
The numbers are reported from 1bitLLM/bitnet_b1_58-3B; please check out `eval_ppl.py`.
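As a sketch, the perplexity evaluation might be launched as below; the `--hf_path` and `--seqlen` flags are assumptions about the script's interface (check its argument parser), not a confirmed command line:

```bash
# evaluate perplexity of the reproduced checkpoint (flag names are assumptions)
python3 eval_ppl.py --hf_path 1bitLLM/bitnet_b1_58-3B --seqlen 2048
```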
PPL and zero-shot accuracy:
| Models | PPL | ARCe | ARCc | HS | BQ | OQ | PQ | WGe | Avg |
|---|---|---|---|---|---|---|---|---|---|
| FP16 700M (reported) | 12.33 | 54.7 | 23.0 | 37.0 | 60.0 | 20.2 | 68.9 | 54.8 | 45.5 |
| BitNet b1.58 700M (reported) | 12.87 | 51.8 | 21.4 | 35.1 | 58.2 | 20.0 | 68.1 | 55.2 | 44.3 |
| BitNet b1.58 700M (reproduced) | 12.78 | 51.4 | 21.8 | 35.0 | 59.6 | 20.6 | 67.5 | 55.4 | 44.5 |
| FP16 1.3B (reported) | 11.25 | 56.9 | 23.5 | 38.5 | 59.1 | 21.6 | 70.0 | 53.9 | 46.2 |
| BitNet b1.58 1.3B (reported) | 11.29 | 54.9 | 24.2 | 37.7 | 56.7 | 19.6 | 68.8 | 55.8 | 45.4 |
| BitNet b1.58 1.3B (reproduced) | 11.19 | 55.8 | 23.7 | 37.6 | 59.0 | 20.2 | 69.2 | 56.0 | 45.9 |
| FP16 3B (reported) | 10.04 | 62.1 | 25.6 | 43.3 | 61.8 | 24.6 | 72.1 | 58.2 | 49.7 |
| BitNet b1.58 3B (reported) | 9.91 | 61.4 | 28.3 | 42.9 | 61.5 | 26.6 | 71.5 | 59.3 | 50.2 |
| BitNet b1.58 3B (reproduced) | 9.88 | 60.9 | 28.0 | 42.3 | 58.3 | 26.0 | 71.4 | 60.3 | 49.6 |
The differences between the reported numbers and the reproduced results likely stem from variations in training data processing, seeds, or other random factors.
Citations
```bibtex
@article{ma2024era,
  title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
  author={Ma, Shuming and Wang, Hongyu and Ma, Lingxiao and Wang, Lei and Wang, Wenhui and Huang, Shaohan and Dong, Li and Wang, Ruiping and Xue, Jilong and Wei, Furu},
  journal={arXiv preprint arXiv:2402.17764},
  year={2024}
}
```