|
--- |
|
license: llama3.1 |
|
train: false |
|
inference: false |
|
pipeline_tag: text-generation |
|
--- |
|
This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> all 4-bit (group-size=64) quantized version of <a href="https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B">Hermes-3-Llama-3.1-70B</a>.
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png) |
|
|
|
Model size and decoding speed should be similar to the <a href="https://huggingface.co/mobiuslabsgmbh/Llama-3.1-70b-instruct_4bitgs64_hqq">Llama-3.1-70b-instruct_4bitgs64_hqq</a> version.
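
For reference, a checkpoint like this one can be reproduced with HQQ roughly as follows (a minimal sketch; the exact settings used to produce the published weights may differ slightly, and quantizing a 70B model requires substantial memory):

```Python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

#Load the original full-precision model
model = AutoModelForCausalLM.from_pretrained('NousResearch/Hermes-3-Llama-3.1-70B', torch_dtype=torch.bfloat16)

#All 4-bit weights with group-size 64, matching this checkpoint
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)

#Quantize and save
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=torch.bfloat16, device='cuda')
AutoHQQHFModel.save_quantized(model, 'Hermes-3-Llama-3.1-70B_4bitgs64_hqq')
```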
|
|
|
## Usage |
|
First, install the dependencies:
|
```bash
|
pip install git+https://github.com/mobiusml/hqq.git #master branch fix |
|
pip install bitblas |
|
``` |
|
Also, make sure you are using torch `2.4.0` or newer, or the nightly build.
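
A quick way to verify your environment before running the sample code (optional sketch):

```Python
import torch
print(torch.__version__)          #should be >= 2.4.0
print(torch.cuda.is_available())  #the optimized torchao/bitblas backends expect a CUDA GPU
```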
|
|
|
Then you can use the sample code below: |
|
``` Python |
|
import torch |
|
from transformers import AutoTokenizer |
|
from hqq.models.hf.base import AutoHQQHFModel |
|
from hqq.utils.patching import * |
|
from hqq.core.quantize import * |
|
from hqq.utils.generation_hf import HFGenerator |
|
|
|
#Load the model |
|
################################################### |
|
model_id = 'mobiuslabsgmbh/Hermes-3-Llama-3.1-70B_4bitgs64_hqq' |
|
|
|
compute_dtype = torch.bfloat16 #bfloat16 for torchao, float16 for bitblas |
|
cache_dir = '.' |
|
model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype) |
|
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir) |
|
|
|
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1) |
|
patch_linearlayers(model, patch_add_quant_config, quant_config) |
|
|
|
#Use optimized inference kernels |
|
################################################### |
|
HQQLinear.set_backend(HQQBackend.PYTORCH) |
|
#prepare_for_inference(model) #default backend |
|
prepare_for_inference(model, backend="torchao_int4") |
|
#prepare_for_inference(model, backend="bitblas") #takes a while to init... |
|
|
|
#Generate |
|
################################################### |
|
#For longer context, make sure to allocate enough cache via the cache_size= parameter |
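#For example (illustrative values, not tuned):
#gen = HFGenerator(model, tokenizer, max_new_tokens=4000, cache_size=8192, do_sample=True, compile="partial").warmup()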
|
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while |
|
|
|
gen.generate("Write an essay about large language models", print_tokens=True) |
|
gen.generate("Tell me a funny joke!", print_tokens=True) |
|
gen.generate("How to make a yummy chocolate cake?", print_tokens=True) |
|
|
|
``` |
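
If you prefer the standard `transformers` generation API over `HFGenerator`, something along these lines should also work (a rough sketch using the tokenizer's built-in chat template; the compiled `HFGenerator` path above will typically be faster):

```Python
#Standard transformers generation (sketch) - assumes the model/tokenizer loaded in the snippet above
messages = [{"role": "user", "content": "Tell me a funny joke!"}]
inputs   = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=256, do_sample=True)

print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```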