|
--- |
|
license: llama3.1 |
|
train: false |
|
inference: false |
|
pipeline_tag: text-generation |
|
--- |
|
This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> all 4-bit (group-size=64) quantized version of <a href="https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B">Hermes-3-Llama-3.1-70B</a>.
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png) |
|
|
|
Model size and decoding speed should be similar to the <a href="https://huggingface.co/mobiuslabsgmbh/Llama-3.1-70b-instruct_4bitgs64_hqq">Llama-3.1-70b-instruct_4bitgs64_hqq</a> version.
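
For reference, a checkpoint like this one can be reproduced with HQQ roughly as follows (a minimal sketch; the exact settings used to produce the published weights may differ slightly, and quantizing a 70B model requires substantial memory):

```Python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

#Load the original full-precision model
model = AutoModelForCausalLM.from_pretrained('NousResearch/Hermes-3-Llama-3.1-70B', torch_dtype=torch.bfloat16)

#All 4-bit weights with group-size 64, matching this checkpoint
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)

#Quantize and save
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=torch.bfloat16, device='cuda')
AutoHQQHFModel.save_quantized(model, 'Hermes-3-Llama-3.1-70B_4bitgs64_hqq')
```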
|
|
|
## Usage |
|
First, install the dependencies:
|
```bash
|
pip install git+https://github.com/mobiusml/hqq.git #master branch fix |
|
pip install bitblas |
|
``` |
|
Also, make sure you are using torch `2.4.0` or newer, or the nightly build.
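
A quick way to verify your environment before running the sample code (optional sketch):

```Python
import torch
print(torch.__version__)          #should be >= 2.4.0
print(torch.cuda.is_available())  #the optimized torchao/bitblas backends expect a CUDA GPU
```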
|
|
|
Then you can use the sample code below: |
|
``` Python |
|
import torch |
|
from transformers import AutoTokenizer |
|
from hqq.models.hf.base import AutoHQQHFModel |
|
from hqq.utils.patching import * |
|
from hqq.core.quantize import * |
|
from hqq.utils.generation_hf import HFGenerator |
|
|
|
#Load the model |
|
################################################### |
|
model_id = 'mobiuslabsgmbh/Hermes-3-Llama-3.1-70B_4bitgs64_hqq' |
|
|
|
compute_dtype = torch.bfloat16 #bfloat16 for torchao, float16 for bitblas |
|
cache_dir = '.' |
|
model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype) |
|
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir) |
|
|
|
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1) |
|
patch_linearlayers(model, patch_add_quant_config, quant_config) |
|
|
|
#Use optimized inference kernels |
|
################################################### |
|
HQQLinear.set_backend(HQQBackend.PYTORCH) |
|
#prepare_for_inference(model) #default backend |
|
prepare_for_inference(model, backend="torchao_int4") |
|
#prepare_for_inference(model, backend="bitblas") #takes a while to init... |
|
|
|
#Generate |
|
################################################### |
|
#For longer context, make sure to allocate enough cache via the cache_size= parameter |
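#For example (illustrative values, not tuned):
#gen = HFGenerator(model, tokenizer, max_new_tokens=4000, cache_size=8192, do_sample=True, compile="partial").warmup()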
|
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while |
|
|
|
gen.generate("Write an essay about large language models", print_tokens=True) |
|
gen.generate("Tell me a funny joke!", print_tokens=True) |
|
gen.generate("How to make a yummy chocolate cake?", print_tokens=True) |
|
|
|
``` |
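
If you prefer the standard `transformers` generation API over `HFGenerator`, something along these lines should also work (a rough sketch using the tokenizer's built-in chat template; the compiled `HFGenerator` path above will typically be faster):

```Python
#Standard transformers generation (sketch) - assumes the model/tokenizer loaded in the snippet above
messages = [{"role": "user", "content": "Tell me a funny joke!"}]
inputs   = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=256, do_sample=True)

print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```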