---
title: README
emoji: πŸ‡
colorFrom: pink
colorTo: indigo
sdk: static
pinned: false
---
# Compressed LLM Model Zone
The models are prepared by [Visual Informatics Group @ University of Texas at Austin (VITA-group)](https://vita-group.github.io/) and
[Center for Applied Scientific Computing](https://computing.llnl.gov/casc) at [LLNL](https://www.llnl.gov/).
Credits to Ajay Jaiswal, Jinhao Duan, Zhenyu Zhang, Zhangheng Li, Lu Yin, Shiwei Liu and Junyuan Hong.
License: [MIT License](https://opensource.org/license/mit/)
## Setup environment
```shell
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.31.0
pip install accelerate
pip install auto-gptq # for gptq
```
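Before downloading any checkpoints, it can help to confirm that the pinned versions and a CUDA device are visible. This is a minimal sanity check added for convenience, not part of the original setup:
```python
import torch
import transformers

# Expect torch 2.0.0+cu117 and transformers 4.31.0 from the pins above.
print('torch:', torch.__version__)
print('transformers:', transformers.__version__)
# All examples below assume a CUDA-capable GPU is available.
print('CUDA available:', torch.cuda.is_available())
```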
## How to use pruned models
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Choose the base model, the pruning method, and the sparsity ratio.
base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'
comp_degree = 0.2
model_path = f'compressed-llm/{base_model}_{comp_method}'

# Each sparsity level is stored as a separate revision, e.g. 's0.2'.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    revision=f's{comp_degree}',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)
# The pruned checkpoints reuse the original Llama-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')

input_ids = tokenizer('Hello! I am a compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
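Because each sparsity level lives in its own revision, several levels can be compared with a small loop. The sketch below is an assumption-laden convenience, not part of the original instructions: it assumes revisions such as `s0.1` and `s0.3` exist alongside `s0.2`, and uses the language-modeling loss as a quick quality probe; check the model card for the levels actually published.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
text = 'Compressed models trade a little accuracy for a lot of efficiency.'
enc = tokenizer(text, return_tensors='pt')

for degree in (0.1, 0.2, 0.3):  # hypothetical sparsity levels; see the model card
    model = AutoModelForCausalLM.from_pretrained(
        'compressed-llm/llama-2-7b_magnitude_unstructured',
        revision=f's{degree}',
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map='auto',
    )
    with torch.no_grad():
        # Per-token cross-entropy on the sample text, exponentiated to perplexity.
        loss = model(enc.input_ids.cuda(), labels=enc.input_ids.cuda()).loss
    print(f's{degree}: perplexity ~ {loss.exp().item():.2f}')
    del model
    torch.cuda.empty_cache()  # free VRAM before loading the next revision
```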
## How to use wanda+gptq models
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# 4-bit GPTQ (group size 128) checkpoint of Llama-2-7B pruned with Wanda 2:4 sparsity.
model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
tokenizer_path = 'meta-llama/Llama-2-7b-hf'

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # If loading fails, either disable the exllama kernel (below)
    # or set inject_fused_attention=False instead.
    disable_exllama=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)

input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```
## How to use gptq models
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# GPTQ-quantized Vicuna-7B; the bit width and group size are selected via the revision.
model_path = 'compressed-llm/vicuna-7b-v1.3_gptq'
tokenizer_path = 'lmsys/vicuna-7b-v1.3'

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # If loading fails, either disable the exllama kernel (below)
    # or set inject_fused_attention=False instead.
    disable_exllama=True,
    device_map='auto',
    revision='2bit_128g',
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)

input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```
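The quantized models accept the usual `generate` keyword arguments, so tokens can, for example, be streamed as they are produced. The sketch below reuses `model` and `tokenizer` from the block above and assumes (untested here) that auto-gptq forwards the `streamer` argument to the underlying `transformers` `generate`:
```python
from transformers import TextStreamer

# Prints tokens to stdout as they are generated, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True)
input_ids = tokenizer('Tell me about model compression.', return_tensors='pt').input_ids.to('cuda')
model.generate(input_ids=input_ids, max_new_tokens=64, streamer=streamer)
```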
## Citations
If you use models from this hub, please consider citing our papers.
```bibtex
@article{jaiswal2023emergence,
  title={The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter},
  author={Jaiswal, Ajay and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang},
  journal={arXiv},
  year={2023}
}
@article{jaiswal2023compressing,
  title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
  author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
  journal={arXiv},
  year={2023}
}
```
For any question, please contact [Junyuan Hong](mailto:[email protected]).