---
title: README
emoji: π
colorFrom: pink
colorTo: indigo
sdk: static
pinned: false
license: mit
short_description: Compressed Large Language Models
---
# Compressed Large Language Models
This repo contains compressed LLMs used in the [Decoding Compressed Trust](https://decoding-comp-trust.github.io/) project.
The models are prepared by [Visual Informatics Group @ University of Texas at Austin (VITA-group)](https://vita-group.github.io/) and
[Center for Applied Scientific Computing](https://computing.llnl.gov/casc) at [LLNL](https://www.llnl.gov/).
License: [MIT License](https://opensource.org/license/mit/)
At a glance (the sketch after this list shows how these map to repository names):
* Models: Llama-2 13b, Llama-2 chat 13b, Vicuna 13b v1.3
* Compression methods:
  - Pruning: Magnitude-based, Wanda, SparseGPT (2:4 semi-structured)
  - Quantization: AWQ, GPTQ (3, 4, 8 bits)
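Each checkpoint lives in a `compressed-llm/{base_model}_{method}` repository, with the compression degree selected via a git revision tag (see the loading examples below). As an illustrative sketch only: `magnitude_unstructured` is the one method tag used in the examples, and the other tags and exact repository names here are assumptions.

```python
# Hypothetical enumeration of pruned-model repository names; method tags
# other than 'magnitude_unstructured' are assumptions for illustration.
base_models = ['llama-2-13b', 'llama-2-chat-13b', 'vicuna-13b-v1.3']
pruning_methods = ['magnitude_unstructured', 'wanda_unstructured', 'sparsegpt_unstructured']

for base in base_models:
    for method in pruning_methods:
        print(f'compressed-llm/{base}_{method}')
```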
## Setup environment
```shell
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.31.0
pip install accelerate
pip install auto-gptq  # required for the GPTQ examples below
```
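After installing, a quick sanity check (a minimal sketch; the expected versions follow the pins above) confirms that the CUDA build of torch is active:

```python
import torch
import transformers

# Verify the pinned versions and that the CUDA 11.7 torch build sees a GPU.
print('torch:', torch.__version__)                 # expected 2.0.0+cu117
print('transformers:', transformers.__version__)   # expected 4.31.0
print('CUDA available:', torch.cuda.is_available())
```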
## How to use models
### How to use pruned models
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'
comp_degree = 0.2  # sparsity ratio
model_path = f'compressed-llm/{base_model}_{comp_method}'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    revision=f's{comp_degree}',  # the revision tag encodes the sparsity degree
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
input_ids = tokenizer('Hello! I am a compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
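The same loading code works for other pruning methods and sparsity degrees by swapping `comp_method` and the `revision` tag. A hedged sketch of a sweep; the set of available revision tags beyond `s0.2` is an assumption here:

```python
import torch
from transformers import AutoModelForCausalLM

base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'
# Revision tags other than 's0.2' are an assumption based on the pattern above.
for comp_degree in [0.1, 0.2, 0.3]:
    model = AutoModelForCausalLM.from_pretrained(
        f'compressed-llm/{base_model}_{comp_method}',
        revision=f's{comp_degree}',
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map='auto',
    )
    # ... evaluate the checkpoint here ...
    del model
    torch.cuda.empty_cache()  # free GPU memory before loading the next one
```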
### How to use Wanda+GPTQ models
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
tokenizer_path = 'meta-llama/Llama-2-7b-hf'
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # inject_fused_attention=False,  # either this or disable_exllama=True below
    disable_exllama=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```
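One caveat on the call above: `max_length=128` counts the prompt tokens toward the 128-token budget. If you want a fixed amount of newly generated text regardless of prompt length, `max_new_tokens` is the standard `transformers` alternative:

```python
# Continues from the snippet above: bound only the generated
# continuation, not prompt + continuation combined.
outputs = model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```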
### How to use GPTQ models
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
model_path = 'compressed-llm/vicuna-7b-v1.3_gptq'
tokenizer_path = 'lmsys/vicuna-7b-v1.3'
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # inject_fused_attention=False,  # either this or disable_exllama=True below
    disable_exllama=True,
    device_map='auto',
    revision='2bit_128g',  # the revision tag selects the quantization setting
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```
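Other bit widths are selected the same way, via the `revision` argument. A minimal sketch of sweeping them, assuming the remaining tags follow the same `{bits}bit_128g` pattern as `2bit_128g` (the exact tag names for the 3/4/8-bit checkpoints are an assumption):

```python
from auto_gptq import AutoGPTQForCausalLM

model_path = 'compressed-llm/vicuna-7b-v1.3_gptq'
# Hypothetical revision tags inferred from the '2bit_128g' example above.
for revision in ['3bit_128g', '4bit_128g', '8bit_128g']:
    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        disable_exllama=True,
        device_map='auto',
        revision=revision,
    )
    # ... evaluate this bit width ...
    del model
```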
## Citations
If you use the models in this hub, please consider citing our papers.
```bibtex
@article{hong2024comptrust,
  title={Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression},
  author={Hong, Junyuan and Duan, Jinhao and Zhang, Chenhui and Li, Zhangheng
          and Xie, Chulin and Lieberman, Kelsey and Diffenderfer, James
          and Bartoldson, Brian and Jaiswal, Ajay and Xu, Kaidi and Kailkhura, Bhavya
          and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Li, Bo},
  journal={arXiv},
  year={2024}
}
```
Some of the models were used in previous publications.
```bibtex
@article{jaiswal2023emergence,
  title={The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter},
  author={Jaiswal, Ajay and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang},
  journal={arXiv},
  year={2023}
}

@article{jaiswal2023compressing,
  title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
  author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
  journal={arXiv},
  year={2023}
}
```
## Acknowledgement
Main credits go to Ajay Jaiswal, Jinhao Duan, Zhangheng Li, and Junyuan Hong. We also thank Zhenyu Zhang, Lu Yin, and Shiwei Liu for their help in preparing some of the models.
For any questions, please contact [Junyuan Hong](mailto:[email protected]).