jyhong836 committed
Commit 5b0fb94 · Parent: 42509a6

Update README.md

Files changed (1): README.md (+100, -4)
README.md CHANGED
@@ -1,10 +1,106 @@
  ---
  title: README
- emoji: 📈
- colorFrom: yellow
- colorTo: red
+ emoji: 🍇
+ colorFrom: pink
+ colorTo: indigo
  sdk: static
  pinned: false
  ---
- 
- Edit this `README.md` markdown file to author your organization card.
+ # Compressed LLM Model Zone
+ 
+ The models are prepared by the [Visual Informatics Group @ University of Texas at Austin (VITA-group)](https://vita-group.github.io/) and LLNL.
+ Credits to Ajay Jaiswal, Jinhao Duan, Zhenyu Zhang, Zhangheng Li, Lu Yin, Shiwei Liu, and Junyuan Hong.
+ 
+ License: [MIT License](https://opensource.org/license/mit/)
+ 
+ Set up the environment:
+ ```shell
+ pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
+ pip install transformers==4.31.0
+ pip install accelerate
+ pip install auto-gptq  # needed for the GPTQ-quantized models
+ ```
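+ 
+ To double-check the install before downloading any model, something like the following should work:
+ ```python
+ # Minimal sanity check: report the pinned library versions and whether a GPU is visible.
+ import torch
+ import transformers
+ 
+ print('torch:', torch.__version__)                 # expect 2.0.0+cu117
+ print('transformers:', transformers.__version__)   # expect 4.31.0
+ print('CUDA available:', torch.cuda.is_available())
+ ```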
+ 
+ How to use the pruned models:
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ base_model = 'llama-2-7b'
+ comp_method = 'magnitude_unstructured'   # compression (pruning) method
+ comp_degree = 0.2                        # compression degree (sparsity)
+ model_path = f'compressed-llm/{base_model}_{comp_method}'
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     revision=f's{comp_degree}',  # the sparsity level is selected via the branch name
+     torch_dtype=torch.float16,
+     low_cpu_mem_usage=True,
+     device_map="auto"
+ )
+ tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
+ input_ids = tokenizer('Hello! I am a compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
+ outputs = model.generate(input_ids, max_new_tokens=128)
+ print(tokenizer.decode(outputs[0]))
+ ```
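+ 
+ If the pruned checkpoints store their zeroed weights in dense form, as the plain `from_pretrained` loading above suggests, the compression degree can be sanity-checked by counting zeros; a minimal sketch, reusing the `model` loaded above:
+ ```python
+ # Fraction of exactly-zero entries across all Linear weights; for the s0.2
+ # revision of a magnitude-pruned model this should come out close to 0.2.
+ import torch
+ 
+ zeros, total = 0, 0
+ for module in model.modules():
+     if isinstance(module, torch.nn.Linear):
+         weight = module.weight.detach()
+         zeros += (weight == 0).sum().item()
+         total += weight.numel()
+ print(f'measured sparsity: {zeros / total:.3f}')
+ ```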
+ 
+ How to use the wanda+gptq models:
+ ```python
+ from transformers import AutoTokenizer
+ from auto_gptq import AutoGPTQForCausalLM
+ 
+ model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
+ tokenizer_path = 'meta-llama/Llama-2-7b-hf'
+ model = AutoGPTQForCausalLM.from_quantized(
+     model_path,
+     # inject_fused_attention=False,  # alternatively
+     disable_exllama=True,
+     device_map='auto',
+ )
+ tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
+ input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
+ outputs = model.generate(input_ids=input_ids, max_length=128)
+ print(tokenizer.decode(outputs[0]))
+ ```
+ 
+ How to use the gptq models:
+ ```python
+ from transformers import AutoTokenizer
+ from auto_gptq import AutoGPTQForCausalLM
+ 
+ # model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
+ # tokenizer_path = 'meta-llama/Llama-2-7b-hf'
+ model_path = 'compressed-llm/vicuna-7b-v1.3_gptq'
+ tokenizer_path = 'lmsys/vicuna-7b-v1.3'
+ model = AutoGPTQForCausalLM.from_quantized(
+     model_path,
+     # inject_fused_attention=False,  # alternatively
+     disable_exllama=True,
+     device_map='auto',
+     revision='2bit_128g',  # the quantization config is selected via the branch name
+ )
+ tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
+ input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
+ outputs = model.generate(input_ids=input_ids, max_length=128)
+ print(tokenizer.decode(outputs[0]))
+ ```
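+ 
+ The pruned and quantized checkpoints go through different loaders (`AutoModelForCausalLM` vs. `AutoGPTQForCausalLM`). If you switch between them often, a small wrapper can help; this is a hypothetical convenience helper built only from the calls shown above, not part of the hub itself:
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from auto_gptq import AutoGPTQForCausalLM
+ 
+ def load_compressed(model_path, tokenizer_path, revision, quantized=False):
+     """Load a compressed checkpoint together with its tokenizer.
+ 
+     quantized=True  -> AutoGPTQ path (gptq and wanda+gptq models)
+     quantized=False -> plain transformers path (pruned models)
+     """
+     if quantized:
+         model = AutoGPTQForCausalLM.from_quantized(
+             model_path,
+             revision=revision,
+             disable_exllama=True,
+             device_map='auto',
+         )
+     else:
+         model = AutoModelForCausalLM.from_pretrained(
+             model_path,
+             revision=revision,
+             torch_dtype=torch.float16,
+             low_cpu_mem_usage=True,
+             device_map='auto',
+         )
+     tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
+     return model, tokenizer
+ 
+ # e.g. the 2-bit GPTQ Vicuna used above:
+ # model, tokenizer = load_compressed('compressed-llm/vicuna-7b-v1.3_gptq', 'lmsys/vicuna-7b-v1.3', revision='2bit_128g', quantized=True)
+ ```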
+ 
+ ## Citations
+ 
+ If you are using models from this hub, please consider citing our papers:
+ ```bibtex
+ @article{jaiswal2023emergence,
+   title={The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter},
+   author={Jaiswal, Ajay and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang},
+   journal={arXiv},
+   year={2023}
+ }
+ @article{jaiswal2023compressing,
+   title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
+   author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
+   journal={arXiv},
+   year={2023}
+ }
+ ```
+ 
+ For any questions, please contact [Junyuan Hong](mailto:[email protected]).
106