OPEA
GGUF · Inference Endpoints · conversational

cicdatopea committed · Commit eb85df1 · verified · 1 Parent(s): dfedb9d

Update README.md

Files changed (1): README.md (+57 −47)

README.md CHANGED
@@ -85,53 +85,63 @@ Please follow the [Build llama.cpp locally](https://github.com/ggerganov/llama.c
 
  **5×80 GB GPUs are needed (this could be optimized); 1.4 TB of CPU memory is needed.**
 
- We discovered that the inputs and outputs of certain layers in this model are very large and can even exceed the FP16 range when tested with a few prompts. It is recommended to exclude these layers from quantization (particularly the `down_proj` in layer 60) and to run them in BF16 precision instead. However, we have not done this for this INT4 model, because on CPU the compute dtype for INT4 is already BF16 or FP32.
-
- ~~~python
- model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
- model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
- model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
- model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
- model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
- model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
- model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
- model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
- ~~~
-
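As a rough sketch (not part of the original recipe), such an exclusion could be expressed through AutoRound's `layer_config`; the layer names below are taken from the log above, and treating a per-layer `bits` override of 16 as "leave in higher precision" is an assumption about AutoRound's per-layer options.

```python
# Sketch only: mark the outlier-prone projections so they are not quantized to INT4.
# Assumes AutoRound accepts a per-layer `bits` override in `layer_config`.
fp_layers = [
    "model.layers.60.mlp.experts.150.down_proj",
    "model.layers.60.mlp.experts.231.down_proj",
    "model.layers.60.mlp.experts.92.down_proj",
    "model.layers.60.mlp.shared_experts.down_proj",
]
layer_config = {name: {"bits": 16} for name in fp_layers}  # 16 bits = effectively unquantized
```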
- **1. Add metadata to the BF16 model** (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16)
-
- ~~~python
- import safetensors
- from safetensors.torch import save_file
-
- # Re-save every shard with a 'format' entry in its metadata header.
- for i in range(1, 164):
-     idx_str = "0" * (5 - len(str(i))) + str(i)
-     safetensors_path = f"model-{idx_str}-of-000163.safetensors"
-     print(safetensors_path)
-     tensors = dict()
-     with safetensors.safe_open(safetensors_path, framework="pt") as f:
-         for key in f.keys():
-             tensors[key] = f.get_tensor(key)
-     save_file(tensors, safetensors_path, metadata={'format': 'pt'})
- ~~~
-
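A quick way to confirm the metadata took effect (illustrative, not in the original README) is to reopen one shard and print its header metadata:

```python
# Illustrative check: the re-saved shard should now carry {'format': 'pt'} in its header.
import safetensors

with safetensors.safe_open("model-00001-of-000163.safetensors", framework="pt") as f:
    print(f.metadata())  # expected: {'format': 'pt'}
```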
- **2. Replace modeling_deepseek.py with the following file.** It mainly aligns devices and removes `torch.no_grad`, because AutoRound's tuning requires gradients.
-
- https://github.com/intel/auto-round/blob/deepseekv3/modeling_deepseek.py
-
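One way to do the swap (a sketch only: the raw URL is derived from the blob link above, and the destination path is taken from the tuning command in the next step) is:

```python
# Sketch: overwrite modeling_deepseek.py in the local BF16 checkout with the patched version.
import urllib.request

url = "https://raw.githubusercontent.com/intel/auto-round/deepseekv3/modeling_deepseek.py"
urllib.request.urlretrieve(url, "/models/DeepSeek-V3-bf16/modeling_deepseek.py")
```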
- **3. Tuning**
-
- ```bash
- git clone https://github.com/intel/auto-round.git && cd auto-round && git checkout deepseekv3
- ```
-
- ```bash
- python3 -m auto_round --model "/models/DeepSeek-V3-bf16/" --group_size 128 --format "gguf:q4_0" --iters 200 --devices 0,1,2,3,4 --nsamples 512 --batch_size 8 --seqlen 512 --low_gpu_mem_usage --output_dir "tmp_autoround" --disable_eval 2>&1 | tee -a seekv3.txt
+ pip3 install git+https://github.com/intel/auto-round.git
+
+ ```python
+ import torch
+ import transformers
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "DeepSeek-V3-hf"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto")
+
+ # Build a device map: routed-expert linear layers are spread over cuda:1 to cuda:4 by expert index,
+ # while shared experts and every other layer stay on cuda:0.
+ block = model.model.layers
+ device_map = {}
+ for n, m in block.named_modules():
+     if isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
+         if "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) < 63 and "down_proj" not in n:
+             device = "cuda:1"
+             output_device = "cuda:1"
+         elif "experts" in n and ("shared_experts" not in n) and "down_proj" in n and int(n.split('.')[-2]) < 63:
+             device = "cuda:1"
+             output_device = "cuda:0"
+         elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 63 and int(n.split('.')[-2]) < 128 and "down_proj" not in n:
+             device = "cuda:2"
+             output_device = "cuda:2"
+         elif "experts" in n and ("shared_experts" not in n) and "down_proj" in n and int(n.split('.')[-2]) >= 63 and int(n.split('.')[-2]) < 128:
+             device = "cuda:2"
+             output_device = "cuda:0"
+         elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 128 and int(n.split('.')[-2]) < 192 and "down_proj" not in n:
+             device = "cuda:3"
+             output_device = "cuda:3"
+         elif "experts" in n and ("shared_experts" not in n) and "down_proj" in n and int(n.split('.')[-2]) >= 128 and int(n.split('.')[-2]) < 192:
+             device = "cuda:3"
+             output_device = "cuda:0"
+         elif "experts" in n and ("shared_experts" not in n) and "down_proj" not in n and int(n.split('.')[-2]) >= 192:
+             device = "cuda:4"
+             output_device = "cuda:4"
+         elif "experts" in n and ("shared_experts" not in n) and "down_proj" in n and int(n.split('.')[-2]) >= 192:
+             device = "cuda:4"
+             output_device = "cuda:0"
+         else:
+             device = "cuda:0"
+             output_device = "cuda:0"
+         n = n[2:]
+         device_map.update({n: device})
+
+ from auto_round import AutoRound
+
+ layer_config = {}  # optional per-layer overrides; left empty so every layer uses the default 4-bit settings
+ autoround = AutoRound(model=model, tokenizer=tokenizer, layer_config=layer_config, device_map=device_map,
+                       iters=200, batch_size=8, seqlen=512)
+ autoround.quantize()
+ autoround.save_quantized(format="gguf:q4_0", output_dir="tmp_autoround")
  ```
146
 
147
  ## Ethical Considerations and Limitations