OPEA
GGUF · Inference Endpoints · conversational

cicdatopea committed · Commit eb85df1 · verified · 1 Parent(s): dfedb9d

Update README.md

Files changed (1): README.md (+57 −47)

README.md CHANGED
@@ -85,53 +85,63 @@ Please follow the [Build llama.cpp locally](https://github.com/ggerganov/llama.c
 
  **5×80 GB GPUs are needed (this could be optimized); 1.4 TB of CPU memory is needed.**
 
- We discovered that the inputs and outputs of certain layers in this model are very large and can even exceed the FP16 range when tested with a few prompts. It is recommended to exclude these layers from quantization (particularly the `down_proj` in layer 60) and to run them in BF16 precision instead. However, we have not done this for this INT4 model, because on CPU the compute dtype for INT4 is already BF16 or FP32.
-
- ~~~python
- model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
- model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
- model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
- model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
- model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
- model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
- model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
- model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
- ~~~
-
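As a rough sketch (not part of the original recipe), such an exclusion could be expressed through AutoRound's `layer_config`; the layer names below are taken from the log above, and treating a per-layer `bits` override of 16 as "leave in higher precision" is an assumption about AutoRound's per-layer options.

```python
# Sketch only: mark the outlier-prone projections so they are not quantized to INT4.
# Assumes AutoRound accepts a per-layer `bits` override in `layer_config`.
fp_layers = [
    "model.layers.60.mlp.experts.150.down_proj",
    "model.layers.60.mlp.experts.231.down_proj",
    "model.layers.60.mlp.experts.92.down_proj",
    "model.layers.60.mlp.shared_experts.down_proj",
]
layer_config = {name: {"bits": 16} for name in fp_layers}  # 16 bits = effectively unquantized
```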
- **1. Add metadata to the BF16 model** (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16)
-
- ~~~python
- import safetensors
- from safetensors.torch import save_file
-
- # Re-save every shard with a 'format' entry in its metadata header.
- for i in range(1, 164):
-     idx_str = "0" * (5 - len(str(i))) + str(i)
-     safetensors_path = f"model-{idx_str}-of-000163.safetensors"
-     print(safetensors_path)
-     tensors = dict()
-     with safetensors.safe_open(safetensors_path, framework="pt") as f:
-         for key in f.keys():
-             tensors[key] = f.get_tensor(key)
-     save_file(tensors, safetensors_path, metadata={'format': 'pt'})
- ~~~
-
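A quick way to confirm the metadata took effect (illustrative, not in the original README) is to reopen one shard and print its header metadata:

```python
# Illustrative check: the re-saved shard should now carry {'format': 'pt'} in its header.
import safetensors

with safetensors.safe_open("model-00001-of-000163.safetensors", framework="pt") as f:
    print(f.metadata())  # expected: {'format': 'pt'}
```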
- **2. Replace modeling_deepseek.py with the following file.** It mainly aligns devices and removes `torch.no_grad`, because AutoRound's tuning requires gradients.
-
- https://github.com/intel/auto-round/blob/deepseekv3/modeling_deepseek.py
-
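One way to do the swap (a sketch only: the raw URL is derived from the blob link above, and the destination path is taken from the tuning command in the next step) is:

```python
# Sketch: overwrite modeling_deepseek.py in the local BF16 checkout with the patched version.
import urllib.request

url = "https://raw.githubusercontent.com/intel/auto-round/deepseekv3/modeling_deepseek.py"
urllib.request.urlretrieve(url, "/models/DeepSeek-V3-bf16/modeling_deepseek.py")
```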
- **3. Tuning**
-
- ```bash
- git clone https://github.com/intel/auto-round.git && cd auto-round && git checkout deepseekv3
- ```
-
- ```bash
- python3 -m auto_round --model "/models/DeepSeek-V3-bf16/" --group_size 128 --format "gguf:q4_0" --iters 200 --devices 0,1,2,3,4 --nsamples 512 --batch_size 8 --seqlen 512 --low_gpu_mem_usage --output_dir "tmp_autoround" --disable_eval 2>&1 | tee -a seekv3.txt
+ pip3 install git+https://github.com/intel/auto-round.git
+
+ ```python
+ import torch
+ import transformers
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "DeepSeek-V3-hf"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto")
+
+ # Build a device map: routed-expert linear layers are spread over cuda:1 to cuda:4 by expert index,
+ # while shared experts and every other layer stay on cuda:0.
+ block = model.model.layers
+ device_map = {}
+ for n, m in block.named_modules():
+     if isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
+         if "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) < 63 and "down_proj" not in n:
+             device = "cuda:1"
+             output_device = "cuda:1"
+         elif "experts" in n and ("shared_experts" not in n) and "down_proj" in n and int(n.split('.')[-2]) < 63:
+             device = "cuda:1"
+             output_device = "cuda:0"
+         elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 63 and int(n.split('.')[-2]) < 128 and "down_proj" not in n:
+             device = "cuda:2"
+             output_device = "cuda:2"
+         elif "experts" in n and ("shared_experts" not in n) and "down_proj" in n and int(n.split('.')[-2]) >= 63 and int(n.split('.')[-2]) < 128:
+             device = "cuda:2"
+             output_device = "cuda:0"
+         elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 128 and int(n.split('.')[-2]) < 192 and "down_proj" not in n:
+             device = "cuda:3"
+             output_device = "cuda:3"
+         elif "experts" in n and ("shared_experts" not in n) and "down_proj" in n and int(n.split('.')[-2]) >= 128 and int(n.split('.')[-2]) < 192:
+             device = "cuda:3"
+             output_device = "cuda:0"
+         elif "experts" in n and ("shared_experts" not in n) and "down_proj" not in n and int(n.split('.')[-2]) >= 192:
+             device = "cuda:4"
+             output_device = "cuda:4"
+         elif "experts" in n and ("shared_experts" not in n) and "down_proj" in n and int(n.split('.')[-2]) >= 192:
+             device = "cuda:4"
+             output_device = "cuda:0"
+         else:
+             device = "cuda:0"
+             output_device = "cuda:0"
+         n = n[2:]
+         device_map.update({n: device})
+
+ from auto_round import AutoRound
+
+ layer_config = {}  # optional per-layer overrides; left empty so every layer uses the default 4-bit settings
+ autoround = AutoRound(model=model, tokenizer=tokenizer, layer_config=layer_config, device_map=device_map,
+                       iters=200, batch_size=8, seqlen=512)
+ autoround.quantize()
+ autoround.save_quantized(format="gguf:q4_0", output_dir="tmp_autoround")
  ```
146
 
147
  ## Ethical Considerations and Limitations