SeerAttention committed e2e579d (verified) · 1 parent: 537920b

Update README.md

Files changed (1): README.md (+20 -12)
README.md CHANGED

---
license: mit
library_name: transformers
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
base_model_relation: "adapter"
---

This repository contains only the AttnGates' weights for the deepseek-ai/DeepSeek-R1-Distill-Qwen-14B model, and the gates are applied only during decoding. Note that the current inference framework is unoptimized and intended only for accuracy tests.
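
As a rough, hedged illustration of how these adapter-style gate weights might be fetched from the Hub (the `repo_id` below is a placeholder, and the actual SeerAttention loading and inference code lives in the project repository):

```python
# Hypothetical sketch: download only the AttnGate adapter weights.
# The repo_id is a placeholder; the base model
# deepseek-ai/DeepSeek-R1-Distill-Qwen-14B is loaded separately.
from huggingface_hub import snapshot_download

gate_dir = snapshot_download(repo_id="SeerAttention/<this-repo-id>")  # placeholder
print("AttnGate weights downloaded to:", gate_dir)
```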

[SeerAttention](https://arxiv.org/pdf/2410.13276) introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the block-wise attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying a threshold or TopK to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
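
To make the threshold/TopK step concrete, here is a minimal, hypothetical sketch (not the project's actual kernel code) of turning soft AttnGate scores into a block-sparse binary mask; the tensor shapes, helper name, and default threshold are illustrative assumptions:

```python
# Illustrative sketch: convert soft block-level gate scores into a binary
# block-sparse mask via a threshold or a per-row TopK selection.
# Shapes and the default threshold are assumptions, not official values.
from typing import Optional

import torch


def block_mask_from_gate_scores(scores: torch.Tensor,
                                threshold: float = 0.005,
                                topk: Optional[int] = None) -> torch.Tensor:
    """scores: [num_heads, num_query_blocks, num_kv_blocks] soft gate scores.

    Returns a boolean mask of the same shape marking which KV blocks each
    query block attends to."""
    if topk is not None:
        # TopK mode: keep the k highest-scoring KV blocks per query block.
        idx = scores.topk(topk, dim=-1).indices
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(-1, idx, torch.ones_like(idx, dtype=torch.bool))
    else:
        # Threshold mode: keep every block whose soft score exceeds the threshold.
        mask = scores >= threshold
    return mask


# Example: 8 heads, 16 query blocks, 16 KV blocks of normalized soft scores.
scores = torch.rand(8, 16, 16).softmax(dim=-1)
mask = block_mask_from_gate_scores(scores, threshold=0.005)
print("kept block fraction:", mask.float().mean().item())
```

In threshold mode, raising the threshold keeps fewer blocks and therefore increases sparsity (as in the AIME results below), while TopK instead fixes the number of attended KV blocks per query block.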

## AIME

| Threshold    | 0.005 | 0.001  | Dense |
|--------------|-------|--------|-------|
| Accuracy (%) | 73.33 | 73.33  | 70    |
| Sparsity     | 86%   | 61.68% | 0%    |