---
license: mit
library_name: transformers
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
base_model_relation: "adapter"
---

This repo only contains the AttnGate weights for the deepseek-ai/DeepSeek-R1-Distill-Qwen-14B model; the gates are used only during decoding. Note that the current inference framework is unoptimized and intended only for accuracy tests.
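
These gate weights are meant to be combined with the base model by the SeerAttention inference code (not included here). As a rough, unofficial sketch of simply fetching and inspecting the gate tensors, the snippet below uses `huggingface_hub` and `safetensors`; the repo id is a placeholder for this repository's actual Hub id, and the `*.safetensors` file pattern is an assumption about how the weights are stored.

```python
import glob
import os

from huggingface_hub import snapshot_download
from safetensors.torch import load_file

# Placeholder repo id -- replace with this repository's actual Hub id.
REPO_ID = "your-org/SeerAttention-AttnGates-DeepSeek-R1-Distill-Qwen-14B"

# Download the repo locally (weights, config, etc.).
local_dir = snapshot_download(REPO_ID)

# Assumes the AttnGate weights are stored as safetensors files.
for path in sorted(glob.glob(os.path.join(local_dir, "*.safetensors"))):
    state_dict = load_file(path)
    for name, tensor in state_dict.items():
        print(f"{name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")
```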

[SeerAttention](https://arxiv.org/pdf/2410.13276) introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the block-wise attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying threshold/TopK to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
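
To illustrate the threshold/TopK step described above, here is a minimal PyTorch sketch (not the official SeerAttention kernel) that converts per-block gate scores into a binary block mask. The tensor layout `[num_heads, num_q_blocks, num_k_blocks]` and the helper name `block_sparse_mask` are illustrative assumptions; in the real pipeline, the resulting mask would select which key/value blocks each query block attends to inside the block-sparse FlashAttention kernel.

```python
from typing import Optional

import torch


def block_sparse_mask(gate_scores: torch.Tensor,
                      threshold: Optional[float] = None,
                      topk: Optional[int] = None) -> torch.Tensor:
    """Convert soft AttnGate scores into a block-level binary mask.

    gate_scores: [num_heads, num_q_blocks, num_k_blocks] soft scores.
    Exactly one of `threshold` / `topk` should be given.
    """
    if threshold is not None:
        # Keep every key block whose gate score exceeds the threshold.
        return gate_scores > threshold
    if topk is not None:
        # Keep only the top-k key blocks per (head, query block).
        idx = gate_scores.topk(topk, dim=-1).indices
        mask = torch.zeros_like(gate_scores, dtype=torch.bool)
        mask.scatter_(-1, idx, True)
        return mask
    raise ValueError("set either `threshold` or `topk`")


# Toy usage: 4 heads, 8x8 blocks of simulated gate scores.
scores = torch.rand(4, 8, 8).softmax(dim=-1)
mask = block_sparse_mask(scores, threshold=0.005)
print(f"block sparsity: {1.0 - mask.float().mean().item():.2%}")
```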

## AIME

| Threshold | 0.005 | 0.001  | Dense |
|-----------|-------|--------|-------|
| Acc (%)   | 73.33 | 73.33  | 70    |
| Sparsity  | 86%   | 61.68% | 0%    |