SeerAttention committed e2e579d (verified) · 1 parent: 537920b

Update README.md

Files changed (1): README.md (+20 -12)
README.md CHANGED

---
license: mit
library_name: transformers
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
base_model_relation: "adapter"
---

This repository contains only the AttnGates' weights for the deepseek-ai/DeepSeek-R1-Distill-Qwen-14B model, and the gates are applied only during decoding. Note that the current inference framework is unoptimized and intended only for accuracy tests.
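
As a rough, hedged illustration of how these adapter-style gate weights might be fetched from the Hub (the `repo_id` below is a placeholder, and the actual SeerAttention loading and inference code lives in the project repository):

```python
# Hypothetical sketch: download only the AttnGate adapter weights.
# The repo_id is a placeholder; the base model
# deepseek-ai/DeepSeek-R1-Distill-Qwen-14B is loaded separately.
from huggingface_hub import snapshot_download

gate_dir = snapshot_download(repo_id="SeerAttention/<this-repo-id>")  # placeholder
print("AttnGate weights downloaded to:", gate_dir)
```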

[SeerAttention](https://arxiv.org/pdf/2410.13276) introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the block-wise attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying a threshold or TopK to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
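
To make the threshold/TopK step concrete, here is a minimal, hypothetical sketch (not the project's actual kernel code) of turning soft AttnGate scores into a block-sparse binary mask; the tensor shapes, helper name, and default threshold are illustrative assumptions:

```python
# Illustrative sketch: convert soft block-level gate scores into a binary
# block-sparse mask via a threshold or a per-row TopK selection.
# Shapes and the default threshold are assumptions, not official values.
from typing import Optional

import torch


def block_mask_from_gate_scores(scores: torch.Tensor,
                                threshold: float = 0.005,
                                topk: Optional[int] = None) -> torch.Tensor:
    """scores: [num_heads, num_query_blocks, num_kv_blocks] soft gate scores.

    Returns a boolean mask of the same shape marking which KV blocks each
    query block attends to."""
    if topk is not None:
        # TopK mode: keep the k highest-scoring KV blocks per query block.
        idx = scores.topk(topk, dim=-1).indices
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(-1, idx, torch.ones_like(idx, dtype=torch.bool))
    else:
        # Threshold mode: keep every block whose soft score exceeds the threshold.
        mask = scores >= threshold
    return mask


# Example: 8 heads, 16 query blocks, 16 KV blocks of normalized soft scores.
scores = torch.rand(8, 16, 16).softmax(dim=-1)
mask = block_mask_from_gate_scores(scores, threshold=0.005)
print("kept block fraction:", mask.float().mean().item())
```

In threshold mode, raising the threshold keeps fewer blocks and therefore increases sparsity (as in the AIME results below), while TopK instead fixes the number of attended KV blocks per query block.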

## AIME

| Threshold    | 0.005 | 0.001  | Dense |
|--------------|-------|--------|-------|
| Accuracy (%) | 73.33 | 73.33  | 70    |
| Sparsity     | 86%   | 61.68% | 0%    |