pipeline_tag: text-generation
---

# Model Card for sparsing-law-0.8b-relu

- **Paper:** [paper](https://arxiv.org/pdf/2411.02335)
- **Repository containing relevant code:** [github](https://github.com/thunlp/SparsingLaw)

### Introduction

The model is one of the key checkpoints used for most analyses in the paper *Sparsing Law: Towards Large Language Models with Greater Activation Sparsity*.
It is ReLU-activated and contains approximately 0.8 billion non-embedding parameters.
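
Since activation sparsity, i.e. the fraction of zero entries in the post-ReLU intermediate activations, is the central quantity studied in the paper, the following is a minimal PyTorch sketch of how it can be measured. The tensor shape and names are illustrative, not the model's actual module layout:

```python
import torch

def relu_activation_sparsity(pre_activation: torch.Tensor) -> float:
    """Fraction of exactly-zero entries after applying ReLU."""
    activated = torch.relu(pre_activation)
    return (activated == 0).float().mean().item()

# Illustrative stand-in for one FFN layer's pre-activation output,
# shaped (batch, sequence_length, intermediate_size).
pre_activation = torch.randn(2, 16, 4096)
print(f"sparsity: {relu_activation_sparsity(pre_activation):.3f}")  # ~0.5 for Gaussian input
```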

The model was trained from scratch using the pre-training dataset described in our paper, with the WSD (Warmup-Stable-Decay) learning rate scheduler.
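
As a rough illustration of the schedule's shape, here is a minimal sketch of a WSD scheduler, assuming linear warmup, a constant stable phase, and linear decay; the exact curves and step counts used in the paper may differ:

```python
def wsd_lr(step: int, max_lr: float, warmup_steps: int,
           stable_steps: int, decay_steps: int) -> float:
    """Warmup-Stable-Decay: ramp up, hold at max_lr, then anneal to zero."""
    if step < warmup_steps:                        # warmup: 0 -> max_lr
        return max_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:         # stable: hold max_lr
        return max_lr
    decayed = step - warmup_steps - stable_steps   # decay: max_lr -> 0
    return max_lr * max(0.0, 1.0 - decayed / decay_steps)
```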
|
| 20 |
Note that it is a base model derived from the last checkpoint of the stable pre-training stage, which has not undergone the decay or SFT stage.
|
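
For completeness, a minimal usage sketch with Hugging Face `transformers`; the repo id below is a placeholder for this card's actual model path, and `trust_remote_code` is assumed in case the architecture is custom:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "thunlp/sparsing-law-0.8b-relu"  # placeholder: substitute this card's actual repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Base (pre-SFT) checkpoint: use plain continuation prompts, not chat templates.
inputs = tokenizer("Activation sparsity in large language models", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```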