---
title: README
emoji: πŸš€
colorFrom: red
colorTo: yellow
sdk: static
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/673ab3647afcea17eb4378fd/XKKbARx5zCfggwT6LbCfo.jpeg
---

<center>
    <img src="https://huggingface.co/spaces/SmallDoge/README/resolve/main/org_icon.png" alt="Doge" width="1080" height="720">
</center>

# SmallDoge

This is the home of the SmallDoge family of small language models, where we release a series of high-quality and ultra-fast small language models based on dynamic algorithms. All training details and code are publicly available on the [small-doge repository](https://github.com/SmallDoges/small-doge).

Join our discord [here](https://discord.gg/P2yYH95N).

We have released:

- [Doge-SLM](https://huggingface.co/collections/SmallDoge/doge-slm-679cc991f027c4a3abbded4a): A series of small language models, including pre-trained, supervised fine-tuned, and reinforcement-learning fine-tuned models.
- [Doge-CheckPoint](https://huggingface.co/collections/SmallDoge/doge-checkpoint-679ce95dae1d498e3ad35068): A series of checkpoint weights that can resume training on new datasets without loss spikes.
- [Doge-Downstream-Applications](https://huggingface.co/collections/SmallDoge/doge-downstream-applications-679ce627a0b7820e04ca22bd): A series of small language models for downstream applications.
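
To try one of the released checkpoints, the standard `transformers` loading flow is enough. The snippet below is a minimal usage sketch; the checkpoint name is an illustrative example, so swap in whichever SmallDoge model from the collections above you want to run.

```python
# Minimal usage sketch: load a SmallDoge checkpoint from the Hub and generate text.
# The model id below is an example; replace it with any checkpoint from the collections above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SmallDoge/Doge-20M-Instruct"  # example checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Hello, Doge!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```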

<center>
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/model_doc/doge_architecture.png" alt="drawing" width="600"/>
</center>

As shown in the figure above, the sequence-transformation part of the Doge architecture uses `Dynamic Mask Attention`, which can be understood as self-attention whose mask depends on the value states during training, and as a state space without past-state decay during inference, addressing the tendency of existing Transformers and SSMs to get lost in long contexts. The state-transformation part uses a `Cross Domain Mixture of Experts`, which combines dense linear layers with sparse embedding layers; additional sparse parameters can be added to continue training from dense weight checkpoints without retraining the entire model, reducing the cost of continuously iterating on the model. In addition, Doge uses `RMSNorm` and residual connections with learnable parameters to adapt the gradient range of deep models.
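
For intuition only, here is a minimal, self-contained PyTorch sketch of the two ideas above, not the actual Doge implementation: an attention layer whose mask is computed from the value states, and a feed-forward block that adds a sparse, routed bank of expert embeddings on top of a dense MLP. Class names, shapes, and gating choices are illustrative assumptions; the real definitions live in the [small-doge repository](https://github.com/SmallDoges/small-doge).

```python
# Illustrative sketches only; parameter names and gating functions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicMaskAttentionSketch(nn.Module):
    """Self-attention with a content-dependent mask derived from the value states."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)
        # Hypothetical projection: one dynamic-mask logit per head and key position.
        self.dt_proj = nn.Linear(hidden_size, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        # Content-dependent mask: computed from the hidden states feeding the values,
        # broadcast over query positions and added to the attention scores.
        dyn_mask = self.dt_proj(x).permute(0, 2, 1).unsqueeze(2)  # (b, heads, 1, t_key)

        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=x.device), diagonal=1
        )
        scores = scores + causal + dyn_mask

        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.o_proj(out)


class CrossDomainMoESketch(nn.Module):
    """Dense MLP plus a sparse, top-k routed bank of expert embeddings.

    The sparse part can be added on top of an existing dense checkpoint,
    matching the continued-training idea described above."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int = 64, top_k: int = 4):
        super().__init__()
        # Dense part (initialised from a dense weight checkpoint in practice).
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.down = nn.Linear(intermediate_size, hidden_size)
        # Sparse part: a router plus an expert embedding table.
        self.router = nn.Linear(hidden_size, num_experts)
        self.expert_embed = nn.Embedding(num_experts, hidden_size)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dense_out = self.down(F.silu(self.up(x)))
        scores = self.router(x)                                 # (b, t, num_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)   # (b, t, k)
        weights = F.softmax(top_scores, dim=-1).unsqueeze(-1)   # (b, t, k, 1)
        experts = self.expert_embed(top_idx)                    # (b, t, k, hidden)
        sparse_out = (weights * experts).sum(dim=-2)
        return dense_out + sparse_out
```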