---
library_name: transformers
license: cc-by-nc-sa-4.0
---
# Sewy2 (untrained) 640M
Sewy2 is a new MoE architecture that combines the following:
- DeepseekV3
- nGPT
- ResFormer
- NeuTRENO (as in ResFormer)
- Tanh logit softcapping (as in Gemma2; see the sketch below)
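For reference, tanh logit softcapping as used in Gemma2 smoothly bounds logits to the range (-cap, cap) instead of letting them grow unbounded. A minimal sketch (the `soft_cap` value here is illustrative, not the one used in this model):

```python
import torch

def softcap_logits(logits: torch.Tensor, soft_cap: float = 30.0) -> torch.Tensor:
    # Gemma2-style softcapping: cap * tanh(logits / cap) keeps the
    # logits in (-soft_cap, soft_cap) while staying differentiable.
    return soft_cap * torch.tanh(logits / soft_cap)
```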
Architecture:
- 32 layers
- 32 attention heads
- 32 KV heads
- 64 experts
- 8 experts per token (see the routing sketch below)
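To make the expert settings concrete, the sketch below shows a generic top-k router selecting 8 of the 64 experts for each token. The function and variable names are illustrative assumptions, not the actual Sewy2 implementation:

```python
import torch
import torch.nn.functional as F

NUM_EXPERTS = 64       # experts per MoE layer, as listed above
EXPERTS_PER_TOKEN = 8  # top-k experts activated for each token

def route_tokens(hidden: torch.Tensor, router_weight: torch.Tensor):
    # hidden: (num_tokens, hidden_dim); router_weight: (hidden_dim, NUM_EXPERTS)
    router_logits = hidden @ router_weight                          # (num_tokens, NUM_EXPERTS)
    probs = F.softmax(router_logits, dim=-1)
    topk_probs, topk_ids = probs.topk(EXPERTS_PER_TOKEN, dim=-1)    # pick 8 experts per token
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize gate weights
    return topk_probs, topk_ids
```

Each token's output would then be the gate-weighted sum of the 8 selected experts' outputs; the exact gating and load-balancing details follow the DeepseekV3-style MoE referenced above.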