---
library_name: transformers
license: cc-by-nc-sa-4.0
---

# Sewy2 (untrained) 640m

## Sewy2 is a new MoE architecture that combines the following:

- DeepseekV3
- nGPT
- ResFormer
- NeuTRENO (as in ResFormer)
- Tanh logit softcapping (as in Gemma2)

## Architecture:

- 32 layers
- 32 attention heads
- 32 KV heads
- 64 experts
- 8 experts per token (see the sketch after this list)
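
The snippet below is a minimal sketch, not the actual Sewy2 implementation, illustrating two of the listed pieces with the numbers from the architecture list: a simple top-k router that picks 8 of 64 experts per token, and Gemma2-style tanh logit softcapping. The class and function names and the softcap value are assumptions for illustration only; DeepseekV3's real routing (shared experts, bias-based load balancing) is more involved than this.

```python
# Hypothetical sketch of 8-of-64 expert routing and tanh logit softcapping.
# Not the official Sewy2 code; names and the softcap value are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EXPERTS = 64        # total routed experts per MoE layer (from the list above)
EXPERTS_PER_TOKEN = 8   # top-k experts activated per token (from the list above)
LOGIT_SOFTCAP = 30.0    # assumed value; Gemma2 uses 30.0 for its final logits


def softcap_logits(logits: torch.Tensor, cap: float = LOGIT_SOFTCAP) -> torch.Tensor:
    """Tanh logit softcapping as in Gemma2: smoothly bounds logits to (-cap, cap)."""
    return cap * torch.tanh(logits / cap)


class TopKRouter(nn.Module):
    """Toy top-k router: scores all experts, keeps the top EXPERTS_PER_TOKEN per token."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, NUM_EXPERTS, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size)
        router_logits = self.gate(hidden_states)                        # (B, T, 64)
        weights, indices = torch.topk(router_logits, EXPERTS_PER_TOKEN, dim=-1)
        weights = F.softmax(weights, dim=-1)                            # normalize over the 8 selected experts
        return weights, indices
```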