---
library_name: transformers
license: cc-by-nc-sa-4.0
---

# Sewy2 (untrained) 640M

Sewy2 is a new MoE architecture that combines the following components:

- DeepseekV3
- nGPT
- ResFormer
- NeuTRENO (as in ResFormer)
- Tanh logit softcapping, as in Gemma2 (see the sketch after this list)
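
As a brief illustration, tanh logit softcapping smoothly bounds logits to the range (-cap, cap) before the softmax, as done in Gemma2. The sketch below is a minimal, generic version; the cap value and where exactly Sewy2 applies it (attention scores, final logits, or both) are assumptions, not taken from this model's code.

```python
import torch

def softcap_logits(logits: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    """Tanh softcapping: keeps logits in (-cap, cap) while staying smooth
    and differentiable. The default cap is illustrative, not Sewy2's value."""
    return cap * torch.tanh(logits / cap)

# Example: capping attention scores before the softmax.
scores = torch.randn(1, 32, 16, 16)          # (batch, heads, q_len, k_len)
attn = torch.softmax(softcap_logits(scores), dim=-1)
```
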

## Architecture

- 32 layers
- 32 attention heads
- 32 KV heads
- 64 experts
- 8 experts per token (see the config sketch below)
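
A hypothetical config sketch mirroring the numbers above; the real Sewy2 config class and field names are not published here, so the names below (e.g. `num_experts_per_tok`) simply follow common `transformers` MoE conventions.

```python
# Illustrative only: field names are assumptions, values come from the list above.
sewy2_config = dict(
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=32,   # as many KV heads as query heads (no GQA reduction)
    n_routed_experts=64,
    num_experts_per_tok=8,
)
```
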