SimonX committed
Commit e1b7ee9 · verified · 1 parent: a98fb88

Update README.md

Files changed (1): README.md (+4, -0)
README.md CHANGED
@@ -22,6 +22,7 @@ The model has hybrid architecture with Mamba and Attention heads running in para
 
 This model is ready for commercial use.
 
+
 **[Caution] During generation, the batch size needs to be 1. Our current implementation does not fully support padding of Meta tokens + SWA; this is a work in progress. Training and pre-filling support any batch size.**
 
 
@@ -35,6 +36,8 @@ This model is released under the [NVIDIA Open Model License Agreement](https://d
 
 ## Model Architecture
 
+> ⚡️ We've released a minimal implementation of Hymba on GitHub to help developers understand and implement its design principles in their own models. Check it out! [barebones-hymba](https://github.com/NVlabs/hymba/tree/main/barebones_hymba).
+
 Hymba-1.5B-Base has a model embedding size of 1600, 25 attention heads, and an MLP intermediate dimension of 5504, with 32 layers in total, 16 SSM states, and 3 full attention layers; the rest are sliding window attention layers. Unlike the standard Transformer, each attention layer in Hymba has a hybrid combination of standard attention heads and Mamba heads in parallel. Additionally, it uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).
 
 Features of this architecture:
@@ -54,6 +57,7 @@ Features of this architecture:
 </div>
 
 
+
 ## Performance Highlights
 - Hymba-1.5B-Base outperforms all sub-2B public models.
 
 
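To make the batch-size-1 caution in the first hunk concrete, here is a minimal generation sketch. It assumes the checkpoint is published on the Hugging Face Hub as `nvidia/Hymba-1.5B-Base` and ships custom modeling code (hence `trust_remote_code=True`); the repository id and loading details are assumptions rather than part of this commit. The key point is that the prompt is passed as a single, unpadded sequence, while training and pre-filling can use any batch size.

```python
# Minimal sketch (not from this commit): generate with a single, unpadded prompt,
# per the batch-size-1 caution in the README diff above. The repository id and
# the trust_remote_code requirement are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "nvidia/Hymba-1.5B-Base"  # assumed Hub location of this checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype="auto", trust_remote_code=True
).to(device).eval()

# One prompt at a time: with a single sequence there is no padding of
# meta tokens + SWA to worry about during generation.
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```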
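The architecture paragraph in the second hunk lists the model's key hyperparameters. The sketch below simply restates them as a plain Python dataclass so the derived numbers are easy to check; the field names are hypothetical and do not mirror Hymba's actual configuration class.

```python
# Illustrative only: the hyperparameters stated in the architecture paragraph,
# collected in a hypothetical dataclass (field names do not mirror the real
# Hymba configuration class).
from dataclasses import dataclass

@dataclass
class HymbaBaseHyperparams:
    hidden_size: int = 1600              # model embedding size
    num_attention_heads: int = 25        # attention heads per layer
    intermediate_size: int = 5504        # MLP intermediate dimension
    num_layers: int = 32                 # total layers
    ssm_state_size: int = 16             # SSM states for the Mamba heads
    num_full_attention_layers: int = 3   # remaining layers use sliding window attention

cfg = HymbaBaseHyperparams()

# If the attention heads evenly split the embedding dimension (a common
# convention, assumed here), each head is 1600 / 25 = 64-dimensional.
head_dim = cfg.hidden_size // cfg.num_attention_heads
num_swa_layers = cfg.num_layers - cfg.num_full_attention_layers  # 32 - 3 = 29

print(f"head_dim={head_dim}, sliding-window attention layers={num_swa_layers}")
```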