Update README.md
README.md CHANGED
@@ -116,20 +116,20 @@ For training data details, please see the [GRAG-SFT-Dataset](https://huggingface
 ### Architecture
 
 
-
-
-| d_model
-| num heads
-| num layers
-| MLP ratio
-| LayerNorm type
-| pos embeddings
-| attention variant
-| biases
-| block type
-| activation
-| sequence length
-| weight tying
+| Parameter              | GRAG-PHI-SFT                                                   |
+|------------------------|----------------------------------------------------------------|
+| **d_model**            | 3072                                                           |
+| **num heads**          | 32                                                             |
+| **num layers**         | 32                                                             |
+| **MLP ratio**          | 2.66                                                           |
+| **LayerNorm type**     | RMSNorm                                                        |
+| **pos embeddings**     | RoPE                                                           |
+| **attention variant**  | Standard Multi-Head Self-Attention with sliding window of 2047 |
+| **biases**             | none                                                           |
+| **block type**         | sequential                                                     |
+| **activation**         | SiLU                                                           |
+| **sequence length**    | 131072                                                         |
+| **weight tying**       | bfloat16                                                       |
 
 ### Hyperparameters
 
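These architecture values correspond to standard 🤗 Transformers config fields. The sketch below is an illustration only, not part of the commit: the repo id `avemio/GRAG-PHI-SFT` is a placeholder, and the field names assume a common Phi-style config rather than anything confirmed by this README.

```python
from transformers import AutoConfig

# Placeholder repo id -- substitute the actual GRAG-PHI-SFT repository.
config = AutoConfig.from_pretrained("avemio/GRAG-PHI-SFT")

# Field names assume a Phi-style config (an assumption, not confirmed by the commit).
print(config.hidden_size)                             # d_model, expected 3072
print(config.num_attention_heads)                     # num heads, expected 32
print(config.num_hidden_layers)                       # num layers, expected 32
print(config.intermediate_size / config.hidden_size)  # MLP ratio, expected ~2.66
print(config.sliding_window)                          # sliding-window size, expected 2047
print(config.max_position_embeddings)                 # sequence length, expected 131072
print(config.hidden_act)                              # activation, expected "silu"
print(config.tie_word_embeddings)                     # weight-tying flag
```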