Update README.md
README.md
CHANGED
@@ -91,14 +91,13 @@ This is the model card for the 300M-parameter **MrT5 Small** (`mrt5-small`), a m
### Model Architecture

MrT5 Small uses the model configuration of the standard ByT5 Small, which has a feed-forward dimensionality of 3584, a model dimensionality of 1472, 12 encoder layers, 4 decoder layers, 6 attention heads in each layer, and 300M total parameters.
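
For reference, these dimensions can be written out as a T5-style configuration. The sketch below uses Hugging Face's generic `T5Config` rather than MrT5's own config class (the delete-gate fields are omitted), and the byte-level vocabulary size is taken from the standard ByT5 config rather than from this card:

```python
from transformers import T5Config

# ByT5 Small dimensions as stated above, expressed with a generic T5Config.
# This is only a sketch: MrT5's own config additionally specifies the
# delete-gate placement and deletion rate, which are not standard T5 fields.
byt5_small_like = T5Config(
    d_model=1472,            # model dimensionality
    d_ff=3584,               # feed-forward dimensionality
    num_layers=12,           # encoder layers
    num_decoder_layers=4,    # decoder layers
    num_heads=6,             # attention heads per layer
    vocab_size=384,          # byte-level vocabulary used by ByT5
    feed_forward_proj="gated-gelu",
)
```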

MrT5 has an additional *delete gate*, which dynamically reduces the encoder sequence length. In this model, it is placed after the third encoder layer, and all subsequent layers operate on a reduced sequence. This model was trained with a deletion rate of δ=0.5, which means that the model reduces its encoder sequence length by ~50% after the third layer. MrT5's gating mechanism introduces only an additional 3,000 parameters.
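
For intuition, here is a minimal PyTorch sketch of how a per-position gate can shrink an encoder sequence. The class name, the single linear projection, and the hard 0.5 threshold are illustrative assumptions, not MrT5's actual implementation:

```python
import torch
import torch.nn as nn

class DeleteGateSketch(nn.Module):
    """Illustrative per-position gate; names and threshold are assumptions."""

    def __init__(self, d_model: int):
        super().__init__()
        # A single per-position projection has on the order of d_model parameters,
        # consistent with the gate adding only a few thousand parameters overall.
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden_states: torch.Tensor, threshold: float = 0.5):
        # hidden_states: (batch, seq_len, d_model), e.g. the output of encoder layer 3
        gate = torch.sigmoid(self.proj(hidden_states)).squeeze(-1)  # (batch, seq_len)
        keep = gate >= threshold
        # Hard-drop gated-out positions (batch size 1 for simplicity); later
        # encoder layers would then attend over this shorter sequence.
        reduced = hidden_states[:, keep[0], :]
        return reduced, keep

x = torch.randn(1, 16, 1472)              # toy batch with ByT5 Small's d_model
reduced, keep = DeleteGateSketch(1472)(x)
print(x.shape, "->", reduced.shape)
```
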
MrT5 Small is initialized from ByT5 Small and is fine-tuned on the same training objective. Only MrT5's delete gate is randomly initialized before training.

The other distinguishing feature of MrT5 is that it uses [softmax1](https://www.evanmiller.org/attention-is-off-by-one.html) in its attention mechanism.
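
Concretely, softmax1 divides by one plus the sum of exponentials, so the attention weights can sum to less than 1 and a head is allowed to put almost no weight anywhere. Below is a small, numerically stable sketch of the function (my own implementation, not code from the MrT5 repository):

```python
import torch

def softmax1(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)).

    Unlike standard softmax, the outputs can sum to less than 1, so an
    attention head can assign (almost) zero total weight.
    """
    # Shift by the (non-negative) max for numerical stability; the implicit
    # extra zero logit becomes exp(-m) in the denominator.
    m = scores.max(dim=dim, keepdim=True).values.clamp(min=0.0)
    exp_scores = torch.exp(scores - m)
    return exp_scores / (torch.exp(-m) + exp_scores.sum(dim=dim, keepdim=True))

scores = torch.tensor([[2.0, -1.0, 0.5], [-9.0, -9.0, -9.0]])
weights = softmax1(scores)
print(weights.sum(dim=-1))  # each row sums to less than 1; the second is near 0
```
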
## Uses
@@ -160,9 +159,9 @@ When training on the span corruption objective, we calculate the corrupted spans
#### Optimization

MrT5 is trained for 5,000 gradient steps over batches of 2^20 tokens (i.e., an encoder sequence length of 1024 with an effective batch size of 1024). We use the AdamW optimizer with an initial learning rate of 1e-4 with linear decay and no warmup.
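
These settings are straightforward to mirror in PyTorch. The sketch below uses a placeholder model and loss; how the effective batch of 2^20 tokens is assembled (e.g. gradient accumulation or data parallelism) is an assumption about the setup, not something stated above:

```python
import torch

model = torch.nn.Linear(1472, 1472)   # placeholder standing in for MrT5 Small
total_steps = 5_000

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Linear decay from 1e-4 to 0 over 5,000 steps, with no warmup.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
)

for step in range(total_steps):
    x = torch.randn(8, 1472)
    loss = model(x).pow(2).mean()      # placeholder for the span-corruption loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```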

To achieve a specific sequence length reduction rate, we use a PI-controller with a target deletion ratio of δ=0.5, as described in Section 3.2 of the paper. We also use attention score regularization, as described in Appendix D of the paper.
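
The controller itself can be very small. This sketch shows only the general proportional-integral form; the gain values, the sign convention, and the idea of using the output as the weight on the deletion regularizer are illustrative assumptions here, with the actual formulation given in Section 3.2 of the paper:

```python
class PIControllerSketch:
    """Steers the observed deletion ratio toward a target (illustrative only)."""

    def __init__(self, target: float = 0.5, k_p: float = 0.5, k_i: float = 0.05):
        self.target = target
        self.k_p = k_p        # proportional gain (assumed value)
        self.k_i = k_i        # integral gain (assumed value)
        self.integral = 0.0

    def update(self, observed_ratio: float) -> float:
        # Positive error means the model is currently deleting too few tokens.
        error = self.target - observed_ratio
        self.integral += error
        # Use the output as the weight on the deletion regularizer: a larger
        # weight pushes the gate to delete more in subsequent steps.
        return max(0.0, self.k_p * error + self.k_i * self.integral)

controller = PIControllerSketch(target=0.5)
alpha = controller.update(observed_ratio=0.3)  # deleting only 30% -> raise the weight
```
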
## Environmental Impact