Update README.md
README.md
CHANGED
@@ -91,14 +91,13 @@ This is the model card for the 300M-parameter **MrT5 Small** (`mrt5-small`), a m
### Model Architecture

MrT5 Small uses the model configuration of the standard ByT5 Small, which has a feed-forward dimensionality of 3584, a model dimensionality of 1472, 12 encoder layers, 4 decoder layers, 6 attention heads in each layer, and 300M total parameters.
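
For reference, these dimensions can be written out as a T5-style configuration. The sketch below uses Hugging Face's generic `T5Config` rather than MrT5's own config class (the delete-gate fields are omitted), and the byte-level vocabulary size is taken from the standard ByT5 config rather than from this card:

```python
from transformers import T5Config

# ByT5 Small dimensions as stated above, expressed with a generic T5Config.
# This is only a sketch: MrT5's own config additionally specifies the
# delete-gate placement and deletion rate, which are not standard T5 fields.
byt5_small_like = T5Config(
    d_model=1472,            # model dimensionality
    d_ff=3584,               # feed-forward dimensionality
    num_layers=12,           # encoder layers
    num_decoder_layers=4,    # decoder layers
    num_heads=6,             # attention heads per layer
    vocab_size=384,          # byte-level vocabulary used by ByT5
    feed_forward_proj="gated-gelu",
)
```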

MrT5 has an additional *delete gate*, which dynamically reduces the encoder sequence length. In this model, it is placed after the third encoder layer, and all subsequent layers operate on a reduced sequence. This model was trained with a deletion rate of δ=0.5, which means that the model reduces its encoder sequence length by ~50% after the third layer. MrT5's gating mechanism introduces only an additional 3,000 parameters.
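
For intuition, here is a minimal PyTorch sketch of how a per-position gate can shrink an encoder sequence. The class name, the single linear projection, and the hard 0.5 threshold are illustrative assumptions, not MrT5's actual implementation:

```python
import torch
import torch.nn as nn

class DeleteGateSketch(nn.Module):
    """Illustrative per-position gate; names and threshold are assumptions."""

    def __init__(self, d_model: int):
        super().__init__()
        # A single per-position projection has on the order of d_model parameters,
        # consistent with the gate adding only a few thousand parameters overall.
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden_states: torch.Tensor, threshold: float = 0.5):
        # hidden_states: (batch, seq_len, d_model), e.g. the output of encoder layer 3
        gate = torch.sigmoid(self.proj(hidden_states)).squeeze(-1)  # (batch, seq_len)
        keep = gate >= threshold
        # Hard-drop gated-out positions (batch size 1 for simplicity); later
        # encoder layers would then attend over this shorter sequence.
        reduced = hidden_states[:, keep[0], :]
        return reduced, keep

x = torch.randn(1, 16, 1472)              # toy batch with ByT5 Small's d_model
reduced, keep = DeleteGateSketch(1472)(x)
print(x.shape, "->", reduced.shape)
```
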
MrT5 Small is initialized from ByT5 Small and is fine-tuned on the same training objective. Only MrT5's delete gate is randomly initialized before training.

The other distinguishing feature of MrT5 is that it uses [softmax1](https://www.evanmiller.org/attention-is-off-by-one.html) in its attention mechanism.
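
Concretely, softmax1 divides by one plus the sum of exponentials, so the attention weights can sum to less than 1 and a head is allowed to put almost no weight anywhere. Below is a small, numerically stable sketch of the function (my own implementation, not code from the MrT5 repository):

```python
import torch

def softmax1(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)).

    Unlike standard softmax, the outputs can sum to less than 1, so an
    attention head can assign (almost) zero total weight.
    """
    # Shift by the (non-negative) max for numerical stability; the implicit
    # extra zero logit becomes exp(-m) in the denominator.
    m = scores.max(dim=dim, keepdim=True).values.clamp(min=0.0)
    exp_scores = torch.exp(scores - m)
    return exp_scores / (torch.exp(-m) + exp_scores.sum(dim=dim, keepdim=True))

scores = torch.tensor([[2.0, -1.0, 0.5], [-9.0, -9.0, -9.0]])
weights = softmax1(scores)
print(weights.sum(dim=-1))  # each row sums to less than 1; the second is near 0
```
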
## Uses
@@ -160,9 +159,9 @@ When training on the span corruption objective, we calculate the corrupted spans
#### Optimization

MrT5 is trained for 5,000 gradient steps over batches of 2^20 tokens (i.e., an encoder sequence length of 1024 with an effective batch size of 1024). We use the AdamW optimizer with an initial learning rate of 1e-4 with linear decay and no warmup.
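
These settings are straightforward to mirror in PyTorch. The sketch below uses a placeholder model and loss; how the effective batch of 2^20 tokens is assembled (e.g. gradient accumulation or data parallelism) is an assumption about the setup, not something stated above:

```python
import torch

model = torch.nn.Linear(1472, 1472)   # placeholder standing in for MrT5 Small
total_steps = 5_000

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Linear decay from 1e-4 to 0 over 5,000 steps, with no warmup.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
)

for step in range(total_steps):
    x = torch.randn(8, 1472)
    loss = model(x).pow(2).mean()      # placeholder for the span-corruption loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```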

To achieve a specific sequence length reduction rate, we use a PI-controller with a target deletion ratio of δ=0.5, as described in Section 3.2 of the paper. We also use attention score regularization, as described in Appendix D of the paper.
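
The controller itself can be very small. This sketch shows only the general proportional-integral form; the gain values, the sign convention, and the idea of using the output as the weight on the deletion regularizer are illustrative assumptions here, with the actual formulation given in Section 3.2 of the paper:

```python
class PIControllerSketch:
    """Steers the observed deletion ratio toward a target (illustrative only)."""

    def __init__(self, target: float = 0.5, k_p: float = 0.5, k_i: float = 0.05):
        self.target = target
        self.k_p = k_p        # proportional gain (assumed value)
        self.k_i = k_i        # integral gain (assumed value)
        self.integral = 0.0

    def update(self, observed_ratio: float) -> float:
        # Positive error means the model is currently deleting too few tokens.
        error = self.target - observed_ratio
        self.integral += error
        # Use the output as the weight on the deletion regularizer: a larger
        # weight pushes the gate to delete more in subsequent steps.
        return max(0.0, self.k_p * error + self.k_i * self.integral)

controller = PIControllerSketch(target=0.5)
alpha = controller.update(observed_ratio=0.3)  # deleting only 30% -> raise the weight
```
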
## Environmental Impact