juliekallini committed cc3f3f6 (verified) · 1 parent: 0869b35

Update README.md

Files changed (1): README.md (+5 −6)
README.md CHANGED
@@ -91,14 +91,13 @@ This is the model card for the 300M-parameter **MrT5 Small** (`mrt5-small`), a m
 
 ### Model Architecture
 
-MrT5 Small uses the model configuration of the standard ByT5 Small, which has
-$d_\text{ff} = 3584$, $d_\text{model} = 1472$, 12 encoder layers, 4 decoder layers, 6 attention heads in each layer, and 300M total parameters.
+MrT5 Small uses the model configuration of the standard ByT5 Small, which has a feed-forward dimensionality of 3584, a model dimensionality of 1472, 12 encoder layers, 4 decoder layers, 6 attention heads in each layer, and 300M total parameters.
 
-MrT5 has an additional *delete gate*, which dynamically reduces the encoder sequence length. In this model, it is placed after the third encoder layer, and all subsequent layers operate on a reduced sequence. This model was trained with a deletion rate of $\delta=0.5$, which means that the model reduces its encoder sequence length by ~50% after
+MrT5 has an additional *delete gate*, which dynamically reduces the encoder sequence length. In this model, it is placed after the third encoder layer, and all subsequent layers operate on a reduced sequence. This model was trained with a deletion rate of δ=0.5, which means that the model reduces its encoder sequence length by ~50% after
 the third layer. MrT5’s gating mechanism only introduces an additional 3,000 parameters.
 
 MrT5 Small is initialized from ByT5 Small and is fine-tuned on the same training objective. Only MrT5's delete gate is randomly initialized before training.
-The other distinguishing feature of MrT5 is that it uses [$\text{softmax}_1$](https://www.evanmiller.org/attention-is-off-by-one.html) in its attention mechanism.
+The other distinguishing feature of MrT5 is that it uses [softmax1](https://www.evanmiller.org/attention-is-off-by-one.html) in its attention mechanism.
 
 ## Uses
 
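For readers unfamiliar with the linked variant: $\text{softmax}_1$ differs from the standard softmax only by an extra 1 in the denominator, which lets an attention head assign a total attention weight close to zero instead of being forced to distribute a full unit of probability across positions:

$$
\text{softmax}_1(x)_i = \frac{e^{x_i}}{1 + \sum_j e^{x_j}}
$$

When $\sum_j e^{x_j} \gg 1$, this behaves essentially like the standard softmax.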
@@ -160,9 +159,9 @@ When training on the span corruption objective, we calculate the corrupted spans
 
 #### Optimization
 
-MrT5 is trained for 5,000 gradient steps over batches of $2^{20}$ tokens (i.e., an encoder sequence length of 1024 with an effective batch size of 1024). We use the AdamW optimizer with an initial learning rate of $1 \times 10^{-4}$ with linear decay and no warmup.
+MrT5 is trained for 5,000 gradient steps over batches of 2^20 tokens (i.e., an encoder sequence length of 1024 with an effective batch size of 1024). We use the AdamW optimizer with an initial learning rate of 1e-4 with linear decay and no warmup.
 
-To achieve a specific sequence length reduction rate, we use a PI-controller with a target deletion ratio of $\delta=0.5$, as described in Section 3.2 of the paper. We also use attention score regularization, as described in Appendix D of the paper.
+To achieve a specific sequence length reduction rate, we use a PI-controller with a target deletion ratio of δ=0.5, as described in Section 3.2 of the paper. We also use attention score regularization, as described in Appendix D of the paper.
 
 
 ## Environmental Impact
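For the batch-size arithmetic: 1,024 tokens per encoder sequence × an effective batch of 1,024 sequences = 2^20 ≈ 1.05M tokens per gradient step. As a rough sketch of how a PI-controller of the kind referenced above can be wired (the gains, names, and the exact quantity being adjusted are illustrative assumptions for this sketch, not the implementation from Section 3.2 of the paper), the controller compares the observed deletion ratio against the target δ = 0.5 each step and adjusts the weight on a deletion-encouraging loss term:

```python
from dataclasses import dataclass

@dataclass
class PIController:
    """Illustrative PI controller: drives the observed deletion ratio toward a
    target by adjusting the weight on a deletion-encouraging loss term.
    Gains and wiring are assumptions for this sketch, not MrT5's released code."""
    target: float = 0.5   # target deletion ratio (delta)
    k_p: float = 0.5      # proportional gain (illustrative)
    k_i: float = 0.01     # integral gain (illustrative)
    integral: float = 0.0

    def update(self, observed_ratio: float) -> float:
        error = self.target - observed_ratio  # > 0 means the model deletes too little
        self.integral += error
        # Return a non-negative weight for the deletion loss term.
        return max(0.0, self.k_p * error + self.k_i * self.integral)

# Schematic use inside a training step:
# controller = PIController(target=0.5)
# alpha = controller.update(observed_deletion_ratio)  # fraction of positions the gate removes
# loss = span_corruption_loss + alpha * deletion_loss
```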
 