MrT5 is trained for 5,000 gradient steps over batches of 2^20 tokens (i.e., an encoder sequence length of 1024 with an effective batch size of 1024). We use the AdamW optimizer with an initial learning rate of 1e-4 with linear decay and no warmup.
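The optimizer setup above can be sketched in PyTorch; this is a minimal illustration (the model stand-in and variable names are placeholders, not the repo's actual training script): AdamW at 1e-4, linear decay to zero over 5,000 steps, no warmup.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the MrT5 model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

TOTAL_STEPS = 5_000
# Linear decay from the initial LR to 0 over TOTAL_STEPS, with no warmup phase.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / TOTAL_STEPS)
)
```

In a training loop, `scheduler.step()` is called once per gradient step, so the learning rate reaches half its initial value at step 2,500 and zero at step 5,000.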
To achieve a specific sequence length reduction rate, we use a PI controller with a target deletion ratio of δ=0.5, as described in Section 3.2 of the paper. We also use attention score regularization, as described in Appendix D of the paper.
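A PI controller of this kind can be sketched as follows; this is a hedged illustration of the general mechanism, not the paper's exact formulation — the gains `k_p` and `k_i`, the clamping, and the way `alpha` enters the loss are all assumptions for exposition.

```python
class PIController:
    """Proportional-integral controller that adjusts a regularizer weight
    alpha so the model's observed deletion ratio tracks a target delta."""

    def __init__(self, target, k_p=0.5, k_i=0.01):
        self.target = target      # desired deletion ratio, e.g. 0.5
        self.k_p = k_p            # proportional gain (illustrative value)
        self.k_i = k_i            # integral gain (illustrative value)
        self.integral = 0.0       # accumulated error

    def update(self, observed_ratio):
        # Error is positive when the model deletes less than the target,
        # which should push the deletion penalty weight up.
        error = self.target - observed_ratio
        self.integral += error
        # Clamp at zero so alpha never rewards deleting too little.
        return max(0.0, self.k_p * error + self.k_i * self.integral)
```

Each training step, the controller would be fed the current deletion ratio and the returned `alpha` would scale the deletion loss term, steering the model toward the target ratio of δ=0.5.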
## Environmental Impact