MrT5 is trained for 5,000 gradient steps over batches of 2^20 tokens (i.e., an encoder sequence length of 1024 with an effective batch size of 1024). We use the AdamW optimizer with an initial learning rate of 1e-4 with linear decay and no warmup.
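The optimizer setup above can be sketched in PyTorch; this is a minimal illustration (the model stand-in and variable names are placeholders, not the repo's actual training script): AdamW at 1e-4, linear decay to zero over 5,000 steps, no warmup.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the MrT5 model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

TOTAL_STEPS = 5_000
# Linear decay from the initial LR to 0 over TOTAL_STEPS, with no warmup phase.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / TOTAL_STEPS)
)
```

In a training loop, `scheduler.step()` is called once per gradient step, so the learning rate reaches half its initial value at step 2,500 and zero at step 5,000.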
To achieve a specific sequence length reduction rate, we use a PI controller with a target deletion ratio of δ=0.5, as described in Section 3.2 of the paper. We also use attention score regularization, as described in Appendix D of the paper.
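A PI controller of this kind can be sketched as follows; this is a hedged illustration of the general mechanism, not the paper's exact formulation — the gains `k_p` and `k_i`, the clamping, and the way `alpha` enters the loss are all assumptions for exposition.

```python
class PIController:
    """Proportional-integral controller that adjusts a regularizer weight
    alpha so the model's observed deletion ratio tracks a target delta."""

    def __init__(self, target, k_p=0.5, k_i=0.01):
        self.target = target      # desired deletion ratio, e.g. 0.5
        self.k_p = k_p            # proportional gain (illustrative value)
        self.k_i = k_i            # integral gain (illustrative value)
        self.integral = 0.0       # accumulated error

    def update(self, observed_ratio):
        # Error is positive when the model deletes less than the target,
        # which should push the deletion penalty weight up.
        error = self.target - observed_ratio
        self.integral += error
        # Clamp at zero so alpha never rewards deleting too little.
        return max(0.0, self.k_p * error + self.k_i * self.integral)
```

Each training step, the controller would be fed the current deletion ratio and the returned `alpha` would scale the deletion loss term, steering the model toward the target ratio of δ=0.5.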
## Environmental Impact