juliekallini committed on
Commit 57c4322 · verified · 1 Parent(s): 70b9781

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -161,7 +161,7 @@ When training on the span corruption objective, we calculate the corrupted spans
 
  MrT5 is trained for 5,000 gradient steps over batches of 2^20 tokens (i.e., an encoder sequence length of 1024 with an effective batch size of 1024). We use the AdamW optimizer with an initial learning rate of 1e-4 with linear decay and no warmup.
 
- To achieve a specific sequence length reduction rate, we use a PI-controller with a target deletion ratio of δ=0.5, as described in Section 3.2 of the paper. We also use attention score regularization, as described in Appendix D of the paper.
+ To achieve a specific sequence length reduction rate, we use a PI controller with a target deletion ratio of δ=0.5, as described in Section 3.2 of the paper. We also use attention score regularization, as described in Appendix D of the paper.
 
 
  ## Environmental Impact
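
For readers unfamiliar with the term in the changed line, the following is a minimal sketch of a generic proportional-integral (PI) controller steering an observed deletion ratio toward the target δ=0.5. It is not taken from the MrT5 code or paper: the class name `PIController`, the gains `k_p` and `k_i`, the clamp at zero, and the reading of the output as a regularization weight are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Generic PI controller (illustrative; not the MrT5 implementation)."""

    target: float          # target deletion ratio, e.g. 0.5 as in the README
    k_p: float             # proportional gain (hypothetical value)
    k_i: float             # integral gain (hypothetical value)
    integral: float = 0.0  # running sum of past errors

    def update(self, observed: float) -> float:
        """Return a non-negative control signal from the observed ratio."""
        error = self.target - observed
        self.integral += error
        # Assumption: clamp at zero so the signal can act as a loss weight.
        return max(0.0, self.k_p * error + self.k_i * self.integral)


# Hypothetical gains; the values used for MrT5 are not given in this diff.
controller = PIController(target=0.5, k_p=0.5, k_i=1e-3)
alpha = controller.update(observed=0.3)  # deleting too little, so alpha > 0
```

In this sketch the signal grows while the observed ratio stays below the target and is clamped to zero when the ratio overshoots; the integral term removes the steady-state offset that a purely proportional controller would leave.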