Update README.md

This is the model card for the 300M-parameter **MrT5 Small** (`mrt5-small`), a m…

- **Developed by:** Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, Róbert Csordás
- **Model type:** MrT5
- **Languages:** English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu
- **Fine-tuned from model:** [google/byt5-small](https://huggingface.co/google/byt5-small)
- **Sources for more information**:
  - [GitHub Repository](https://github.com/jkallini/mrt5)
  - [Paper](https://arxiv.org/abs/2410.20771)

MrT5 Small uses the model configuration of the standard ByT5 Small, which has a…

MrT5 has an additional *delete gate*, which dynamically reduces the encoder sequence length. In this model, it is placed after the third encoder layer, and all subsequent layers operate on a reduced sequence. This model was trained with a deletion rate of δ=0.5, which means that the model reduces its encoder sequence length by ~50% after the third layer. MrT5’s gating mechanism only introduces an additional 3,000 parameters.
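
As a rough, schematic illustration of this kind of hard deletion (a toy sketch only: the tensor sizes, gate scores, and 0.5 cutoff below are invented and do not come from the repository's implementation), dropping low-scoring positions after the third layer leaves the later layers with a shorter sequence:

```python
import torch

# Toy illustration (not the repository's implementation): after encoder layer 3,
# positions with low gate scores are removed, and the remaining layers operate
# on the shorter sequence.
hidden = torch.randn(1, 8, 512)   # (batch, seq_len, d_model); toy sizes
gate_scores = torch.rand(1, 8)    # hypothetical per-position gate outputs in [0, 1]

keep = gate_scores > 0.5          # at a deletion rate of ~0.5, roughly half survive
reduced = hidden[keep].unsqueeze(0)

print(hidden.shape, "->", reduced.shape)  # e.g. (1, 8, 512) -> (1, ~4, 512)
```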

MrT5 Small is initialized from ByT5 Small and fine-tuned on the same training objective. Only MrT5's delete gate is randomly initialized before training.
The other distinguishing feature of MrT5 is that it uses [softmax1](https://www.evanmiller.org/attention-is-off-by-one.html) in its attention mechanism.
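
For reference, softmax1 simply adds 1 to the softmax denominator, so an attention head can assign a total weight of less than one to its inputs. A minimal PyTorch version following the linked post (not necessarily the exact code used in the MrT5 repository) is:

```python
import torch

def softmax1(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Softmax with an extra +1 in the denominator, so the attention weights
    can sum to less than one (a head may effectively abstain)."""
    # Shift by a non-negative max for numerical stability; the +1 term is
    # rescaled by exp(-m) so the result is mathematically unchanged.
    m = x.max(dim=dim, keepdim=True).values.clamp(min=0)
    e = torch.exp(x - m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))
```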

## Uses

This model is an encoder-decoder architecture designed primarily for sequence-to-sequence tasks. While it can be used as-is for exploratory or academic purposes, fine-tuning is recommended to achieve optimal performance on specific downstream tasks.

To leverage the model’s deletion feature, please use the custom **MrT5Trainer** available in the [accompanying repository](https://github.com/jkallini/mrt5). This specialized trainer ensures that the deletion mechanism is properly maintained and integrated during fine-tuning.
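
As a quick sketch of the seq2seq interface, the snippet below loads `google/byt5-small` as a stand-in, since MrT5's own model classes ship with the accompanying repository rather than with `transformers`; the byte-level tokenizer and the forward-pass pattern are the same:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Stand-in model: ByT5 Small shares MrT5 Small's byte-level tokenizer and
# seq2seq interface. To run MrT5 itself, load the model through the classes
# provided at github.com/jkallini/mrt5.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

model_inputs = tokenizer(
    ["Life is like a box of chocolates.", "Today is Monday."],
    padding="longest", return_tensors="pt"
)
labels = tokenizer(
    ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."],
    padding="longest", return_tensors="pt"
).input_ids

loss = model(**model_inputs, labels=labels).loss  # forward pass
```

Swapping the stand-in for the MrT5 Small checkpoint loaded via the repository's code is what exercises the delete gate.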

### Training Data

For continued pre-training, we use the [multilingual C4 (mC4) corpus](https://huggingface.co/datasets/allenai/c4) ([Raffel et al., 2020](https://arxiv.org/abs/1910.10683); [Xue et al., 2021](https://arxiv.org/abs/2010.11934)). MrT5 is trained on 15 typologically diverse languages: English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu.
To avoid training models for multiple epochs, we ensure that the samples drawn from the mC4 corpus are sufficiently large. Additionally, we extract equal-sized samples for each language (in terms of bytes) from the mC4 training split.
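
A rough sketch of this per-language, equal-byte sampling is given below (illustrative only: the byte budget is invented, and the mC4 configuration names passed to `load_dataset` are assumptions to verify against the dataset card):

```python
from datasets import load_dataset

LANGS = ["en", "fr", "es", "de", "el", "bg", "ru", "tr", "ar", "vi", "th", "zh", "hi", "sw", "ur"]
BYTE_BUDGET = 1_000_000_000  # hypothetical per-language sample size, in bytes

def sample_equal_bytes(lang: str):
    """Stream one mC4 language split and keep documents until the byte budget is hit."""
    stream = load_dataset("allenai/c4", lang, split="train", streaming=True)  # config name assumed
    docs, used = [], 0
    for example in stream:
        text = example["text"]
        docs.append(text)
        used += len(text.encode("utf-8"))
        if used >= BYTE_BUDGET:
            break
    return docs

samples = {lang: sample_equal_bytes(lang) for lang in LANGS}
```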

### Training Procedure