Update README.md

This is the model card for the 300M-parameter **MrT5 Small** (`mrt5-small`), a m…

- **Developed by:** Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, Róbert Csordás
- **Model type:** MrT5
- **Languages:** English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu
- **Fine-tuned from model:** [google/byt5-small](https://huggingface.co/google/byt5-small)
- **Sources for more information**:
  - [GitHub Repository](https://github.com/jkallini/mrt5)
  - [Paper](https://arxiv.org/abs/2410.20771)

MrT5 Small uses the model configuration of the standard ByT5 Small, which has a…

MrT5 has an additional *delete gate*, which dynamically reduces the encoder sequence length. In this model, it is placed after the third encoder layer, and all subsequent layers operate on a reduced sequence. This model was trained with a deletion rate of δ=0.5, which means that the model reduces its encoder sequence length by ~50% after the third layer. MrT5’s gating mechanism only introduces an additional 3,000 parameters.
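
As a rough, schematic illustration of this kind of hard deletion (a toy sketch only: the tensor sizes, gate scores, and 0.5 cutoff below are invented and do not come from the repository's implementation), dropping low-scoring positions after the third layer leaves the later layers with a shorter sequence:

```python
import torch

# Toy illustration (not the repository's implementation): after encoder layer 3,
# positions with low gate scores are removed, and the remaining layers operate
# on the shorter sequence.
hidden = torch.randn(1, 8, 512)   # (batch, seq_len, d_model); toy sizes
gate_scores = torch.rand(1, 8)    # hypothetical per-position gate outputs in [0, 1]

keep = gate_scores > 0.5          # at a deletion rate of ~0.5, roughly half survive
reduced = hidden[keep].unsqueeze(0)

print(hidden.shape, "->", reduced.shape)  # e.g. (1, 8, 512) -> (1, ~4, 512)
```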

MrT5 Small is initialized from ByT5 Small and fine-tuned on the same training objective. Only MrT5's delete gate is randomly initialized before training.
The other distinguishing feature of MrT5 is that it uses [softmax1](https://www.evanmiller.org/attention-is-off-by-one.html) in its attention mechanism.
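
For reference, softmax1 simply adds 1 to the softmax denominator, so an attention head can assign a total weight of less than one to its inputs. A minimal PyTorch version following the linked post (not necessarily the exact code used in the MrT5 repository) is:

```python
import torch

def softmax1(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Softmax with an extra +1 in the denominator, so the attention weights
    can sum to less than one (a head may effectively abstain)."""
    # Shift by a non-negative max for numerical stability; the +1 term is
    # rescaled by exp(-m) so the result is mathematically unchanged.
    m = x.max(dim=dim, keepdim=True).values.clamp(min=0)
    e = torch.exp(x - m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))
```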

## Uses

This model is an encoder-decoder architecture designed primarily for sequence-to-sequence tasks. While it can be used as-is for exploratory or academic purposes, fine-tuning is recommended to achieve optimal performance on specific downstream tasks.

To leverage the model’s deletion feature, please use the custom **MrT5Trainer** available in the [accompanying repository](https://github.com/jkallini/mrt5). This specialized trainer ensures that the deletion mechanism is properly maintained and integrated during fine-tuning.
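
As a quick sketch of the seq2seq interface, the snippet below loads `google/byt5-small` as a stand-in, since MrT5's own model classes ship with the accompanying repository rather than with `transformers`; the byte-level tokenizer and the forward-pass pattern are the same:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Stand-in model: ByT5 Small shares MrT5 Small's byte-level tokenizer and
# seq2seq interface. To run MrT5 itself, load the model through the classes
# provided at github.com/jkallini/mrt5.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

model_inputs = tokenizer(
    ["Life is like a box of chocolates.", "Today is Monday."],
    padding="longest", return_tensors="pt"
)
labels = tokenizer(
    ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."],
    padding="longest", return_tensors="pt"
).input_ids

loss = model(**model_inputs, labels=labels).loss  # forward pass
```

Swapping the stand-in for the MrT5 Small checkpoint loaded via the repository's code is what exercises the delete gate.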

### Training Data

For continued pre-training, we use the [multilingual C4 (mC4) corpus](https://huggingface.co/datasets/allenai/c4) ([Raffel et al., 2020](https://arxiv.org/abs/1910.10683); [Xue et al., 2021](https://arxiv.org/abs/2010.11934)). MrT5 is trained on 15 typologically diverse languages: English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu.
To avoid training models for multiple epochs, we ensure that the samples drawn from the mC4 corpus are sufficiently large. Additionally, we extract equal-sized samples for each language (in terms of bytes) from the mC4 training split.
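
A rough sketch of this per-language, equal-byte sampling is given below (illustrative only: the byte budget is invented, and the mC4 configuration names passed to `load_dataset` are assumptions to verify against the dataset card):

```python
from datasets import load_dataset

LANGS = ["en", "fr", "es", "de", "el", "bg", "ru", "tr", "ar", "vi", "th", "zh", "hi", "sw", "ur"]
BYTE_BUDGET = 1_000_000_000  # hypothetical per-language sample size, in bytes

def sample_equal_bytes(lang: str):
    """Stream one mC4 language split and keep documents until the byte budget is hit."""
    stream = load_dataset("allenai/c4", lang, split="train", streaming=True)  # config name assumed
    docs, used = [], 0
    for example in stream:
        text = example["text"]
        docs.append(text)
        used += len(text.encode("utf-8"))
        if used >= BYTE_BUDGET:
            break
    return docs

samples = {lang: sample_equal_bytes(lang) for lang in LANGS}
```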

### Training Procedure