juliekallini committed (verified)
Commit 70b9781
1 Parent(s): a8660d7

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -84,7 +84,7 @@ This is the model card for the 300M-parameter **MrT5 Small** (`mrt5-small`), a m
  - **Developed by:** Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, Róbert Csordás
  - **Model type:** MrT5
  - **Languages:** English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu
- - **Finetuned from model:** [google/byt5-small](https://huggingface.co/google/byt5-small)
+ - **Fine-tuned from model:** [google/byt5-small](https://huggingface.co/google/byt5-small)
  - **Sources for more information**:
  - [GitHub Repository](https://github.com/jkallini/mrt5)
  - [Paper](https://arxiv.org/abs/2410.20771)
@@ -96,12 +96,12 @@ MrT5 Small uses the model configuration of the standard ByT5 Small, which has a
  MrT5 has an additional *delete gate*, which dynamically reduces the encoder sequence length. In this model, it is placed after the third encoder layer, and all subsequent layers operate on a reduced sequence. This model was trained with a deletion rate of δ=0.5, which means that the model reduces its encoder sequence length by ~50% after
  the third layer. MrT5’s gating mechanism only introduces an additional 3,000 parameters.

- MrT5 Small is initialized from ByT5 Small and is fine-tuned on the same training objective. Only MrT5's delete gate is randomly initialized before training.
+ MrT5 Small is initialized from ByT5 Small and fine-tuned on the same training objective. Only MrT5's delete gate is randomly initialized before training.
  The other distinguishing feature of MrT5 is that it uses [softmax1](https://www.evanmiller.org/attention-is-off-by-one.html) in its attention mechanism.

  ## Uses

- This model is an encoder-decoder architecture designed primarily for sequence-to-sequence tasks. While it can be used as-is for exploratory or academic purposes, it is recommended to fine-tune the model to achieve optimal performance on specific downstream tasks.
+ This model is an encoder-decoder architecture designed primarily for sequence-to-sequence tasks. While it can be used as-is for exploratory or academic purposes, fine-tuning is recommended to achieve optimal performance on specific downstream tasks.

  To leverage the model’s deletion feature, please use the custom **MrT5Trainer** available in the [accompanying repository](https://github.com/jkallini/mrt5). This specialized trainer ensures that the deletion mechanism is properly maintained and integrated during fine-tuning.

@@ -145,7 +145,7 @@ loss = model(**model_inputs, labels=labels).loss # forward pass

  ### Training Data

- For continued pre-training, we use the [multilingual C4 (mC4) corpus](https://huggingface.co/datasets/allenai/c4) ([Raffel et al., 2020](https://arxiv.org/abs/1910.10683); [Xue et al., 2021](https://arxiv.org/abs/2010.11934)). We MrT5 on 15 typologically diverse languages: English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu.
+ For continued pre-training, we use the [multilingual C4 (mC4) corpus](https://huggingface.co/datasets/allenai/c4) ([Raffel et al., 2020](https://arxiv.org/abs/1910.10683); [Xue et al., 2021](https://arxiv.org/abs/2010.11934)). MrT5 is trained on 15 typologically diverse languages: English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu.
  To avoid training models for multiple epochs, we ensure that the samples drawn from the mC4 corpus are sufficiently large. Additionally, we extract equal-sized samples for each language (in terms of bytes) from the mC4 training split.

  ### Training Procedure
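Below the diff, a minimal usage sketch based on the card's description (a ByT5-style byte-level encoder-decoder) and the forward-pass line visible in the third hunk's header. This is not from the commit itself: the repository id `stanfordnlp/mrt5-small`, the use of the `google/byt5-small` tokenizer, and the need for `trust_remote_code=True` are all assumptions, since MrT5's delete gate and softmax1 attention are custom components not shipped with `transformers`. Check the model card's actual repository id before use.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumption: MrT5 reuses ByT5's byte-level tokenizer (the card says it is
# fine-tuned from google/byt5-small and operates on raw UTF-8 bytes).
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

# Hypothetical repository id; substitute the id shown on this model card.
# trust_remote_code=True is an assumption: the delete gate is a custom architecture.
model = AutoModelForSeq2SeqLM.from_pretrained("stanfordnlp/mrt5-small", trust_remote_code=True)

# A toy input/target pair just to exercise the forward pass.
model_inputs = tokenizer(["Life is like a box of chocolates."], return_tensors="pt")
labels = tokenizer(["La vie est comme une boîte de chocolat."], return_tensors="pt").input_ids

loss = model(**model_inputs, labels=labels).loss  # forward pass, as in the card's own snippet
print(float(loss))
```

For fine-tuning, the card itself points to the custom **MrT5Trainer** in the GitHub repository rather than the stock `Trainer`, so that the deletion mechanism is handled correctly.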
 
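The training-data hunk describes drawing equal-sized (in bytes) per-language samples from mC4. The sketch below shows one way such a byte-budgeted sample could be drawn with the `datasets` library; it is an illustration under stated assumptions, not the authors' preprocessing code. The per-language config name passed to `load_dataset` and the byte budget are assumptions; consult the allenai/c4 dataset card for the exact config or `data_files` pattern for each language.

```python
from datasets import load_dataset  # Hugging Face `datasets`

LANGS = ["en", "fr", "es", "de", "el", "bg", "ru", "tr", "ar", "vi", "th", "zh", "hi", "sw", "ur"]
BYTE_BUDGET = 10_000_000  # illustrative per-language target, not the paper's figure

def sample_language(lang: str, byte_budget: int):
    """Yield mC4 documents for `lang` until roughly `byte_budget` UTF-8 bytes are collected."""
    # Assumption: a per-language mC4 split can be streamed from allenai/c4 by config
    # name; check the dataset card for the exact config or data_files pattern.
    stream = load_dataset("allenai/c4", lang, split="train", streaming=True)
    collected = 0
    for example in stream:
        text = example["text"]
        collected += len(text.encode("utf-8"))
        yield text
        if collected >= byte_budget:
            break

# Equal byte budgets per language approximate the "equal-sized samples (in bytes)"
# described in the Training Data section.
samples = {lang: list(sample_language(lang, BYTE_BUDGET)) for lang in LANGS}
```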