Update README.md
README.md (CHANGED)

@@ -16,8 +16,8 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Training***

**************************** **Updates** ****************************
* 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M! Below, we show the performance trend on downstream tasks throughout the pre-training process:
<p align='left'>
  <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png" width="500">
</p>
* 2024/6/21: Released the [paper](https://huggingface.co/papers/2406.14491), [code](https://github.com/microsoft/LMOps), and [resources](https://huggingface.co/instruction-pretrain)

@@ -100,6 +100,8 @@ for index, pair in enumerate(instruction_response_pairs):
```

### Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale

We use vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 2 days to synthesize instruction-response pairs for 1 billion tokens of raw corpora.
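
The numbered steps below are the supported path; purely for orientation, here is a minimal sketch of what the vLLM-based synthesis call amounts to. The checkpoint id and the prompt template are illustrative assumptions, not the exact recipe used by the released scripts.

```python
# Minimal sketch (assumptions marked below), not the repo's official script.
from vllm import LLM, SamplingParams

raw_texts = [
    "Tokyo is the capital of Japan and one of the world's largest cities ...",
    "Photosynthesis converts light energy into chemical energy ...",
]

# Assumed prompt template; the real one is defined by the instruction-synthesizer recipe.
prompts = [
    f"{text}\n\nBased on the text above, write a question and then answer it:"
    for text in raw_texts
]

# Assumed checkpoint id for the instruction synthesizer.
llm = LLM(model="instruction-pretrain/instruction-synthesizer")
params = SamplingParams(temperature=0.0, max_tokens=400)

# vLLM batches and schedules the prompts on the GPU, which is where the speedup comes from.
for request_output in llm.generate(prompts, params):
    print(request_output.outputs[0].text)  # synthesized instruction-response pair(s)
```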

1. Set up dependencies:

```bash

@@ -160,13 +162,19 @@ for idx, entry in enumerate(prev_examples):
    # change the random seed for each entry for diversity
    instruction_augmented_texts.extend(texts)

# 3. print out the instruction_augmented_texts
for idx, text in enumerate(instruction_augmented_texts):
    print(text)

# Now you can use `instruction_augmented_texts` for pre-training!
```
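
If you want to hand the synthesized corpus to a standard pre-training pipeline, one simple option (not part of the original example) is to dump it to JSONL; the file name below is arbitrary.

```python
# Continues from the snippet above: persist `instruction_augmented_texts`
# so a pre-training data loader can stream it later.
import json

with open("instruction_augmented_corpora.jsonl", "w", encoding="utf-8") as f:
    for text in instruction_augmented_texts:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```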

**Pre-Training Suggestions:**

Except for the pre-training data, *Instruction Pre-Training* keeps all other pre-training settings the same as *Vanilla Pre-Training*.

1. For general pre-training from scratch, we recommend setting M = 2 and mixing the instruction-augmented corpora with unchanged raw corpora.
2. For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) at a 1:1 ratio (counted by tokens); see the mixing sketch after this list.
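
As a rough illustration of suggestion 2, the sketch below adds OpenOrca examples until the two sides contribute roughly equal token counts. The tokenizer choice and the way each OpenOrca row is rendered into plain text (its `question`/`response` fields) are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch: build a ~1:1 (by tokens) mix of domain instruction-augmented texts
# and general instructions from OpenOrca. All concrete choices here are illustrative.
import random

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for counting

domain_texts = instruction_augmented_texts  # e.g., the list built in the example above
domain_token_count = sum(len(tokenizer.encode(t)) for t in domain_texts)

# Stream OpenOrca and keep adding examples until its token count matches the domain side.
general = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)
mixed, general_token_count = list(domain_texts), 0
for example in general:
    text = f"{example['question']}\n{example['response']}"  # assumed field names
    mixed.append(text)
    general_token_count += len(tokenizer.encode(text))
    if general_token_count >= domain_token_count:
        break

random.shuffle(mixed)  # shuffle before packing into pre-training sequences
```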

## Citation
If you find our work helpful, please cite us: