Update README.md
README.md (CHANGED)

@@ -16,8 +16,8 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Training***

**************************** **Updates** ****************************
* 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M! Below, we show the performance trend on downstream tasks throughout the pre-training process:
<p align='left'>
  <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png" width="500">
</p>
* 2024/6/21: Released the [paper](https://huggingface.co/papers/2406.14491), [code](https://github.com/microsoft/LMOps), and [resources](https://huggingface.co/instruction-pretrain)

@@ -100,6 +100,8 @@ for index, pair in enumerate(instruction_response_pairs):
```

### Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale

We use vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 2 days to synthesize instruction-response pairs for 1 billion tokens of raw corpora.
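
The numbered steps below are the supported path; purely for orientation, here is a minimal sketch of what the vLLM-based synthesis call amounts to. The checkpoint id and the prompt template are illustrative assumptions, not the exact recipe used by the released scripts.

```python
# Minimal sketch (assumptions marked below), not the repo's official script.
from vllm import LLM, SamplingParams

raw_texts = [
    "Tokyo is the capital of Japan and one of the world's largest cities ...",
    "Photosynthesis converts light energy into chemical energy ...",
]

# Assumed prompt template; the real one is defined by the instruction-synthesizer recipe.
prompts = [
    f"{text}\n\nBased on the text above, write a question and then answer it:"
    for text in raw_texts
]

# Assumed checkpoint id for the instruction synthesizer.
llm = LLM(model="instruction-pretrain/instruction-synthesizer")
params = SamplingParams(temperature=0.0, max_tokens=400)

# vLLM batches and schedules the prompts on the GPU, which is where the speedup comes from.
for request_output in llm.generate(prompts, params):
    print(request_output.outputs[0].text)  # synthesized instruction-response pair(s)
```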

1. Set up dependencies:

```bash

@@ -160,13 +162,19 @@ for idx, entry in enumerate(prev_examples):
    # change the random seed for each entry for diversity
    instruction_augmented_texts.extend(texts)

# 3. print out the instruction_augmented_texts
for idx, text in enumerate(instruction_augmented_texts):
    print(text)

# Now you can use `instruction_augmented_texts` for pre-training!
```
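
If you want to hand the synthesized corpus to a standard pre-training pipeline, one simple option (not part of the original example) is to dump it to JSONL; the file name below is arbitrary.

```python
# Continues from the snippet above: persist `instruction_augmented_texts`
# so a pre-training data loader can stream it later.
import json

with open("instruction_augmented_corpora.jsonl", "w", encoding="utf-8") as f:
    for text in instruction_augmented_texts:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```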

**Pre-Training Suggestions:**

Except for the pre-training data, *Instruction Pre-Training* keeps all other pre-training settings the same as *Vanilla Pre-Training*.

1. For general pre-training from scratch, we recommend setting M = 2 and mixing the instruction-augmented corpora with unchanged raw corpora.
2. For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) at a 1:1 ratio (counted by tokens); see the mixing sketch after this list.
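
As a rough illustration of suggestion 2, the sketch below adds OpenOrca examples until the two sides contribute roughly equal token counts. The tokenizer choice and the way each OpenOrca row is rendered into plain text (its `question`/`response` fields) are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch: build a ~1:1 (by tokens) mix of domain instruction-augmented texts
# and general instructions from OpenOrca. All concrete choices here are illustrative.
import random

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for counting

domain_texts = instruction_augmented_texts  # e.g., the list built in the example above
domain_token_count = sum(len(tokenizer.encode(t)) for t in domain_texts)

# Stream OpenOrca and keep adding examples until its token count matches the domain side.
general = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)
mixed, general_token_count = list(domain_texts), 0
for example in general:
    text = f"{example['question']}\n{example['response']}"  # assumed field names
    mixed.append(text)
    general_token_count += len(tokenizer.encode(text))
    if general_token_count >= domain_token_count:
        break

random.shuffle(mixed)  # shuffle before packing into pre-training sequences
```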

## Citation
If you find our work helpful, please cite us: