Tags: Text Generation · Transformers · Safetensors · English · mistral · text-generation-inference
instruction-pretrain committed · verified
Commit f894ca2 · 1 Parent(s): 7a0b2da

Update README.md

Files changed (1)
  1. README.md +12 -4
README.md CHANGED
@@ -16,8 +16,8 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Tra
 
 **************************** **Updates** ****************************
 * 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M! Below, we show the performance trend on downstream tasks throughout the pre-training process:
-<p align='center'>
-<img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png" width="700">
+<p align='left'>
+<img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png" width="500">
 </p>
 * 2024/6/21: Released the [paper](https://huggingface.co/papers/2406.14491), [code](https://github.com/microsoft/LMOps), and [resources](https://huggingface.co/instruction-pretrain)
 
@@ -100,6 +100,8 @@ for index, pair in enumerate(instruction_response_pairs):
 ```
 
 ### Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale
+We use vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 2 days to synthesize instruction-response pairs for 1 billion tokens of raw corpora.
+
 1. Set up dependencies:
 
 ```bash
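
Editor's note: the sentence added in the hunk above refers to vLLM's offline batched generation. Below is a minimal sketch of that pattern, assuming the synthesizer checkpoint id `instruction-pretrain/instruction-synthesizer` and plain pass-through prompts; the repo's released script builds prompts with its own templating helpers, so treat this as an illustration of the vLLM API rather than the authors' pipeline.

```python
# Minimal sketch: batch instruction synthesis with vLLM offline inference.
# Assumptions (not from this diff): the checkpoint id below and the use of
# raw texts as prompts; in practice, prompts are built with the README's
# templating code before being passed to the model.
from vllm import LLM, SamplingParams

raw_texts = [
    "Hubble's law states that galaxies recede at speeds proportional to their distance.",
    "The Krebs cycle is the central pathway of aerobic cellular respiration.",
]

llm = LLM(model="instruction-pretrain/instruction-synthesizer")  # assumed checkpoint id
sampling_params = SamplingParams(temperature=0.0, max_tokens=400)  # greedy decoding, bounded output

# vLLM schedules all prompts in a single generate() call, which is where the
# speed-up over per-example generation comes from.
outputs = llm.generate(raw_texts, sampling_params)

for raw, out in zip(raw_texts, outputs):
    synthesized = out.outputs[0].text  # synthesized instruction-response pairs for this raw text
    print(f"{raw}\n{synthesized}\n")
```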
@@ -160,13 +162,19 @@ for idx, entry in enumerate(prev_examples):
 # change random seed for each entry for diversity
 instruction_augmented_texts.extend(texts)
 
-# 3. print out the results
+# 3. print out the instruction_augmented_texts
 for idx, text in enumerate(instruction_augmented_texts):
-  print(f'## Instruction-augmented Text {idx+1}\n{text}\n')
+  print(text)
 
 # Now you can use `instruction_augmented_texts` for pre-training!
 ```
 
+**Pre-Training Suggestions:**
+
+Except for the pre-training data, *Instruction Pre-Training* keeps all other pre-training settings the same as *Vanilla Pre-Training*.
+
+1. For general pre-training from scratch, we recommend setting M = 2 and mixing the instruction-augmented corpora with unchanged raw corpora.
+2. For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) at a 1:1 ratio (counted by tokens).
 
 ## Citation
 If you find our work helpful, please cite us:
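
Editor's note: as a rough illustration of Pre-Training Suggestion 2 in the hunk above (mixing instruction-augmented corpora with OpenOrca general instructions at a 1:1 ratio counted by tokens), here is a hedged sketch that trims whichever side is larger until both contribute roughly the same number of tokens. The tokenizer id and the in-memory list-of-strings representation are illustrative assumptions, not the authors' data pipeline.

```python
# Hedged sketch: mix two corpora at a ~1:1 ratio counted by tokens.
# Assumptions (not from this diff): any tokenizer consistent with the target
# model works; "mistralai/Mistral-7B-v0.1" is used here only as an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def num_tokens(text: str) -> int:
    # Token count of a single document, excluding special tokens.
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])

def take_up_to(texts, token_budget):
    # Greedily take documents until the token budget is exhausted.
    taken, used = [], 0
    for t in texts:
        n = num_tokens(t)
        if used + n > token_budget:
            break
        taken.append(t)
        used += n
    return taken

def mix_one_to_one(instruction_augmented_texts, general_instruction_texts):
    # Both sides contribute at most the token count of the smaller side.
    budget = min(sum(num_tokens(t) for t in instruction_augmented_texts),
                 sum(num_tokens(t) for t in general_instruction_texts))
    return (take_up_to(instruction_augmented_texts, budget)
            + take_up_to(general_instruction_texts, budget))

# Usage: mixed = mix_one_to_one(instruction_augmented_texts, openorca_texts)
```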
 