Update README.md
README.md
CHANGED
@@ -119,10 +119,10 @@ The models have been pre-trained using a blend of the following datasets.
 
 ### Mid-training
 
-In the LLM-jp-3.1 series, we performed
+In the LLM-jp-3.1 series, we performed continuous pre-training based on [Instruction Pre-Training](https://aclanthology.org/2024.emnlp-main.148/).
 Instruction Pre-Training enhances a model’s ability to follow instructions by continuing pre-training on a large collection of instruction–response pairs.
-We prepared approximately 90B tokens of instruction–response data and mixed it with our pre-training datasets, conducting
-Each model was initialized from existing checkpoints ([llm-jp/llm-jp-3-1.8b](https://huggingface.co/llm-jp/llm-jp-3-1.8b), [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b), and [llm-jp/llm-jp-3-8x13b](https://huggingface.co/llm-jp/llm-jp-3-8x13b)) and underwent
+We prepared approximately 90B tokens of instruction–response data and mixed it with our pre-training datasets, conducting continuous pre-training on a total of 400B tokens.
+Each model was initialized from existing checkpoints ([llm-jp/llm-jp-3-1.8b](https://huggingface.co/llm-jp/llm-jp-3-1.8b), [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b), and [llm-jp/llm-jp-3-8x13b](https://huggingface.co/llm-jp/llm-jp-3-8x13b)) and underwent continuous instruction pre-training.
 Since the LLM-jp-3 series was originally pre-trained on 2.1T tokens, the total pre-training token count amounts to 2.5T tokens.
 
 Details of this training process will be released in a forthcoming paper. The instruction–response dataset used for this training will also be made publicly available.
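For a concrete picture of the blending step described in this hunk, here is a minimal sketch using the Hugging Face `datasets` and `transformers` libraries. The instruction–response corpus has not been released yet, so `pretrain_corpus.jsonl` and `instruction_corpus.jsonl` are hypothetical placeholder files, and the sampling probabilities only approximate the stated ~90B-of-400B token ratio at the example level rather than the token level.

```python
# Hedged sketch: mixing instruction–response data into the pre-training stream
# for continuous (instruction) pre-training, starting from an LLM-jp-3 checkpoint.
# "pretrain_corpus.jsonl" and "instruction_corpus.jsonl" are hypothetical
# placeholders; the actual instruction–response dataset is not yet released.
from datasets import interleave_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "llm-jp/llm-jp-3-13b"  # one of the checkpoints listed above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

pretrain = load_dataset("json", data_files="pretrain_corpus.jsonl", split="train", streaming=True)
instruct = load_dataset("json", data_files="instruction_corpus.jsonl", split="train", streaming=True)

# ~90B of ~400B tokens is instruction data (~22.5%); sampling at the example
# level only approximates that token-level ratio.
mixed = interleave_datasets([pretrain, instruct], probabilities=[0.775, 0.225], seed=42)

def tokenize(example):
    # Assumes each record carries a single "text" field (a document, or an
    # instruction–response pair already rendered as plain text).
    return tokenizer(example["text"])

mixed = mixed.map(tokenize, remove_columns=["text"])
# `mixed` can then drive a standard causal-LM training loop (packing, Trainer, etc.).
```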
@@ -156,7 +156,7 @@ For Direct Preference Optimization (DPO), we adopted rejection sampling.
 Prompts were sampled from the dataset used in SFT, and multiple responses were generated for each prompt.
 These responses were then scored (by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)), and DPO was performed by treating high-scoring responses as positive examples and low-scoring responses as negative examples.
 
-
+We conducted DPO in two stages.
 In the second stage, we additionally used [ac-self-inst](https://huggingface.co/datasets/llm-jp/ac-self-inst), a Japanese preference dataset focused on safety.
 
 
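As a rough illustration of the rejection-sampling step in the DPO hunk above, the sketch below builds (chosen, rejected) pairs by generating several responses per prompt and ranking them with a judge model. The `generate_fn` and `score_fn` callables, the sample count, and the tie handling are assumptions for illustration; in the setup described here the judge would wrap [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).

```python
# Hedged sketch: rejection sampling to build DPO preference pairs.
# generate_fn and score_fn are hypothetical callables wrapping, respectively,
# the SFT model being aligned and a judge model such as Qwen/Qwen2.5-32B-Instruct.
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # highest-scoring response
    rejected: str  # lowest-scoring response


def build_dpo_pairs(
    prompts: Iterable[str],
    generate_fn: Callable[[str, int], List[str]],
    score_fn: Callable[[str, str], float],
    n_samples: int = 8,
) -> List[PreferencePair]:
    """For each prompt, sample n responses, score them, and keep the best/worst pair."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        responses = generate_fn(prompt, n_samples)
        scored = sorted((score_fn(prompt, r), r) for r in responses)
        (low_score, low_resp), (high_score, high_resp) = scored[0], scored[-1]
        if high_score > low_score:  # skip prompts where all responses tie
            pairs.append(PreferencePair(prompt=prompt, chosen=high_resp, rejected=low_resp))
    return pairs
```

The resulting prompt/chosen/rejected records can feed a standard DPO trainer (for example, `trl`'s `DPOTrainer`); in a two-stage setup like the one described, the second stage would additionally extend the pair set with the ac-self-inst preference data.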