Update README.md
README.md
CHANGED
@@ -119,10 +119,10 @@ The models have been pre-trained using a blend of the following datasets.
 
 ### Mid-training
 
-In the LLM-jp-3.1 series, we performed
+In the LLM-jp-3.1 series, we performed continuous pre-training based on [Instruction Pre-Training](https://aclanthology.org/2024.emnlp-main.148/).
 Instruction Pre-Training enhances a model’s ability to follow instructions by continuing pre-training on a large collection of instruction–response pairs.
-We prepared approximately 90B tokens of instruction–response data and mixed it with our pre-training datasets, conducting
-Each model was initialized from existing checkpoints ([llm-jp/llm-jp-3-1.8b](https://huggingface.co/llm-jp/llm-jp-3-1.8b), [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b), and [llm-jp/llm-jp-3-8x13b](https://huggingface.co/llm-jp/llm-jp-3-8x13b)) and underwent
+We prepared approximately 90B tokens of instruction–response data and mixed it with our pre-training datasets, conducting continuous pre-training on a total of 400B tokens.
+Each model was initialized from existing checkpoints ([llm-jp/llm-jp-3-1.8b](https://huggingface.co/llm-jp/llm-jp-3-1.8b), [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b), and [llm-jp/llm-jp-3-8x13b](https://huggingface.co/llm-jp/llm-jp-3-8x13b)) and underwent continuous instruction pre-training.
 Since the LLM-jp-3 series was originally pre-trained on 2.1T tokens, the total pre-training token count amounts to 2.5T tokens.
 
 Details of this training process will be released in a forthcoming paper. The instruction–response dataset used for this training will also be made publicly available.
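For a concrete picture of the blending step described in this hunk, here is a minimal sketch using the Hugging Face `datasets` and `transformers` libraries. The instruction–response corpus has not been released yet, so `pretrain_corpus.jsonl` and `instruction_corpus.jsonl` are hypothetical placeholder files, and the sampling probabilities only approximate the stated ~90B-of-400B token ratio at the example level rather than the token level.

```python
# Hedged sketch: mixing instruction–response data into the pre-training stream
# for continuous (instruction) pre-training, starting from an LLM-jp-3 checkpoint.
# "pretrain_corpus.jsonl" and "instruction_corpus.jsonl" are hypothetical
# placeholders; the actual instruction–response dataset is not yet released.
from datasets import interleave_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "llm-jp/llm-jp-3-13b"  # one of the checkpoints listed above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

pretrain = load_dataset("json", data_files="pretrain_corpus.jsonl", split="train", streaming=True)
instruct = load_dataset("json", data_files="instruction_corpus.jsonl", split="train", streaming=True)

# ~90B of ~400B tokens is instruction data (~22.5%); sampling at the example
# level only approximates that token-level ratio.
mixed = interleave_datasets([pretrain, instruct], probabilities=[0.775, 0.225], seed=42)

def tokenize(example):
    # Assumes each record carries a single "text" field (a document, or an
    # instruction–response pair already rendered as plain text).
    return tokenizer(example["text"])

mixed = mixed.map(tokenize, remove_columns=["text"])
# `mixed` can then drive a standard causal-LM training loop (packing, Trainer, etc.).
```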
@@ -156,7 +156,7 @@ For Direct Preference Optimization (DPO), we adopted rejection sampling.
 Prompts were sampled from the dataset used in SFT, and multiple responses were generated for each prompt.
 These responses were then scored (by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)), and DPO was performed by treating high-scoring responses as positive examples and low-scoring responses as negative examples.
 
-
+We conducted DPO in two stages.
 In the second stage, we additionally used [ac-self-inst](https://huggingface.co/datasets/llm-jp/ac-self-inst), a Japanese preference dataset focused on safety.
 
 
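As a rough illustration of the rejection-sampling step in the DPO hunk above, the sketch below builds (chosen, rejected) pairs by generating several responses per prompt and ranking them with a judge model. The `generate_fn` and `score_fn` callables, the sample count, and the tie handling are assumptions for illustration; in the setup described here the judge would wrap [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).

```python
# Hedged sketch: rejection sampling to build DPO preference pairs.
# generate_fn and score_fn are hypothetical callables wrapping, respectively,
# the SFT model being aligned and a judge model such as Qwen/Qwen2.5-32B-Instruct.
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # highest-scoring response
    rejected: str  # lowest-scoring response


def build_dpo_pairs(
    prompts: Iterable[str],
    generate_fn: Callable[[str, int], List[str]],
    score_fn: Callable[[str, str], float],
    n_samples: int = 8,
) -> List[PreferencePair]:
    """For each prompt, sample n responses, score them, and keep the best/worst pair."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        responses = generate_fn(prompt, n_samples)
        scored = sorted((score_fn(prompt, r), r) for r in responses)
        (low_score, low_resp), (high_score, high_resp) = scored[0], scored[-1]
        if high_score > low_score:  # skip prompts where all responses tie
            pairs.append(PreferencePair(prompt=prompt, chosen=high_resp, rejected=low_resp))
    return pairs
```

The resulting prompt/chosen/rejected records can feed a standard DPO trainer (for example, `trl`'s `DPOTrainer`); in a two-stage setup like the one described, the second stage would additionally extend the pair set with the ac-self-inst preference data.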