Taka008 committed · Commit 0e7a5ec · verified · 1 Parent(s): a63821c

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -119,10 +119,10 @@ The models have been pre-trained using a blend of the following datasets.
 
 ### Mid-training
 
-In the LLM-jp-3.1 series, we performed continued pre-training based on [Instruction Pre-Training](https://aclanthology.org/2024.emnlp-main.148/).
+In the LLM-jp-3.1 series, we performed continuous pre-training based on [Instruction Pre-Training](https://aclanthology.org/2024.emnlp-main.148/).
 Instruction Pre-Training enhances a model’s ability to follow instructions by continuing pre-training on a large collection of instruction–response pairs.
-We prepared approximately 90B tokens of instruction–response data and mixed it with our pre-training datasets, conducting continued pre-training on a total of 400B tokens.
-Each model was initialized from existing checkpoints ([llm-jp/llm-jp-3-1.8b](https://huggingface.co/llm-jp/llm-jp-3-1.8b), [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b), and [llm-jp/llm-jp-3-8x13b](https://huggingface.co/llm-jp/llm-jp-3-8x13b)) and underwent continued instruction pre-training.
+We prepared approximately 90B tokens of instruction–response data and mixed it with our pre-training datasets, conducting continuous pre-training on a total of 400B tokens.
+Each model was initialized from existing checkpoints ([llm-jp/llm-jp-3-1.8b](https://huggingface.co/llm-jp/llm-jp-3-1.8b), [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b), and [llm-jp/llm-jp-3-8x13b](https://huggingface.co/llm-jp/llm-jp-3-8x13b)) and underwent continuous instruction pre-training.
 Since the LLM-jp-3 series was originally pre-trained on 2.1T tokens, the total pre-training token count amounts to 2.5T tokens.
 
 Details of this training process will be released in a forthcoming paper. The instruction–response dataset used for this training will also be made publicly available.
@@ -156,7 +156,7 @@ For Direct Preference Optimization (DPO), we adopted rejection sampling.
 Prompts were sampled from the dataset used in SFT, and multiple responses were generated for each prompt.
 These responses were then scored (by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)), and DPO was performed by treating high-scoring responses as positive examples and low-scoring responses as negative examples.
 
-In the case of *instruct4*, DPO was conducted in two stages.
+We conducted DPO in two stages.
 In the second stage, we additionally used [ac-self-inst](https://huggingface.co/datasets/llm-jp/ac-self-inst), a Japanese preference dataset focused on safety.
 
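
To make the mid-training recipe in the first hunk above more concrete, here is a minimal, hypothetical sketch of Instruction Pre-Training style data mixing: instruction–response pairs are rendered as ordinary text documents and blended into the raw pre-training corpus before pre-training continues. The template, helper names, and document-level mixing below are illustrative assumptions, not the actual LLM-jp-3.1 pipeline, which mixed roughly 90B instruction tokens into a 400B-token mid-training budget.

```python
import random

# A minimal, hypothetical sketch of Instruction Pre-Training style data mixing.
# Instruction–response pairs are rendered as plain-text documents and shuffled
# into the raw pre-training corpus; the template and helper names are
# assumptions, not the actual LLM-jp-3.1 pipeline.

def render_pair(instruction: str, response: str) -> str:
    """Render one instruction–response pair as a plain-text training document."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}\n"

def mix_corpora(pretrain_docs: list[str],
                instruction_pairs: list[tuple[str, str]],
                seed: int = 0) -> list[str]:
    """Blend rendered instruction documents with raw pre-training documents.

    In the recipe above, instruction data (~90B tokens) accounts for roughly
    90/400 of the mid-training token budget; here the blend is done at the
    document level purely for illustration.
    """
    rng = random.Random(seed)
    mixed = list(pretrain_docs) + [render_pair(i, r) for i, r in instruction_pairs]
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    raw_docs = ["Tokyo is the capital of Japan.", "Transformers use self-attention."]
    pairs = [("Summarize: Tokyo is the capital of Japan.",
              "Tokyo is Japan's capital city.")]
    for doc in mix_corpora(raw_docs, pairs):
        print(doc, "---", sep="\n")
```

In a real pipeline the blend would be controlled at the token level and streamed, but the core of the recipe as described in the README is the same: render the pairs as plain text, then interleave them with the pre-training data.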
 
 
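The second hunk describes DPO data built by rejection sampling. As a rough illustration, the sketch below pairs the highest- and lowest-scoring of several sampled responses per prompt into chosen/rejected examples. The scoring interface and all names are placeholders (the README states that scoring used Qwen/Qwen2.5-32B-Instruct as a judge); the two-stage detail is noted only in a comment.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of turning rejection-sampling outputs into DPO pairs.
# For each SFT prompt, several responses are generated and scored by a judge
# model (Qwen/Qwen2.5-32B-Instruct in the README above); the highest-scoring
# response becomes "chosen" and the lowest-scoring one "rejected".
# Function names and the scoring interface are illustrative assumptions.

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def build_dpo_pairs(samples: dict[str, list[str]],
                    score_fn: Callable[[str, str], float]) -> list[PreferencePair]:
    """samples maps each prompt to its candidate responses; score_fn returns a scalar."""
    pairs = []
    for prompt, responses in samples.items():
        if len(responses) < 2:
            continue  # need at least two candidates to form a preference pair
        ranked = sorted(responses, key=lambda r: score_fn(prompt, r))
        pairs.append(PreferencePair(prompt=prompt, chosen=ranked[-1], rejected=ranked[0]))
    return pairs

if __name__ == "__main__":
    # Toy judge: prefer longer answers (a stand-in for an LLM-as-judge score).
    toy_score = lambda prompt, response: float(len(response))
    toy_samples = {"What is DPO?": ["A method that tunes a model on preference pairs.",
                                    "Dunno."]}
    for pair in build_dpo_pairs(toy_samples, toy_score):
        print(pair)
    # In the two-stage setup described above, a second DPO round would add
    # preference pairs drawn from the ac-self-inst safety dataset.
```

Pairing the best response against the worst maximizes the score gap within each prompt; other pairing schemes are equally compatible with DPO, and the README does not specify which was used.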