Update README.md
README.md CHANGED
@@ -16,6 +16,7 @@ This model is a model that performed continued pre-training and fine-tuning (ins
### DUS (Depth Up-Scaling) and continued pre-training

Following the methodology disclosed in the paper, we expanded the model from 32 transformer blocks to 48 blocks and then continued pre-training on a public dataset. Pre-training ran for 3 days on 4 `ml.g5.48xlarge` instances from AWS (32 NVIDIA A10G GPUs in total). For pre-training, we used a sampled subset of Wikipedia.
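As an illustration of the depth up-scaling step, the sketch below assembles a 48-layer checkpoint from a 32-layer Llama-style base by stacking its first 24 decoder blocks on top of a copy of its last 24, in the spirit of the DUS recipe. The base checkpoint name, the 24 + 24 split, and the output path are assumptions for the example, not the exact configuration used for this model.

```python
# Minimal sketch of Depth Up-Scaling (DUS), assuming a 32-layer Llama-style
# base whose decoder blocks live in `model.model.layers`. The checkpoint name,
# the 24 + 24 layer split, and the output path are placeholders, not the exact
# recipe used for this model.
import copy

import torch
from transformers import AutoModelForCausalLM

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # hypothetical 32-layer base checkpoint

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)
n_layers = model.config.num_hidden_layers  # 32
n_keep = 24                                # keep the first/last 24 blocks

# Stack the first 24 blocks on top of a copy of the last 24 blocks -> 48 blocks.
# The overlapping middle blocks are deep-copied so they no longer share weights.
upscaled_layers = torch.nn.ModuleList(
    [model.model.layers[i] for i in range(n_keep)]
    + [copy.deepcopy(model.model.layers[i]) for i in range(n_layers - n_keep, n_layers)]
)
model.model.layers = upscaled_layers
model.config.num_hidden_layers = len(upscaled_layers)  # 48

# Save the up-scaled checkpoint; continued pre-training starts from this model.
model.save_pretrained("dus-48-layer-base")
```

Continued pre-training then proceeds from this 48-layer checkpoint rather than from the original base.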
+ Note that performance is not guaranteed, since only a small number of datasets were used for this experiment; the training set contains only around 1.5 million samples after tokenization.

For distributed training, all weights were trained fully (no adapter techniques), and sharded data parallelism was handled with ZeRO-2. The presets are as follows.

```json