chaoscodes committed (verified)
Commit 678bce2 · 1 Parent(s): a7b785a

Update README.md

Files changed (1)
  1. README.md +45 -14
README.md CHANGED
@@ -4,12 +4,8 @@ datasets:
 - cerebras/SlimPajama-627B
 language:
 - en
-
 ---
-
 <div align="center">
-
-
 # TinyLlama-1.1B-v1.1

 </div>
@@ -21,21 +17,18 @@ https://github.com/jzhang38/TinyLlama
 <img src="https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b/resolve/main/TinyLlama_logo.png" width="300"/>
 </div>

-
 We adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged into many open-source projects built upon Llama. Besides, TinyLlama is compact with only 1.1B parameters, so it can serve a multitude of applications that demand a restricted computation and memory footprint.

- ### Overview
+ ## Overview

 In this project, rather than training only a single TinyLlama model, we first train TinyLlama on a corpus of 1.5 trillion tokens to obtain foundational language capabilities. We then turn this model into three different models by continual pre-training with three distinct data sampling schemes. For a visual representation of this process, please refer to the figure below.

- ![image-20240401225128124](overview.png)
+ ![Overview](overview.png)

- ### Pretraining
+ ## Pretraining

 Due to these issues ([bug1](https://whimsical-aphid-86d.notion.site/Release-of-TinyLlama-1-5T-Checkpoints-Postponed-01b266998c1c47f78f5ae1520196d194?pvs=4), [bug2](https://whimsical-aphid-86d.notion.site/2023-12-18-Updates-from-TinyLlama-Team-7d30c01fff794da28ccc952f327c8d4f)), we retrained TinyLlama to provide a better model. We trained the model on 2T tokens and divided the pretraining into three stages: 1) basic pretraining, 2) continual pretraining with specific domain data, and 3) cooldown.

-
-
 #### Basic pretraining

 In this initial phase, we trained the model on SlimPajama only to develop its commonsense reasoning capabilities. The model was trained on 1.5T tokens during this basic pretraining period. Since we used a cluster with 4 A100-40G GPUs per node and only shard model weights within a node, we could only set the batch size to approximately 1.8M tokens this time.
@@ -58,13 +51,52 @@ Following an extensive and detailed pretraining process. We are now releasing th
 2. **TinyLlama_v1.1_math_code**: Equipped with better ability for math and code.
 3. **TinyLlama_v1.1_chinese**: Good understanding capacity for Chinese.

-
+ ## Data
+
+ Here we list our data distribution in each stage:
+
+ ### TinyLlama_v1.1
+
+ | Corpus | Basic pretraining | Continual pretraining with specific domain | Cooldown |
+ | ------------- | ----------------- | ------------------------------------------ | -------- |
+ | RedPajamaBook | 5.4 | 5.4 | 5.4 |
+ | C4 | 35.0 | 35.0 | 35.0 |
+ | CommonCrawl | 70.1 | 70.1 | 70.1 |
+ | Github | 6.5 | 6.5 | 6.5 |
+ | StackExchange | 4.2 | 4.2 | 4.2 |
+ | ArXiv | 5.7 | 5.7 | 5.7 |
+ | Wikipedia | 4.5 | 4.5 | 4.5 |
+
+ ### TinyLlama_v1.1_math_code
+
+ | Corpus | Basic pretraining | Continual pretraining with specific domain | Cooldown |
+ | ------------- | ----------------- | ------------------------------------------ | -------- |
+ | RedPajamaBook | 5.4 | - | - |
+ | C4 | 35.0 | 21.6 | 21.6 |
+ | CommonCrawl | 70.1 | 43.0 | 43.0 |
+ | Github | 6.5 | - | - |
+ | StackExchange | 4.2 | 2.6 | 2.6 |
+ | ArXiv | 5.7 | 5.0 | 5.0 |
+ | Wikipedia | 4.5 | 2.8 | 2.8 |
+ | starcoder | - | 15.0 | 15.0 |
+ | proof_pile | - | 10.0 | 10.0 |
+
+ ### TinyLlama_v1.1_chinese
+
+ | Corpus | Basic pretraining | Continual pretraining with specific domain | Cooldown |
+ | ------------- | ----------------- | ------------------------------------------ | -------- |
+ | RedPajamaBook | 5.4 | - | - |
+ | C4 | 35.0 | 14.6 | 14.6 |
+ | CommonCrawl | 70.1 | 29.3 | 29.3 |
+ | Github | 6.5 | - | - |
+ | StackExchange | 4.2 | 1.8 | 1.8 |
+ | ArXiv | 5.7 | 2.4 | 2.4 |
+ | Wikipedia | 4.5 | 1.9 | 1.9 |
+ | skypile | - | 50.0 | 50.0 |

 ### How to use
-
 You will need transformers>=4.31.
 Do check the [TinyLlama](https://github.com/jzhang38/TinyLlama) GitHub page for more information.
-
 ```
 from transformers import AutoTokenizer
 import transformers
@@ -92,7 +124,6 @@ for seq in sequences:
 ```

 ### Eval
-
 | Model | Pretrain Tokens | HellaSwag | Obqa | WinoGrande | ARC_c | ARC_e | boolq | piqa | avg |
 | ----------------------------------------- | --------------- | --------- | --------- | ---------- | --------- | --------- | ----- | --------- | --------- |
 | Pythia-1.0B | 300B | 47.16 | 31.40 | 53.43 | 27.05 | 48.99 | 60.83 | 69.21 | 48.30 |
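
Note: the diff above shows only the changed hunks, so the usage snippet appears truncated (the opening imports and the closing `for seq in sequences:` loop are visible, but the middle is elided). The sketch below is a minimal, self-contained reconstruction of that pattern, not the verbatim README code: the model id `TinyLlama/TinyLlama_v1.1`, the prompt, and the generation settings are illustrative assumptions, and it relies only on the standard `transformers` text-generation pipeline (transformers>=4.31, as the README requires).

```
from transformers import AutoTokenizer
import transformers
import torch

# Assumed model id for the base v1.1 checkpoint; swap in the math_code or
# chinese variant as needed.
model = "TinyLlama/TinyLlama_v1.1"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Illustrative prompt and sampling settings (not taken from the README).
sequences = pipeline(
    "The TinyLlama project aims to pretrain a 1.1B Llama model on trillions of tokens.",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    max_new_tokens=200,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```

Because TinyLlama reuses the Llama 2 architecture and tokenizer, the same checkpoint also loads through the generic `AutoTokenizer`/`AutoModelForCausalLM` classes, which is what makes it a drop-in option for Llama-based projects.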