Removing redundant text from README
README.md
CHANGED
@@ -22,7 +22,7 @@ For full documentation and code, visit our two main repositories:
 
 This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the code to train and analyze your own model suites from scratch.
 
-> Pro Tip
+> Pro Tip 💡:
 > To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.
 
 ---
@@ -37,7 +37,7 @@ Our complete suite of models from 10M to 500M parameters trained with Pico:
 - [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (100M parameters)
 - [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (500M parameters)
 
-> 🚧 **Coming Soon!** **pico-decoder-xl** (1B parameters) Watch this space or star our [GitHub repository](https://github.com/
+> 🚧 **Coming Soon!** **pico-decoder-xl** (1B parameters) Watch this space or star our [GitHub repository](https://github.com/pico-lm) for updates!
 
 All models are trained for 50,000 steps on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
 
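The decoder suite listed in the hunk above lives on the Hugging Face Hub, so a checkpoint can in principle be pulled directly by repo id. The snippet below is a minimal sketch, not the project's documented API: it assumes the checkpoints expose a `transformers`-compatible config, and the `trust_remote_code=True` flag only matters if the architecture ships as custom code. Consult the model cards for the officially supported loading path and for how intermediate checkpoints are named.

```python
# Hedged sketch: load one of the pico-decoder checkpoints listed above.
# Assumes the Hub repo is transformers-compatible; check the model card
# for the officially supported loading instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "pico-lm/pico-decoder-medium"  # the 100M-parameter model from the suite

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,  # only needed if the architecture is distributed as custom code
)

prompt = "Small language models are useful because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```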
@@ -61,14 +61,6 @@ We visualize the learning process in our **[Wandb](https://wandb.ai/pico-lm/pico
 | **Precision** | Mixed precision training |
 | **Vocabulary Size** | 50,280 |
 
-
-In each model repository, we version control checkpoints every 1000 steps that contain:
-- Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
-- Model activations and gradients
-- The batch of training data observed at the given training step
-
-
-
 ### **2. Datasets**
 1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
 - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus