Removing redundant text from README
README.md
CHANGED
@@ -22,7 +22,7 @@ For full documentation and code, visit our two main repositories:
 
 This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the code to train and analyze your own model suites from scratch.
 
-> Pro Tip
+> Pro Tip 💡:
 > To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.
 
 ---
@@ -37,7 +37,7 @@ Our complete suite of models from 10M to 500M parameters trained with Pico:
 - [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (100M parameters)
 - [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (500M parameters)
 
-> 🚧 **Coming Soon!** **pico-decoder-xl** (1B parameters) Watch this space or star our [GitHub repository](https://github.com/
+> 🚧 **Coming Soon!** **pico-decoder-xl** (1B parameters) Watch this space or star our [GitHub repository](https://github.com/pico-lm) for updates!
 
 All models are trained for 50,000 steps on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
 
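The decoder suite listed in the hunk above lives on the Hugging Face Hub, so a checkpoint can in principle be pulled directly by repo id. The snippet below is a minimal sketch, not the project's documented API: it assumes the checkpoints expose a `transformers`-compatible config, and the `trust_remote_code=True` flag only matters if the architecture ships as custom code. Consult the model cards for the officially supported loading path and for how intermediate checkpoints are named.

```python
# Hedged sketch: load one of the pico-decoder checkpoints listed above.
# Assumes the Hub repo is transformers-compatible; check the model card
# for the officially supported loading instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "pico-lm/pico-decoder-medium"  # the 100M-parameter model from the suite

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,  # only needed if the architecture is distributed as custom code
)

prompt = "Small language models are useful because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```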
@@ -61,14 +61,6 @@ We visualize the learning process in our **[Wandb](https://wandb.ai/pico-lm/pico
 | **Precision** | Mixed precision training |
 | **Vocabulary Size** | 50,280 |
 
-
-In each model repository, we version control checkpoints every 1000 steps that contain:
-- Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
-- Model activations and gradients
-- The batch of training data observed at the given training step
-
-
-
 ### **2. Datasets**
 1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
 - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus