rdiehlmartinez committed
Commit a547025 · verified · 1 Parent(s): c59c115

Update README.md

Files changed (1)
  1. README.md +31 -18
README.md CHANGED
@@ -9,41 +9,51 @@ pinned: false
  # Pico: A Lightweight Framework for Studying Learning Dynamics

- Pico is a lightweight research framework that demystifies how language models learn. Built with simplicity in mind, it provides an efficient way to train and study models of different sizes. Visit our [website](https://www.picolm.io/) for more information.

  Pico consists of two key components:
- 1. **Pre-trained Model Suite** (hosted here on HuggingFace)
- 2. **Training Framework** (available on [GitHub](https://github.com/rdiehlmartinez/pico))

- This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.

  ## πŸ€— HuggingFace Resources (You Are Here)

- > 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!
-
- ### Pre-trained Model Suite (Releasing January 2025)
- Our complete suite of models from 1M to 1B parameters:
  - **pico-tiny** (1M parameters)
  - **pico-small** (10M parameters)
  - **pico-medium** (100M parameters)
  - **pico-large** (500M parameters)
- - **pico-xl** (1B parameters)

  Each model includes:
- - Complete training checkpoints
- - Saved activations and gradients
- - Pre-computed evaluation perplexity scores
  ### Available Datasets
  1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
- - 420B tokens of pre-processed text
- - Cleaned and shuffled DOLMA corpus

  2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
- - Smaller version for quick experiments

- 3. **[pretokenized-eval-batch](https://huggingface.co/datasets/pico-lm/pretokenized-eval-batch)**
- - Batch of eval data for generating model activations

  ## πŸ”§ GitHub Training Framework
@@ -91,7 +101,7 @@ python train.py --config configs/train.yaml
  ## πŸ“Š Model Details

  ### Architecture
- All models (both pre-trained and self-trained) use:
  - LLAMA-style transformer
  - RMSNorm for normalization
  - RoPE positional embeddings
@@ -100,11 +110,14 @@ All models (both pre-trained and self-trained) use:
  ### Training Configuration
  Standard configuration (customizable in GitHub training):
  - Batch size: 1024
  - Learning rate: 1e-3
  - Weight decay: 0.1
  - Gradient clipping: 1.0
  - Mixed precision training

  ## πŸ”¬ Research Applications
  # Pico: A Lightweight Framework for Studying Learning Dynamics

+ Pico is a lightweight research framework that aims to demystify how language models learn. Built with simplicity in mind, it provides an efficient way to train and study models of different sizes. Visit our [website](https://www.picolm.io/) for more information.

  Pico consists of two key components:
+ 1. **Pico Training Framework** (available on [GitHub](https://github.com/pico-lm/pico)): A transparent, lightweight codebase for training language models. We use this framework to train a suite of language models across scales, which we release in this HuggingFace organization.
+ 2. **Pico Analysis Framework** (available on [GitHub](https://github.com/pico-lm/pico-analysis)): A research framework for investigating and probing the learning dynamics of models trained with Pico.

+ This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.

  ## πŸ€— HuggingFace Resources (You Are Here)

+ ### Pre-trained Model Suite
+ Our complete suite of models from 1M to 500M parameters trained with Pico:
  - **pico-tiny** (1M parameters)
  - **pico-small** (10M parameters)
  - **pico-medium** (100M parameters)
  - **pico-large** (500M parameters)
+
+ > 🚧 **Coming Soon!** **pico-xl** (1B parameters). Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!
+
+ All models are trained for 50,000 steps on the **pretokenized-dolma** dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.

  Each model includes:
+ - Rich training checkpoints (stored every 1,000 steps) that contain:
+   - Weights and optimizer states (HuggingFace- and Lightning Fabric-compatible versions; see the loading sketch below)
+   - Model activations and gradients
+   - The batch of training data observed at the given training step
+   - Wandb logs tracking the learning process
+   - Pre-computed perplexity scores on the Paloma evaluation set
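Because the checkpoints are released in a HuggingFace-compatible format, they can be loaded with the standard `transformers` API. The sketch below is illustrative only: the `pico-lm/pico-tiny` repository id is inferred from the model names listed above, and the `step-1000` revision naming for intermediate checkpoints is an assumption, so check the individual model cards for the exact ids and checkpoint layout.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id (model names above, under the pico-lm organization) and an
# assumed revision naming scheme for the per-step checkpoints -- verify on the model card.
repo_id = "pico-lm/pico-tiny"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    revision="step-1000",  # hypothetical branch/tag for the 1,000-step checkpoint
    # trust_remote_code=True,  # may be required if the architecture ships as custom code
)

# Quick sanity check: generate a short continuation from the loaded checkpoint.
inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```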
  ### Available Datasets
  1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
+ - 420B tokens of pre-processed, tokenized, and shuffled text extracted from the [DOLMA](https://allenai.org/dolma) corpus
+ - We use this dataset to train our model suite

  2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
+ - A smaller version of the **pretokenized-dolma** corpus for quick experiments
+
+ 3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
+ - A tokenized and shuffled version of the [Paloma](https://allenai.org/evaluation-frameworks) evaluation corpus
+ - The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
+ - We use this corpus to evaluate the perplexity of our models
+
+ 4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
+ - A sub-sampled version of the **pretokenized-paloma** corpus
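To show how the corpora listed above can be consumed, here is a minimal sketch that streams the pre-tokenized training set with the `datasets` library. The split name and column names are assumptions, so inspect the dataset card and schema before building on them.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading all 420B tokens up front.
train_stream = load_dataset(
    "pico-lm/pretokenized-dolma",
    split="train",       # assumed split name; check the dataset card
    streaming=True,
)

# Peek at a single pre-tokenized example; column names (e.g. "input_ids") are an
# assumption, so inspect the keys before writing a training loop around them.
first_example = next(iter(train_stream))
print(list(first_example.keys()))
```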
  ## πŸ”§ GitHub Training Framework
  ## πŸ“Š Model Details

  ### Architecture
+ All models use:
  - LLAMA-style transformer
  - RMSNorm for normalization (see the sketch after this list)
  - RoPE positional embeddings
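To make the normalization component concrete, below is a minimal, self-contained PyTorch sketch of RMSNorm as used in LLaMA-style transformers. It is illustrative only; Pico's actual implementation lives in the GitHub training framework and may differ in details.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescale by 1 / RMS(x), then apply a learned gain."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each feature vector by its root mean square (no mean subtraction,
        # unlike LayerNorm), then scale by the learned per-dimension weight.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```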
 
  ### Training Configuration
  Standard configuration (customizable in GitHub training; sketched as a config dict after this list):
+ - Sequence length: 2048
  - Batch size: 1024
  - Learning rate: 1e-3
  - Weight decay: 0.1
  - Gradient clipping: 1.0
  - Mixed precision training
+ - Vocab size: 50280 (using the [OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json))
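For illustration, these defaults can be gathered into a single configuration object. The key names below are assumptions chosen for readability; the authoritative settings live in `configs/train.yaml` in the Pico training framework on GitHub.

```python
# Illustrative sketch of the standard training configuration listed above.
# Key names are assumptions; the canonical values live in configs/train.yaml.
TRAIN_CONFIG = {
    "sequence_length": 2048,
    "batch_size": 1024,
    "learning_rate": 1e-3,
    "weight_decay": 0.1,
    "gradient_clip_norm": 1.0,
    "mixed_precision": True,   # the exact precision mode is not specified above
    "vocab_size": 50280,       # OLMo tokenizer
    "training_steps": 50_000,  # from the model suite description above
}
```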
  ## πŸ”¬ Research Applications