Spaces: pico-lm / README
rdiehlmartinez committed on Mar 17

Commit 2e0d2c9 (verified) · 1 Parent(s): ae00997

Updating README


Files changed (1)

README.md +58 -104


@@ -4,41 +4,72 @@ emoji: 📈
 colorFrom: red
 colorTo: yellow
 sdk: static
-pinned: false
 ---
-# Pico: A Lightweight Framework for Studying Learning Dynamics
-Pico is a lightweight research framework that aims to demystify how language models learn. Built with simplicity in mind, it provides an efficient way to train and study models of different sizes. Visit our [website](https://www.picolm.io/) for more information.
-Pico consists of two key components:
-1. **Pico Training Framework** (available on [GitHub](https://github.com/pico-lm/pico)): A transparent, lightweight codebase for training language models. We use this framework to train a series of language models across scale that we release on this HuggingFace space.
-1. **Pico Analysis Framework** (available on [GitHub](https://github.com/pico-lm/pico-analysis)): A research framework to investigate and probe the learning dynamics of models trained using Pico.
 This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the code to train and analyze your own model suites from scratch.
 ## 🤗 HuggingFace Resources (You Are Here)
-### Pre-trained Model Suite
-Our complete suite of models from 1M to 500M parameters trained with Pico:
-- **pico-tiny** (1M parameters)
-- **pico-small** (10M parameters)
-- **pico-medium** (100M parameters)
-- **pico-large** (500M parameters)
-> 🚧 **Coming Soon!** **pico-xl** (1B parameters) Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!
-All models are trained for 50,000 steps on the **pretokenized-dolma** dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
-Each model includes:
-- Advanced training checkpoints (stored every 1,000 steps) that contain:
   - Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
   - Model activations and gradients
   - The batch of training data observed at the given training step
-- Wandb logs tracking the learning process
-- Pre-computed perplexity scores on the paloma evaluation set
-### Available Datasets
 1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
    - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
    - We use this dataset to train our model suite
@@ -56,96 +87,19 @@ Each model includes:
 All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**
-## 🔧 GitHub Training Framework
-Want to train your own suite of models? Visit our [GitHub repository](https://github.com/rdiehlmartinez/pico) to:
-- Train models with custom architectures
-- Experiment with different training regimes
-- Modify checkpoint saving behavior
-- Implement custom evaluation metrics
-The training framework makes it easy to:
-1. Train multiple models of different sizes
-2. Ensure consistent training across all models
-3. Save rich checkpoint data for learning dynamics analysis
-4. Compare learning dynamics across scales
-## 🛠️ Using the Resources
-### Using Pre-trained Models (HuggingFace)
-```python
-from transformers import AutoModelForCausalLM
-# Load our pre-trained model
-model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")
-# Access specific checkpoint
-model = AutoModelForCausalLM.from_pretrained(
-    "pico-lm/pico-small",
-    revision="step-xyz"
-)
-```
-### Training Your Own Suite (GitHub)
-```bash
-# Clone the repository
-git clone https://github.com/rdiehlmartinez/pico.git && cd pico
-source setup.sh
-# Configure your model suite
-# Edit configs/train.yaml to specify model sizes and training parameters
-# Train your suite
-python train.py --config configs/train.yaml
-```
-## 📊 Model Details
-### Architecture
-All models use:
-- LLAMA-style transformer
-- RMSNorm for normalization
-- RoPE positional embeddings
-- Multi-head attention with KV-cache
-- SwiGLU activation function
-### Training Configuration
-Standard configuration (customizable in GitHub training):
-- Sequence length: 2048
-- Batch size: 1024
-- Learning rate: 1e-3
-- Weight decay: 0.1
-- Gradient clipping: 1.0
-- Mixed precision training
-- Vocab size: 50280
-## 🔬 Research Applications
-Perfect for researchers studying:
-- Learning dynamics across model scales
-- Mechanistic interpretability
-- Architecture and training effects
-- Emergent model behaviors
-Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.
-## 🤝 Contributing
-Contributions welcome on both platforms:
-- **HuggingFace**: Model weights, datasets, and evaluation results
-- **GitHub**: Training framework improvements, analysis tools, and documentation
-## 📫 Contact
-- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
-- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)
 ## 🔍 Citation
 ```bibtex
 @software{pico2024,
     author = {Diehl Martinez, Richard},
-    title = {Pico: Framework for Training Tiny Language Models},
     year = {2024},
 }
-```

 colorFrom: red
 colorTo: yellow
 sdk: static
+pinned: true
+thumbnail: >-
+  https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png
 ---
+# **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics**
+Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by:
+1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase.
+2. **Analyzing** these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.
+For full documentation and code, visit our two main repositories:
+- [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models.
+- [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints.
 This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the code to train and analyze your own model suites from scratch.
+> Pro Tip 🚀:
+> To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.
+---
 ## 🤗 HuggingFace Resources (You Are Here)
+### **1. Pre-trained Model Suite**
+Our complete suite of models from 1M to 500M parameters trained with Pico:
+- [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (1M parameters)
+- [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (10M parameters)
+- [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (100M parameters)
+- [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (500M parameters)
+> 🚧 **Coming Soon!** **pico-decoder-xl** (1B parameters) Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!
+All models are trained for 50,000 steps on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
+In each model repository, we version control checkpoints every 1000 steps that contain:
+  - Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
+  - Model activations and gradients
+  - The batch of training data observed at the given training step
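A minimal sketch of how one of these version-controlled checkpoints might be loaded. The `revision` argument of `from_pretrained` selects a branch, tag, or commit of a Hub repository. The `step-<N>` branch naming used here is an assumption (the earlier README only showed a placeholder, `step-xyz`), and `checkpoint_steps`, `revision_for`, and `load_checkpoint` are hypothetical helper names; check the branch list of the model repository for the actual revision names.

```python
from typing import List

TOTAL_STEPS = 50_000      # README: models are trained for 50,000 steps
CHECKPOINT_EVERY = 1_000  # README: checkpoints are stored every 1000 steps


def checkpoint_steps(total: int = TOTAL_STEPS, every: int = CHECKPOINT_EVERY) -> List[int]:
    """Steps at which a checkpoint is expected (a step-0 checkpoint is not assumed)."""
    return list(range(every, total + 1, every))


def revision_for(step: int) -> str:
    """Hypothetical branch name for the checkpoint at `step`; the real scheme may differ."""
    if step not in checkpoint_steps():
        raise ValueError(f"no checkpoint expected at step {step}")
    return f"step-{step}"


def load_checkpoint(repo_id: str, step: int):
    """Load the weights saved at `step` from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM

    # `revision` accepts a branch name, tag, or commit hash.
    # Repositories with custom model code may additionally need trust_remote_code=True.
    return AutoModelForCausalLM.from_pretrained(repo_id, revision=revision_for(step))
```

With these helpers, `load_checkpoint("pico-lm/pico-decoder-small", step=10_000)` would request the `step-10000` revision, provided that branch exists under that name.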
+We visualize the learning process in our **[Wandb report](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)**.
+📊 Model Details
+| **Aspect**              | **Details**                                                                                                                                                                               |
+|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **Architecture**        | - Llama-style transformer (decoder-only)<br>- RMSNorm normalization<br>- RoPE (Rotary Positional Embeddings)<br>- Multi-head attention with KV-cache<br>- SwiGLU activation function    |
+| **Sequence Length**     | 2048                                                                                                                                                                                      |
+| **Batch Size**          | 1024                                                                                                                                                                                      |
+| **Optimizer**           | AdamW                                                                                                                                                                                     |
+| **Learning Rate**       | 3e-4 (one-cycle warmup)                                                                                                                                                                   |
+| **Gradient Clipping**   | 1.0                                                                                                                                                                                       |
+| **Precision**           | Mixed precision training                                                                                                                                                                   |
+| **Vocabulary Size**     | 50,280                                                                                                                                                                                    |
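The table values above imply a token budget for training, which can be worked out as a rough back-of-the-envelope calculation. This sketch assumes the batch size counts full-length sequences per optimizer step, which the README does not state explicitly.

```python
SEQUENCE_LENGTH = 2048   # tokens per sequence (Model Details table)
BATCH_SIZE = 1024        # assumed to count sequences per optimizer step
TRAINING_STEPS = 50_000  # training length stated in the README
DATASET_TOKENS = 420e9   # size of pretokenized-dolma stated in the README

# Tokens consumed by one optimizer step and by the whole run.
tokens_per_step = SEQUENCE_LENGTH * BATCH_SIZE
total_tokens = tokens_per_step * TRAINING_STEPS

# Share of the 420B-token dataset that a single run would see.
fraction_of_dataset = total_tokens / DATASET_TOKENS

print(f"tokens per step: {tokens_per_step:,}")
print(f"tokens over training: {total_tokens:,}")
print(f"fraction of dataset: {fraction_of_dataset:.1%}")
```

Under that assumption each step processes about 2.1M tokens, and a full run sees roughly 105B tokens, about a quarter of the 420B-token dataset.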
+### **2. Datasets**
 1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
    - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
    - We use this dataset to train our model suite
 All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**
+---
 ## 🔍 Citation
+If you use Pico in academic or professional work, please cite it:
 ```bibtex
 @software{pico2024,
     author = {Diehl Martinez, Richard},
+    title = {Pico: A Lightweight Framework for Studying Learning Dynamics in Language Models},
     year = {2024},
+    url = {https://github.com/pico-lm}
 }
+```
+**Thanks for checking out Pico!**
+Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome!