---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: true
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png
license: apache-2.0
---
# **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics**
Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by:
1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase.
2. **Analyzing** these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.
For full documentation and code, visit our two main repositories:
- [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models.
- [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints.
This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.
All code and artifacts are licensed under a permissive Apache-2.0 license.
> Pro Tip 🚀:
> To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.
---
## 🤗 HuggingFace Resources (You Are Here)
### **1. Pre-trained Model Suite**
Our complete suite of models trained with Pico, ranging from 11M to 570M parameters:
- [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (11M parameters)
- [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (65M parameters)
- [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (181M parameters)
- [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (570M parameters)
> 🚧 **Disclaimer:** These models are still under construction. The models released in this repository have been trained for 50,000 steps (corresponding to ~100B tokens); training will be complete after 200,000 steps.
>
> 🚧 **Coming Soon!** **pico-decoder-xl** (1B+ parameters). Watch this space or star our [GitHub repository](https://github.com/pico-lm) for updates!
All models are trained on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
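If you want to experiment with one of the released models directly, the snippet below is a minimal sketch of loading it through the standard `transformers` API. The `trust_remote_code=True` flag is an assumption: it is only needed if the checkpoint ships custom modeling code rather than mapping onto a stock Llama-style architecture, and the tokenizer files are assumed to be bundled with each model repository.

```python
# Minimal sketch (not the official loading recipe): pull a pico-decoder model
# from the Hub and generate a few tokens with it.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "pico-lm/pico-decoder-tiny"  # swap in -small, -medium, or -large

# trust_remote_code=True is an assumption -- only required if the repository
# ships custom modeling code instead of a stock architecture.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```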
In each model repository, we version-control a checkpoint every 1,000 training steps (a sketch for loading a specific checkpoint follows this list); each checkpoint contains:
- Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
- Model activations and gradients
- The batch of training data observed at the given training step
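Because each repository version-controls its checkpoints, an intermediate checkpoint can in principle be pulled with the `revision` argument of `from_pretrained`. The sketch below assumes checkpoints are exposed as branches or tags named by step; check the repository's branch list for the exact naming scheme.

```python
# Sketch: load an intermediate checkpoint by git revision.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-small",
    revision="step_50000",   # hypothetical branch/tag name -- confirm against the repository
    trust_remote_code=True,  # assumption, see the note above
)
```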
We visualize the learning process in our **[Wandb report](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)**.
Model Details:
| **Aspect** | **Details** |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Architecture** | - Llama-style transformer (decoder-only)<br>- RMSNorm normalization<br>- RoPE (Rotary Positional Embeddings)<br>- Multi-head attention with KV-cache<br>- SwiGLU activation function |
| **Sequence Length** | 2048 |
| **Batch Size** | 1024 |
| **Optimizer** | AdamW |
| **Learning Rate** | 3e-4 (one-cycle warmup) |
| **Gradient Clipping** | 1.0 |
| **Precision** | Mixed precision training |
| **Vocabulary Size** | 50,280 |
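As a concrete illustration of the SwiGLU activation listed above, here is a short PyTorch sketch of a Llama-style gated feed-forward block. It mirrors the general pattern rather than the exact pico-train implementation, and the dimension names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Illustrative SwiGLU feed-forward block, not the exact pico-train code."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)  # gating branch
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)  # project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate(x)) * up(x), then project back to the model dimension
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Example: a single forward pass through the block
block = SwiGLUFeedForward(d_model=768, d_hidden=2048)
hidden_states = torch.randn(1, 16, 768)  # (batch, sequence, d_model)
print(block(hidden_states).shape)        # torch.Size([1, 16, 768])
```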
### **2. Datasets**
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
- 420B tokens of pre-processed, tokenized, and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
- We use this dataset to train our model suite
2. **[pretokenized-dolma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tinsy)**
- A smaller version of the **pretokenized-dolma** corpus for quick experiments
3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
- A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus
- The Paloma corpus was carefully curated to be disjoint from the Dolma corpus, providing a held-out evaluation set
- We use this corpus to evaluate the perplexity of our models
4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
- A sub-sampled version of the **pretokenized-paloma** corpus
All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**.
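Below is a minimal sketch of streaming one of these datasets with the `datasets` library. Streaming avoids downloading the full 420B-token corpus; the split name `train` and the `input_ids` column name are assumptions about the pretokenized schema.

```python
# Sketch: stream the pretokenized training corpus instead of downloading it in full.
from datasets import load_dataset

dataset = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

# Inspect the first example to see the actual schema; "input_ids" is an
# assumed column name for the pretokenized sequences.
first_example = next(iter(dataset))
print(first_example.keys())
print(len(first_example.get("input_ids", [])))  # expected to match the 2048-token sequence length
```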
---
## 🔍 Citation
If you use Pico in academic or professional work, please cite it:
```bibtex
@software{pico2025,
author = {Diehl Martinez, Richard},
title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
year = {2025},
url = {https://github.com/pico-lm}
}
```
**Thanks for checking out Pico!**
Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue; contributions are welcome!