---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: true
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png
license: apache-2.0
---
# **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics**
Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by:
1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase.
2. **Analyzing** these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.
For full documentation and code, visit our two main repositories:
- [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models.
- [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints.
This Hugging Face organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.
All code and artifacts are released under the permissive Apache-2.0 license.
> Pro Tip 🚀 :
> To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.
---
## 🤗 HuggingFace Resources (You Are Here)
### **1. Pre-trained Model Suite**
Our complete suite of models trained with Pico, ranging from 11M to 570M parameters (a loading example follows below):
- [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (11M parameters)
- [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (65M parameters)
- [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (181M parameters)
- [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (570M parameters)
> 🚧 **Disclaimer:** These models are still under construction. The models released in this repository have been trained for 50,000 steps (corresponding to ~100B tokens); training will conclude at 200,000 steps.
>
> 🚧 **Coming Soon!** **pico-decoder-xl** (1B+ parameters). Watch this space or star our [GitHub repository](https://github.com/pico-lm) for updates!
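To try out one of these models, here is a minimal loading sketch using the `transformers` AutoModel API. Whether `trust_remote_code=True` is actually required, and whether the OLMo tokenizer repo is the right place to load a compatible tokenizer from, are assumptions not confirmed by this README; check the individual model cards.

```python
# Minimal sketch: loading a Pico checkpoint from the Hub with transformers.
# Assumptions: the repo is loadable via AutoModelForCausalLM and may need
# trust_remote_code=True; the OLMo tokenizer (used to build our datasets)
# is loaded here from the allenai/OLMo-7B-0724-hf repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-tiny",   # 11M-parameter model from the suite above
    trust_remote_code=True,        # assumption: custom decoder code ships with the repo
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")

inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)        # (batch, sequence_length, vocab_size=50280)
```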
All models are trained on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
In each model repository, we version-control checkpoints every 1,000 training steps (see the sketch after this list for fetching a specific revision). Each checkpoint contains:
- Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
- Model activations and gradients
- The batch of training data observed at the given training step
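Because checkpoints are versioned in the model repositories, a specific training step can be pinned via the Hub's revision mechanism. The sketch below assumes the checkpoints are exposed as Git branches that `huggingface_hub` can list; the exact branch naming scheme is not documented here, so inspect the printed names before pinning one.

```python
# Minimal sketch: discovering and loading an intermediate checkpoint.
# Assumption: per-step checkpoints are stored as branches in the model repo.
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM

refs = list_repo_refs("pico-lm/pico-decoder-tiny")
print([branch.name for branch in refs.branches])  # see which revisions exist

model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-tiny",
    revision=refs.branches[0].name,  # pin one of the listed checkpoint branches
    trust_remote_code=True,          # assumption, as above
)
```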
We visualize the learning process in our **[Wandb dashboard](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)**.
Model Details:
| **Aspect** | **Details** |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Architecture** | - Llama-style transformer (decoder-only)<br>- RMSNorm normalization<br>- RoPE (Rotary Positional Embeddings)<br>- Multi-head attention with KV-cache<br>- SwiGLU activation function |
| **Sequence Length** | 2048 |
| **Batch Size** | 1024 |
| **Optimizer** | AdamW |
| **Learning Rate** | 3e-4 (one-cycle warmup) |
| **Gradient Clipping** | 1.0 |
| **Precision** | Mixed precision training |
| **Vocabulary Size** | 50,280 |
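These hyperparameters also account for the token counts quoted above: each step processes 1024 sequences of 2048 tokens, i.e. roughly 2.1M tokens per step, so 50,000 steps corresponds to ~100B tokens and the full 200,000-step run will cover roughly the 420B-token **pretokenized-dolma** corpus. A quick sanity check in Python:

```python
# Back-of-the-envelope check of the token counts, using only numbers from the table.
batch_size = 1024        # sequences per optimizer step
sequence_length = 2048   # tokens per sequence

tokens_per_step = batch_size * sequence_length
print(f"{tokens_per_step:,} tokens per step")  # 2,097,152

for steps in (50_000, 200_000):
    print(f"{steps:,} steps -> ~{tokens_per_step * steps / 1e9:.0f}B tokens")
    # 50,000 steps  -> ~105B tokens (the "~100B" in the disclaimer above)
    # 200,000 steps -> ~419B tokens (roughly the 420B-token pretokenized-dolma corpus)
```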
### **2. Datasets**
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
- 420B tokens of pre-processed, tokenized, and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
- We use this dataset to train our model suite
2. **[pretokenized-dolma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tinsy)**
- A smaller version of the **pretokenized-dolma** corpus for quick experiments
3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
- A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus
- The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
- We use this corpus to evaluate the perplexity of our models
4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
- A sub-sampled version of the **pretokenized-paloma** corpus
All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**.
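Any of these datasets can be pulled with the `datasets` library; streaming avoids downloading a full corpus. The split name and column layout below are assumptions, so inspect the first record to confirm the actual schema.

```python
# Minimal sketch: streaming a pre-tokenized Pico dataset from the Hub.
# Assumptions: a "train" split exists and token ids live in some column
# (inspect the keys of the first record to find it).
from datasets import load_dataset

dataset = load_dataset(
    "pico-lm/pretokenized-dolma-tinsy",  # the small corpus, handy for quick experiments
    split="train",                       # assumption: split name
    streaming=True,                      # iterate without downloading everything
)

first_example = next(iter(dataset))
print(first_example.keys())              # inspect the actual column names
```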
---
## 🔍 Citation
If you use Pico in academic or professional work, please cite it:
```bibtex
@software{pico2025,
author = {Diehl Martinez, Richard},
title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
year = {2025},
url = {https://github.com/pico-lm}
}
```
**Thanks for checking out Pico!**
Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome!