|
--- |
|
title: README |
|
emoji: 📈 |
|
colorFrom: red |
|
colorTo: yellow |
|
sdk: static |
|
pinned: true |
|
thumbnail: >- |
|
https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png |
|
license: apache-2.0 |
|
--- |
|
|
|
# **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics** |
|
|
|
Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by: |
|
|
|
1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase. |
|
2. **Analyzing** these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics. |
|
|
|
For full documentation and code, visit our two main repositories: |
|
- [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models. |
|
- [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints. |
|
|
|
This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.
|
|
|
All code and artifacts are licensed under a permissive Apache-2.0 license. |
|
|
|
> Pro Tip 🚀 : |
|
> To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem. |
|
|
|
--- |
|
|
|
## 🤗 HuggingFace Resources (You Are Here) |
|
|
|
### **1. Pre-trained Model Suite** |
|
|
|
Our complete suite of models, ranging from 11M to 570M parameters, all trained with Pico:
|
- [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (11M parameters) |
|
- [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (65M parameters) |
|
- [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (181M parameters) |
|
- [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (570M parameters) |
|
|
|
> 🚧 **Disclaimer:** These models are still under construction. The models released here have been trained for 50,000 steps (roughly 100B tokens); training will complete at 200,000 steps.
|
> |
|
> 🚧 **Coming Soon!** **pico-decoder-xl** (1B+ parameters). Watch this space or star our [GitHub repository](https://github.com/pico-lm) for updates!
|
|
|
|
|
All models are trained on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
|
|
|
In each model repository, we version-control checkpoints every 1,000 steps; each checkpoint contains the following (see the loading example after this list):
|
- Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions) |
|
- Model activations and gradients |
|
- The batch of training data observed at the given training step |
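
For convenience, the released weights can be loaded through the standard `transformers` API. The sketch below is illustrative only: the per-step revision name and the need for `trust_remote_code` are assumptions, so consult each model card for the exact checkpoint layout.

```python
# Minimal sketch: loading a Pico model with the standard Hugging Face API.
# Assumptions: trust_remote_code may be required for the custom architecture,
# and intermediate checkpoints live under per-step revisions (e.g. "step-50000");
# check the model card of each repository for the actual layout.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "pico-lm/pico-decoder-tiny"

# Latest released weights from the main branch
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Hypothetical: a specific intermediate checkpoint stored under its own revision
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, revision="step-50000", trust_remote_code=True
# )

# The tokenizer may alternatively need to be loaded from the OLMo repository
tokenizer = AutoTokenizer.from_pretrained(model_name)
```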
|
|
|
We visualize the learning process in our **[Wandb report](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)**.
|
|
|
Model Details: |
|
|
|
| **Aspect** | **Details** | |
|
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
|
| **Architecture** | - Llama-style transformer (decoder-only)<br>- RMSNorm normalization<br>- RoPE (Rotary Positional Embeddings)<br>- Multi-head attention with KV-cache<br>- SwiGLU activation function | |
|
| **Sequence Length** | 2048 | |
|
| **Batch Size** | 1024 | |
|
| **Optimizer** | AdamW | |
|
| **Learning Rate** | 3e-4 (one-cycle warmup) | |
|
| **Gradient Clipping** | 1.0 | |
|
| **Precision** | Mixed precision training | |
|
| **Vocabulary Size** | 50,280 | |
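
As a sanity check, the token counts quoted above follow directly from the batch size and sequence length in this table, assuming every step consumes a full batch of full-length sequences:

```python
# Tokens seen per optimizer step = batch size x sequence length
tokens_per_step = 1024 * 2048  # ~2.1M tokens per step

print(f"{50_000 * tokens_per_step / 1e9:.0f}B tokens at 50k steps")    # ~105B (the ~100B released so far)
print(f"{200_000 * tokens_per_step / 1e9:.0f}B tokens at 200k steps")  # ~419B (the full ~420B-token corpus)
```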
|
|
|
### **2. Datasets** |
|
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)** |
|
- 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
|
- We use this dataset to train our model suite |
|
|
|
2. **[pretokenized-dolma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tinsy)** |
|
- A smaller version of the **pretokenized-dolma** corpus for quick experiments |
|
|
|
3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)** |
|
- A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus |
|
- The Paloma corpus was carefully curated to be disjoint from the Dolma corpus, making it a clean held-out evaluation set
|
- We use this corpus to evaluate the perplexity of our models |
|
|
|
4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)** |
|
- A sub-sampled version of the **pretokenized-paloma** corpus
|
|
|
All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**.
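
To inspect the data, the pretokenized splits can be streamed with the `datasets` library and decoded with that tokenizer. The sketch below is a rough example: the split name (`train`) and the token-id column name (`input_ids`) are assumptions, so verify them against the dataset cards.

```python
# Minimal sketch: streaming a pretokenized split and decoding a few tokens.
# Assumptions: the split is named "train" and token ids are stored in an
# "input_ids" column; verify both in the dataset viewer before relying on them.
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream to avoid downloading the full 420B-token corpus
dataset = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

# The OLMo tokenizer used to produce all Pico datasets
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")

example = next(iter(dataset))
print(tokenizer.decode(example["input_ids"][:64]))
```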
|
|
|
--- |
|
|
|
## 🔍 Citation |
|
If you use Pico in academic or professional work, please cite it: |
|
|
|
```bibtex |
|
@software{pico2025, |
|
author = {Diehl Martinez, Richard}, |
|
title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics}, |
|
year = {2025},
|
url = {https://github.com/pico-lm} |
|
} |
|
``` |
|
|
|
**Thanks for checking out Pico!** |
|
Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome! |