---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png
license: apache-2.0
---

# **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics**

Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by:

1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase.
2. **Analyzing** these models' learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.

For full documentation and code, visit our two main repositories:

- [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models.
- [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints.

This Hugging Face organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch. All code and artifacts are released under the permissive Apache-2.0 license.

> Pro Tip 🚀 :
> To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.

---

## 🤗 HuggingFace Resources (You Are Here)

### **1. Pre-trained Model Suite**

Our complete suite of models, from 11M to 570M parameters, trained with Pico:

- [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (11M parameters)
- [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (65M parameters)
- [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (181M parameters)
- [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (570M parameters)

> 🚧 **Disclaimer** These models are still under construction. The models released in this repository have been trained for 50,000 steps (corresponding to ~100B tokens). Training will finalize after 200,000 steps.
>
> 🚧 **Coming Soon!** **pico-decoder-xl** (1B+ parameters). Watch this space or star our [GitHub repository](https://github.com/pico-lm) for updates!

All models are trained on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.

In each model repository, we version-control checkpoints every 1,000 steps. Each checkpoint contains:

- Weights and optimizer states (HuggingFace- and Lightning Fabric-compatible versions)
- Model activations and gradients
- The batch of training data observed at the given training step

We visualize the learning process in our **[Wandb](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)** report.
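The released checkpoints can be pulled straight from the Hub. Below is a minimal sketch of loading one with the `transformers` library; the `step_50000` revision name and the `trust_remote_code=True` flag are assumptions, so check the individual model cards for the exact loading instructions.

```python
# Minimal sketch of loading a Pico checkpoint from the Hugging Face Hub.
# Assumptions (verify against the model cards): the repositories expose a
# causal-LM interface through transformers, and intermediate checkpoints
# are published as revisions named like "step_50000" (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "pico-lm/pico-decoder-tiny"  # 11M-parameter model

# The models are trained on text tokenized with the OLMo tokenizer
# (see the Datasets section below).
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    revision="step_50000",    # hypothetical checkpoint revision
    trust_remote_code=True,   # in case the repo ships custom model code
)

# Sanity check: greedy-decode a short continuation.
inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern applies to the larger models; only the repository name changes.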
Model Details:

| **Aspect**            | **Details**                                                                                                                                                                      |
|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Architecture**      | - Llama-style transformer (decoder-only) <br> - RMSNorm normalization <br> - RoPE (Rotary Positional Embeddings) <br> - Multi-head attention with KV-cache <br> - SwiGLU activation function |
| **Sequence Length**   | 2048                                                                                                                                                                               |
| **Batch Size**        | 1024                                                                                                                                                                               |
| **Optimizer**         | AdamW                                                                                                                                                                              |
| **Learning Rate**     | 3e-4 (one-cycle warmup)                                                                                                                                                            |
| **Gradient Clipping** | 1.0                                                                                                                                                                                |
| **Precision**         | Mixed precision training                                                                                                                                                           |
| **Vocabulary Size**   | 50,280                                                                                                                                                                             |

### **2. Datasets**

1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
   - 420B tokens of pre-processed, tokenized, and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
   - We use this dataset to train our model suite
2. **[pretokenized-dolma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tinsy)**
   - A smaller version of the **pretokenized-dolma** corpus for quick experiments
3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
   - A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus
   - The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
   - We use this corpus to evaluate the perplexity of our models
4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
   - A sub-sampled version of the **pretokenized-paloma** corpus

All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)** and can be loaded as sketched below.
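As a quick start, the sketch below streams a few examples from the pre-tokenized data and decodes them with the OLMo tokenizer. The `train` split name and the `input_ids` column name are assumptions; consult the dataset cards for the exact schema.

```python
# Minimal sketch of inspecting the pre-tokenized data. Assumptions (check
# the dataset cards): sequences are stored under an "input_ids" column in
# a "train" split. Streaming avoids downloading the full corpus up front.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset(
    "pico-lm/pretokenized-dolma-tinsy",  # small version for quick experiments
    split="train",
    streaming=True,
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")

example = next(iter(dataset))
token_ids = example["input_ids"]          # assumed column name
print(f"sequence length: {len(token_ids)}")
print(tokenizer.decode(token_ids[:64]))   # peek at the first few tokens
```

Swapping in `pico-lm/pretokenized-paloma-tinsy` works the same way for evaluation data.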
---

## 🔍 Citation

If you use Pico in academic or professional work, please cite it:

```bibtex
@software{pico2025,
  author = {Diehl Martinez, Richard},
  title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
  year = {2025},
  url = {https://github.com/pico-lm}
}
```

**Thanks for checking out Pico!** Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue; contributions are welcome!