---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png
license: apache-2.0
---

# **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics**

Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by:

1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase.
2. **Analyzing** these models' learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.

For full documentation and code, visit our two main repositories:

- [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models.
- [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints.

This Hugging Face organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch. All code and artifacts are released under the permissive Apache-2.0 license.

> Pro Tip 🚀 :
> To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.

---

## 🤗 HuggingFace Resources (You Are Here)

### **1. Pre-trained Model Suite**

Our complete suite of models, from 11M to 570M parameters, trained with Pico:

- [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (11M parameters)
- [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (65M parameters)
- [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (181M parameters)
- [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (570M parameters)

> 🚧 **Disclaimer** These models are still under construction. The models released in this repository have been trained for 50,000 steps (corresponding to ~100B tokens). Training will finalize after 200,000 steps.
>
> 🚧 **Coming Soon!** **pico-decoder-xl** (1B+ parameters). Watch this space or star our [GitHub repository](https://github.com/pico-lm) for updates!

All models are trained on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.

In each model repository, we version-control checkpoints every 1,000 steps. Each checkpoint contains:

- Weights and optimizer states (HuggingFace- and Lightning Fabric-compatible versions)
- Model activations and gradients
- The batch of training data observed at the given training step

We visualize the learning process in our **[Wandb](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)** report.
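The released checkpoints can be pulled straight from the Hub. Below is a minimal sketch of loading one with the `transformers` library; the `step_50000` revision name and the `trust_remote_code=True` flag are assumptions, so check the individual model cards for the exact loading instructions.

```python
# Minimal sketch of loading a Pico checkpoint from the Hugging Face Hub.
# Assumptions (verify against the model cards): the repositories expose a
# causal-LM interface through transformers, and intermediate checkpoints
# are published as revisions named like "step_50000" (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "pico-lm/pico-decoder-tiny"  # 11M-parameter model

# The models are trained on text tokenized with the OLMo tokenizer
# (see the Datasets section below).
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    revision="step_50000",    # hypothetical checkpoint revision
    trust_remote_code=True,   # in case the repo ships custom model code
)

# Sanity check: greedy-decode a short continuation.
inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern applies to the larger models; only the repository name changes.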
Model Details:

| **Aspect**            | **Details**                                                                                                                                                                      |
|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Architecture**      | - Llama-style transformer (decoder-only) <br> - RMSNorm normalization <br> - RoPE (Rotary Positional Embeddings) <br> - Multi-head attention with KV-cache <br> - SwiGLU activation function |
| **Sequence Length**   | 2048                                                                                                                                                                               |
| **Batch Size**        | 1024                                                                                                                                                                               |
| **Optimizer**         | AdamW                                                                                                                                                                              |
| **Learning Rate**     | 3e-4 (one-cycle warmup)                                                                                                                                                            |
| **Gradient Clipping** | 1.0                                                                                                                                                                                |
| **Precision**         | Mixed precision training                                                                                                                                                           |
| **Vocabulary Size**   | 50,280                                                                                                                                                                             |

### **2. Datasets**

1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
   - 420B tokens of pre-processed, tokenized, and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
   - We use this dataset to train our model suite
2. **[pretokenized-dolma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tinsy)**
   - A smaller version of the **pretokenized-dolma** corpus for quick experiments
3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
   - A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus
   - The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
   - We use this corpus to evaluate the perplexity of our models
4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
   - A sub-sampled version of the **pretokenized-paloma** corpus

All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)** and can be loaded as sketched below.
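As a quick start, the sketch below streams a few examples from the pre-tokenized data and decodes them with the OLMo tokenizer. The `train` split name and the `input_ids` column name are assumptions; consult the dataset cards for the exact schema.

```python
# Minimal sketch of inspecting the pre-tokenized data. Assumptions (check
# the dataset cards): sequences are stored under an "input_ids" column in
# a "train" split. Streaming avoids downloading the full corpus up front.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset(
    "pico-lm/pretokenized-dolma-tinsy",  # small version for quick experiments
    split="train",
    streaming=True,
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")

example = next(iter(dataset))
token_ids = example["input_ids"]          # assumed column name
print(f"sequence length: {len(token_ids)}")
print(tokenizer.decode(token_ids[:64]))   # peek at the first few tokens
```

Swapping in `pico-lm/pretokenized-paloma-tinsy` works the same way for evaluation data.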
---

## 🔍 Citation

If you use Pico in academic or professional work, please cite it:

```bibtex
@software{pico2025,
  author = {Diehl Martinez, Richard},
  title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
  year = {2025},
  url = {https://github.com/pico-lm}
}
```

**Thanks for checking out Pico!** Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue; contributions are welcome!