---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png
license: apache-2.0
---

# **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics**

Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by:

1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase.  
2. **Analyzing** these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.

For full documentation and code, visit our two main repositories:
- [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models.  
- [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints.

This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.

All code and artifacts are licensed under a permissive Apache-2.0 license.

> **Pro Tip** 🚀: To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.

---

## 🤗 HuggingFace Resources (You Are Here)

### **1. Pre-trained Model Suite**

Our complete suite of models, ranging from 11M to 570M parameters, trained with Pico:
- [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (11M parameters) 
- [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (65M parameters)
- [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (181M parameters)
- [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (570M parameters)

> 🚧 **Disclaimer:** These models are still under construction. The checkpoints currently released have been trained for 50,000 steps (roughly 100B tokens); training will conclude at 200,000 steps.
> 
> 🚧 **Coming Soon!** **pico-decoder-xl** (1B+ parameters). Watch this space or star our [GitHub repository](https://github.com/pico-lm) for updates!
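
Because each model repository ships HuggingFace-compatible weights (see the checkpoint contents below), the models can be loaded with the `transformers` Auto classes. The sketch below is illustrative rather than the official recipe: the exact model class, and whether `trust_remote_code` is required, depend on what each repository ships, and the tokenizer here is the OLMo one referenced in the Datasets section.

```python
# Minimal loading sketch (assumptions: AutoModelForCausalLM works for these
# repos, and the OLMo tokenizer matches the suite's vocabulary).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-tiny",
    trust_remote_code=True,  # only needed if the repo ships custom modeling code
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")

inputs = tokenizer("Language models learn in phases", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (batch, sequence_length, vocab_size)
```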


All models are trained on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.

In each model repository, we version-control checkpoints every 1,000 training steps (see the loading sketch after this list). Each checkpoint contains:
  - Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
  - Model activations and gradients 
  - The batch of training data observed at the given training step
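
To pull one of these checkpoints down locally, a `huggingface_hub` sketch along the following lines can be used. How revisions are named varies by repository, so the snippet lists the available branches instead of guessing a name:

```python
# Sketch: discover and download a single checkpoint revision.
from huggingface_hub import list_repo_refs, snapshot_download

refs = list_repo_refs("pico-lm/pico-decoder-tiny")
print([branch.name for branch in refs.branches])  # available checkpoint branches

local_dir = snapshot_download(
    repo_id="pico-lm/pico-decoder-tiny",
    revision=refs.branches[0].name,  # pick the checkpoint you want
)
print(f"Checkpoint files downloaded to {local_dir}")
```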

We visualize the learning process in our **[Wandb dashboard](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)**.

Model Details:

| **Aspect**              | **Details**                                                                                                                                                                               |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Architecture**        | - Llama-style transformer (decoder-only)<br>- RMSNorm normalization<br>- RoPE (Rotary Positional Embeddings)<br>- Multi-head attention with KV-cache<br>- SwiGLU activation function    |
| **Sequence Length**     | 2048                                                                                                                                                                                      |
| **Batch Size**          | 1024                                                                                                                                                                                      |
| **Optimizer**           | AdamW                                                                                                                                                                                     |
| **Learning Rate**       | 3e-4 (one-cycle warmup)                                                                                                                                                                   |
| **Gradient Clipping**   | 1.0                                                                                                                                                                                       |
| **Precision**           | Mixed precision training                                                                                                                                                                   |
| **Vocabulary Size**     | 50,280                                                                                                                                                                                    |
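
As a quick sanity check, the batch size and sequence length in the table line up with the token counts quoted elsewhere in this README:

```python
# Tokens processed per optimizer step = batch_size * sequence_length.
sequence_length = 2048
batch_size = 1024
tokens_per_step = batch_size * sequence_length  # 2,097,152 (~2.1M tokens)

print(f"{50_000 * tokens_per_step / 1e9:.0f}B")   # ~105B tokens after 50,000 steps
print(f"{200_000 * tokens_per_step / 1e9:.0f}B")  # ~419B tokens after 200,000 steps, i.e. the full 420B-token corpus
```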

### **2. Datasets**
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
   - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
   - We use this dataset to train our model suite

2. **[pretokenized-dolma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tinsy)**
   - A smaller version of the **pretokenized-dolma** corpus for quick experiments

3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
   - A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus
   - The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
   - We use this corpus to evaluate the perplexity of our models
     
4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
   - A sub-sampled version of the **pretokenized-paloma** corpus

All datasets are tokenized with the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**.
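
The datasets can be streamed directly with the `datasets` library. The sketch below is a minimal example; the split name and column schema are assumptions to be checked against each dataset card:

```python
# Stream a few pre-tokenized examples without downloading the full corpus.
from datasets import load_dataset

dataset = load_dataset(
    "pico-lm/pretokenized-dolma-tinsy", split="train", streaming=True
)
first_example = next(iter(dataset))
print(first_example.keys())  # inspect the actual column names (e.g. token ids)
```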

---

## 🔍 Citation
If you use Pico in academic or professional work, please cite it:

```bibtex
@software{pico2025,
    author = {Diehl Martinez, Richard},
    title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
    year = {2025},
    url = {https://github.com/pico-lm}
}
```

**Thanks for checking out Pico!**  
Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome!