---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png
---

# **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics**

Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by:

1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase.  
2. **Analyzing** these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.

For full documentation and code, visit our two main repositories:
- [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models.  
- [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints.

This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.

> **Pro Tip** 🚀: To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.

---

## 🤗 HuggingFace Resources (You Are Here)

### **1. Pre-trained Model Suite**

Our complete suite of models, ranging from 10M to 500M parameters, all trained with Pico:
- [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (10M parameters) 
- [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (50M parameters)
- [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (200M parameters)
- [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (500M parameters)

> 🚧 **Coming Soon!** **pico-decoder-xl** (1B parameters). Watch this space or star our [GitHub repository](https://github.com/pico-lm) for updates!

All models are trained for 50,000 steps on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset (corresponding to roughly 100B tokens). They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
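
To try a model from the suite, you can pull the latest weights straight from the Hub. The snippet below is a minimal sketch that assumes the repositories work with the standard `transformers` auto classes and ship a tokenizer; `trust_remote_code=True` is included in case the model relies on custom code:

```python
# Minimal sketch (see assumptions above): load a pico-decoder model and sample from it.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pico-lm/pico-decoder-tiny"  # swap in -small, -medium, or -large
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```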

In each model repository, we version-control checkpoints every 1,000 training steps (see the loading sketch after this list). Each checkpoint contains:
  - Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
  - Model activations and gradients 
  - The batch of training data observed at the given training step
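
Because each repository version-controls its checkpoints, intermediate training steps can be fetched through the Hub's `revision` argument. The sketch below assumes a revision naming scheme such as `step_1000`, which is hypothetical here; list the repository's refs (as shown) to see the actual branch or tag names:

```python
# Sketch: inspect and load an intermediate checkpoint by Hub revision.
# "step_1000" is a hypothetical revision name; list the repo's refs to find the real ones.
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM

refs = list_repo_refs("pico-lm/pico-decoder-tiny")
print([branch.name for branch in refs.branches])  # available checkpoint revisions

model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-tiny",
    revision="step_1000",  # hypothetical name; substitute one printed above
    trust_remote_code=True,
)
```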

We visualize the learning process in our **[Wandb report](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)**.

Model Details:

| **Aspect**              | **Details**                                                                                                                                                                               |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Architecture**        | - Llama-style transformer (decoder-only)<br>- RMSNorm normalization<br>- RoPE (Rotary Positional Embeddings)<br>- Multi-head attention with KV-cache<br>- SwiGLU activation function    |
| **Sequence Length**     | 2048                                                                                                                                                                                      |
| **Batch Size**          | 1024                                                                                                                                                                                      |
| **Optimizer**           | AdamW                                                                                                                                                                                     |
| **Learning Rate**       | 3e-4 (one-cycle warmup)                                                                                                                                                                   |
| **Gradient Clipping**   | 1.0                                                                                                                                                                                       |
| **Precision**           | Mixed precision training                                                                                                                                                                   |
| **Vocabulary Size**     | 50,280                                                                                                                                                                                    |
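
As a quick reference for two of the components named in the table, here is an illustrative PyTorch sketch of RMSNorm and a SwiGLU feed-forward block. This is not the pico-train source, just a compact restatement of the standard formulations:

```python
# Illustrative sketch (not the pico-train implementation) of RMSNorm and SwiGLU.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Normalize by the root-mean-square of the features, then rescale."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back down."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```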

### **2. Datasets**
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
   - 420B tokens of pre-processed, tokenized, and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
   - We use this dataset to train our model suite

2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
   - A smaller version of the **pretokenized-dolma** corpus for quick experiments

3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
   - A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus
   - The Paloma corpus was carefully curated to be disjoint from the Dolma corpus, so evaluation text is never seen during training
   - We use this corpus to evaluate the perplexity of our models
     
4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
   - A sub-sampled version of the **pretokenized-paloma** corpus

All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**.
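
To peek at the data itself, the datasets can be streamed from the Hub without downloading them in full. The sketch below assumes a `train` split and a token column named `input_ids` (check the dataset card or viewer for the actual names) and decodes a sample with the OLMo tokenizer linked above:

```python
# Sketch: stream the pretokenized training data and decode one example back into text.
# Assumes a "train" split and an "input_ids" column; verify against the dataset card.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")

example = next(iter(dataset))
print(example.keys())                               # check the available columns
print(tokenizer.decode(example["input_ids"][:64]))  # peek at the first few tokens as text
```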

---

## 🔍 Citation
If you use Pico in academic or professional work, please cite it:

```bibtex
@software{pico2025,
    author = {Diehl Martinez, Richard},
    title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
    year = {2025},
    url = {https://github.com/pico-lm}
}
```

**Thanks for checking out Pico!**  
Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome!