Updating README
README.md
CHANGED
@@ -4,41 +4,72 @@ emoji: 📈
|
|
4 |
colorFrom: red
|
5 |
colorTo: yellow
|
6 |
sdk: static
|
7 |
-
pinned:
|
|
|
|
|
8 |
---
|
9 |
|
10 |
-
# Pico: A Lightweight Framework for Studying Learning Dynamics
|
11 |
|
12 |
-
|
13 |
|
14 |
-
|
15 |
-
|
16 |
-
|
17 |
|
18 |
This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the code to train and analyze your own model suites from scratch.
|
19 |
|
20 |
## 🤗 HuggingFace Resources (You Are Here)
|
21 |
|
22 |
-
### Pre-trained Model Suite
|
23 |
-
|
24 |
-
|
25 |
-
- **pico-
|
26 |
-
- **pico-
|
27 |
-
- **pico-
|
28 |
|
29 |
-
|
30 |
|
31 |
-
All models are trained for 50,000 steps on the **pretokenized-dolma** dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
|
32 |
|
33 |
-
|
34 |
-
- Advanced training checkpoints (stored every 1,000 steps) that contain:
|
35 |
- Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
|
36 |
- Model activations and gradients
|
37 |
- The batch of training data observed at the given training step
|
38 |
-
- Wandb logs tracking the learning process
|
39 |
-
- Pre-computed perplexity scores on the paloma evaluation set
|
40 |
|
41 |
-
|
|
|
|
|
42 |
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
|
43 |
- 420B tokens of pre-processed, tokenized, and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
|
44 |
- We use this dataset to train our model suite
|
@@ -56,96 +87,19 @@ Each model includes:
|
|
56 |
|
57 |
All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**
|
58 |
|
59 |
-
|
60 |
-
|
61 |
-
Want to train your own suite of models? Visit our [GitHub repository](https://github.com/rdiehlmartinez/pico) to:
|
62 |
-
- Train models with custom architectures
|
63 |
-
- Experiment with different training regimes
|
64 |
-
- Modify checkpoint saving behavior
|
65 |
-
- Implement custom evaluation metrics
|
66 |
-
|
67 |
-
The training framework makes it easy to:
|
68 |
-
1. Train multiple models of different sizes
|
69 |
-
2. Ensure consistent training across all models
|
70 |
-
3. Save rich checkpoint data for learning dynamics analysis
|
71 |
-
4. Compare learning dynamics across scales
|
72 |
-
|
73 |
-
## 🛠️ Using the Resources
|
74 |
-
|
75 |
-
### Using Pre-trained Models (HuggingFace)
|
76 |
-
```python
|
77 |
-
from transformers import AutoModelForCausalLM
|
78 |
-
|
79 |
-
# Load our pre-trained model
|
80 |
-
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")
|
81 |
-
|
82 |
-
# Access specific checkpoint
|
83 |
-
model = AutoModelForCausalLM.from_pretrained(
|
84 |
-
"pico-lm/pico-small",
|
85 |
-
revision="step-xyz"
|
86 |
-
)
|
87 |
-
```
|
88 |
-
|
89 |
-
### Training Your Own Suite (GitHub)
|
90 |
-
```bash
|
91 |
-
# Clone the repository
|
92 |
-
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
|
93 |
-
source setup.sh
|
94 |
-
|
95 |
-
# Configure your model suite
|
96 |
-
# Edit configs/train.yaml to specify model sizes and training parameters
|
97 |
-
|
98 |
-
# Train your suite
|
99 |
-
python train.py --config configs/train.yaml
|
100 |
-
```
|
101 |
-
|
102 |
-
## 📊 Model Details
|
103 |
-
|
104 |
-
### Architecture
|
105 |
-
All models use:
|
106 |
-
- LLAMA-style transformer
|
107 |
-
- RMSNorm for normalization
|
108 |
-
- RoPE positional embeddings
|
109 |
-
- Multi-head attention with KV-cache
|
110 |
-
- SwiGLU activation function
|
111 |
-
|
112 |
-
### Training Configuration
|
113 |
-
Standard configuration (customizable in GitHub training):
|
114 |
-
- Sequence length: 2048
|
115 |
-
- Batch size: 1024
|
116 |
-
- Learning rate: 1e-3
|
117 |
-
- Weight decay: 0.1
|
118 |
-
- Gradient clipping: 1.0
|
119 |
-
- Mixed precision training
|
120 |
-
- Vocab size: 50280
|
121 |
-
|
122 |
-
## 🔬 Research Applications
|
123 |
-
|
124 |
-
Perfect for researchers studying:
|
125 |
-
- Learning dynamics across model scales
|
126 |
-
- Mechanistic interpretability
|
127 |
-
- Architecture and training effects
|
128 |
-
- Emergent model behaviors
|
129 |
-
|
130 |
-
Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.
|
131 |
-
|
132 |
-
## 🤝 Contributing
|
133 |
-
|
134 |
-
Contributions welcome on both platforms:
|
135 |
-
- **HuggingFace**: Model weights, datasets, and evaluation results
|
136 |
-
- **GitHub**: Training framework improvements, analysis tools, and documentation
|
137 |
-
|
138 |
-
## 📫 Contact
|
139 |
-
|
140 |
-
- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
|
141 |
-
- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)
|
142 |
|
143 |
## 🔍 Citation
|
|
|
144 |
|
145 |
```bibtex
|
146 |
@software{pico2024,
|
147 |
author = {Diehl Martinez, Richard},
|
148 |
-
title = {Pico: Framework for
|
149 |
year = {2024},
|
|
|
150 |
}
|
151 |
-
```
|
4 |
colorFrom: red
|
5 |
colorTo: yellow
|
6 |
sdk: static
|
7 |
+
pinned: true
|
8 |
+
thumbnail: >-
|
9 |
+
https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png
|
10 |
---
|
11 |
|
12 |
+
# **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics**
|
13 |
|
14 |
+
Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by:
|
15 |
|
16 |
+
1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase.
|
17 |
+
2. **Analyzing** these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.
|
18 |
+
|
19 |
+
For full documentation and code, visit our two main repositories:
|
20 |
+
- [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models.
|
21 |
+
- [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints.
|
22 |
|
23 |
This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.
|
24 |
|
25 |
+
> Pro Tip 🚀:
|
26 |
+
> To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.
|
27 |
+
|
28 |
+
---
|
29 |
+
|
30 |
## 🤗 HuggingFace Resources (You Are Here)
|
31 |
|
32 |
+
### **1. Pre-trained Model Suite**
|
33 |
+
|
34 |
+
Our complete suite of models from 10M to 500M parameters trained with Pico:
|
35 |
+
- [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (1M parameters)
|
36 |
+
- [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (10M parameters)
|
37 |
+
- [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (100M parameters)
|
38 |
+
- [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (500M parameters)
|
39 |
+
|
40 |
+
> 🚧 **Coming Soon!** **pico-decoder-xl** (1B parameters). Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!
|
41 |
+
|
42 |
+
All models are trained for 50,000 steps on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
|
43 |
+
|
44 |
+
In each model repository, we version-control a checkpoint every 1,000 steps. Each checkpoint contains:
|
45 |
+
- Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
|
46 |
+
- Model activations and gradients
|
47 |
+
- The batch of training data observed at the given training step
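
A minimal sketch of how these versioned checkpoints might be accessed from Python. It assumes that each checkpoint lives on its own Git branch of the model repository and that branch names follow a `step-<N>` pattern; that naming scheme is a guess for illustration, not something this README states, so the helper discovers the real branch names with `huggingface_hub.list_repo_refs` and only parses them. `parse_step`, `checkpoint_branches`, and `load_checkpoint` are hypothetical helper names.

```python
import re

REPO_ID = "pico-lm/pico-decoder-small"  # one of the model repositories listed above


def parse_step(branch_name):
    """Return the training step encoded in a branch name, or None.

    Assumes a hypothetical 'step-<N>' or 'step_<N>' naming scheme.
    """
    match = re.fullmatch(r"step[-_](\d+)", branch_name)
    return int(match.group(1)) if match else None


def checkpoint_branches(repo_id=REPO_ID):
    """Map training step -> branch name for every branch that looks like a checkpoint."""
    # Imported lazily: requires `huggingface_hub` and network access.
    from huggingface_hub import list_repo_refs

    steps = {}
    for branch in list_repo_refs(repo_id).branches:
        step = parse_step(branch.name)
        if step is not None:
            steps[step] = branch.name
    return steps


def load_checkpoint(revision, repo_id=REPO_ID):
    """Load the weights stored at a revision (branch name, tag, or commit hash)."""
    # Imported lazily: requires `transformers` and network access.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
```

With network access, `branches = checkpoint_branches()` followed by `load_checkpoint(branches[min(branches)])` would load the earliest checkpoint found. If the repositories use a different naming scheme, only `parse_step` needs to change.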
|
48 |
+
|
49 |
+
We visualize the learning process in our **[Wandb report](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)**.
|
50 |
+
|
51 |
+
#### 📊 Model Details
|
52 |
|
53 |
+
| **Aspect** | **Details** |
|
54 |
+
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
55 |
+
| **Architecture** | - Llama-style transformer (decoder-only)<br>- RMSNorm normalization<br>- RoPE (Rotary Positional Embeddings)<br>- Multi-head attention with KV-cache<br>- SwiGLU activation function |
|
56 |
+
| **Sequence Length** | 2048 |
|
57 |
+
| **Batch Size** | 1024 |
|
58 |
+
| **Optimizer** | AdamW |
|
59 |
+
| **Learning Rate** | 3e-4 (one-cycle warmup) |
|
60 |
+
| **Gradient Clipping** | 1.0 |
|
61 |
+
| **Precision** | Mixed precision training |
|
62 |
+
| **Vocabulary Size** | 50,280 |
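
The configuration above implies a fixed token budget per model, which the short calculation below works out. It is a back-of-the-envelope sketch that assumes the batch size counts sequences, that every sequence is filled to the full 2048 tokens, and that no data is repeated; the README does not state these details explicitly. Under those assumptions each step covers about 2.1M tokens, and a 50,000-step run covers roughly 105B tokens, about a quarter of the 420B-token dataset.

```python
# Values taken from the model details table and the surrounding text.
SEQUENCE_LENGTH = 2048        # tokens per sequence
BATCH_SIZE = 1024             # assumed to count sequences per optimizer step
TRAINING_STEPS = 50_000
CHECKPOINT_EVERY = 1_000
DATASET_TOKENS = 420e9        # size of pretokenized-dolma

# Tokens processed in one step, assuming every sequence is full length.
tokens_per_step = SEQUENCE_LENGTH * BATCH_SIZE

# Tokens seen over the whole run, and the share of the dataset that represents.
total_tokens = tokens_per_step * TRAINING_STEPS
fraction_of_dataset = total_tokens / DATASET_TOKENS

# Checkpoints saved at steps 1000, 2000, ..., 50000 (a step-0 checkpoint is not counted).
num_checkpoints = TRAINING_STEPS // CHECKPOINT_EVERY

print(f"tokens per step:      {tokens_per_step:,}")
print(f"tokens over the run:  {total_tokens:,}")
print(f"share of the dataset: {fraction_of_dataset:.1%}")
print(f"checkpoints per model: {num_checkpoints}")
```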
|
63 |
|
|
|
64 |
|
69 |
|
70 |
+
|
71 |
+
|
72 |
+
### **2. Datasets**
|
73 |
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
|
74 |
- 420B tokens of pre-processed, tokenized, and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
|
75 |
- We use this dataset to train our model suite
|
|
|
87 |
|
88 |
All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**
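
Because the datasets are large and already tokenized, it can be convenient to stream a few examples and decode them with the same tokenizer rather than download everything. The sketch below shows one way to do that with the `datasets` and `transformers` libraries. The `train` split name is an assumption, and the README does not say which field holds the token IDs, so the helpers return raw examples and leave decoding to the caller. `take`, `peek_at_dataset`, and `load_tokenizer` are hypothetical helper names.

```python
from itertools import islice

DATASET_ID = "pico-lm/pretokenized-dolma"
TOKENIZER_ID = "allenai/OLMo-7B-0724-hf"  # repository linked as the OLMo Tokenizer above


def take(iterable, n):
    """Return the first n items of an iterable as a list."""
    return list(islice(iterable, n))


def peek_at_dataset(n=2, split="train"):
    """Stream the first n examples without downloading the full dataset.

    The split name is an assumption; check the dataset card for the real one.
    """
    # Imported lazily: requires `datasets` and network access.
    from datasets import load_dataset

    stream = load_dataset(DATASET_ID, split=split, streaming=True)
    return take(stream, n)


def load_tokenizer():
    """Load the tokenizer that the datasets were tokenized with."""
    # Imported lazily: requires `transformers` and network access.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(TOKENIZER_ID)
```

After inspecting `example.keys()` on a streamed example to find the field that stores token IDs, `load_tokenizer().decode(...)` on that field recovers readable text.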
|
89 |
|
90 |
+
---
|
91 |
|
92 |
## 🔍 Citation
|
93 |
+
If you use Pico in academic or professional work, please cite it:
|
94 |
|
95 |
```bibtex
|
96 |
@software{pico2024,
|
97 |
author = {Diehl Martinez, Richard},
|
98 |
+
title = {Pico: A Lightweight Framework for Studying Learning Dynamics in Language Models},
|
99 |
year = {2024},
|
100 |
+
url = {https://github.com/pico-lm}
|
101 |
}
|
102 |
+
```
|
103 |
+
|
104 |
+
**Thanks for checking out Pico!**
|
105 |
+
Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome!
|