rdiehlmartinez committed · verified · Commit 2e0d2c9 · 1 Parent(s): ae00997

Updating README

Files changed (1):
  1. README.md +58 -104
README.md CHANGED
@@ -4,41 +4,72 @@ emoji: 📈
  colorFrom: red
  colorTo: yellow
  sdk: static
- pinned: false
  ---

- # Pico: A Lightweight Framework for Studying Learning Dynamics

- Pico is a lightweight research framework that aims to demystify how language models learn. Built with simplicity in mind, it provides an efficient way to train and study models of different sizes. Visit our [website](https://www.picolm.io/) for more information.

- Pico consists of two key components:
- 1. **Pico Training Framework** (available on [GitHub](https://github.com/pico-lm/pico)): A transparent, lightweight codebase for training language models. We use this framework to train a series of language models across scale that we release on this HuggingFace space.
- 2. **Pico Analysis Framework** (available on [GitHub](https://github.com/pico-lm/pico-analysis)): A research framework to investigate and probe the learning dynamics of models trained using Pico.

  This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the code to train and analyze your own model suites from scratch.

  ## 🤗 HuggingFace Resources (You Are Here)

- ### Pre-trained Model Suite
- Our complete suite of models from 1M to 500M parameters trained with Pico:
- - **pico-tiny** (1M parameters)
- - **pico-small** (10M parameters)
- - **pico-medium** (100M parameters)
- - **pico-large** (500M parameters)

- > 🚧 **Coming Soon!** **pico-xl** (1B parameters). Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!

- All models are trained for 50,000 steps on the **pretokenized-dolma** dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.

- Each model includes:
- - Advanced training checkpoints (stored every 1,000 steps) that contain:
  - Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
  - Model activations and gradients
  - The batch of training data observed at the given training step
- - Wandb logs tracking the learning process
- - Pre-computed perplexity scores on the Paloma evaluation set

- ### Available Datasets
  1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
  - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
  - We use this dataset to train our model suite
@@ -56,96 +87,19 @@ Each model includes:

  All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**

- ## 🔧 GitHub Training Framework
-
- Want to train your own suite of models? Visit our [GitHub repository](https://github.com/rdiehlmartinez/pico) to:
- - Train models with custom architectures
- - Experiment with different training regimes
- - Modify checkpoint saving behavior
- - Implement custom evaluation metrics
-
- The training framework makes it easy to:
- 1. Train multiple models of different sizes
- 2. Ensure consistent training across all models
- 3. Save rich checkpoint data for learning dynamics analysis
- 4. Compare learning dynamics across scales
-
- ## 🛠️ Using the Resources
-
- ### Using Pre-trained Models (HuggingFace)
- ```python
- from transformers import AutoModelForCausalLM
-
- # Load our pre-trained model
- model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")
-
- # Access a specific checkpoint
- model = AutoModelForCausalLM.from_pretrained(
-     "pico-lm/pico-small",
-     revision="step-xyz"
- )
- ```
-
- ### Training Your Own Suite (GitHub)
- ```bash
- # Clone the repository
- git clone https://github.com/rdiehlmartinez/pico.git && cd pico
- source setup.sh
-
- # Configure your model suite
- # Edit configs/train.yaml to specify model sizes and training parameters
-
- # Train your suite
- python train.py --config configs/train.yaml
- ```
-
- ## 📊 Model Details
-
- ### Architecture
- All models use:
- - LLAMA-style transformer
- - RMSNorm for normalization
- - RoPE positional embeddings
- - Multi-head attention with KV-cache
- - SwiGLU activation function
-
- ### Training Configuration
- Standard configuration (customizable in GitHub training):
- - Sequence length: 2048
- - Batch size: 1024
- - Learning rate: 1e-3
- - Weight decay: 0.1
- - Gradient clipping: 1.0
- - Mixed precision training
- - Vocab size: 50280
-
- ## 🔬 Research Applications
-
- Perfect for researchers studying:
- - Learning dynamics across model scales
- - Mechanistic interpretability
- - Architecture and training effects
- - Emergent model behaviors
-
- Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.
-
- ## 🤝 Contributing
-
- Contributions welcome on both platforms:
- - **HuggingFace**: Model weights, datasets, and evaluation results
- - **GitHub**: Training framework improvements, analysis tools, and documentation
-
- ## 📫 Contact
-
- - GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- - Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)

  ## 🔍 Citation

  ```bibtex
  @software{pico2024,
  author = {Diehl Martinez, Richard},
- title = {Pico: Framework for Training Tiny Language Models},
  year = {2024},
  }
- ```

  colorFrom: red
  colorTo: yellow
  sdk: static
+ pinned: true
+ thumbnail: >-
+   https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png
  ---

+ # **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics**

+ Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by:

+ 1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase.
+ 2. **Analyzing** these models' learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.
+
+ For full documentation and code, visit our two main repositories:
+ - [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models.
+ - [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints.

  This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the code to train and analyze your own model suites from scratch.

+ > Pro Tip 🚀:
+ > To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.
+
+ ---
+
  ## 🤗 HuggingFace Resources (You Are Here)

+ ### **1. Pre-trained Model Suite**
+
+ Our complete suite of models from 1M to 500M parameters trained with Pico:
+ - [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (1M parameters)
+ - [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (10M parameters)
+ - [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (100M parameters)
+ - [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (500M parameters)
+
+ > 🚧 **Coming Soon!** **pico-decoder-xl** (1B parameters). Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!
+
+ All models are trained for 50,000 steps on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
+
+ In each model repository, we version control checkpoints every 1,000 steps (see the loading sketch after this list); each checkpoint contains:
+ - Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
+ - Model activations and gradients
+ - The batch of training data observed at the given training step
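
For example, a specific checkpoint can be loaded with `transformers` roughly as follows; the `revision` value is a placeholder for whatever branch or tag names the model repositories actually expose, and `trust_remote_code=True` may be needed depending on how the weights are packaged.

```python
from transformers import AutoModelForCausalLM

# Latest weights from the default branch of one suite member
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-decoder-tiny")

# An intermediate training checkpoint, selected by git revision
# ("step-1000" is a placeholder; check the repository's branches/tags)
checkpoint = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-tiny",
    revision="step-1000",
)
```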
+
+ We visualize the learning process in our **[Wandb report](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)**.
+
+ **📊 Model Details**

+ | **Aspect**            | **Details** |
+ |-----------------------|-------------|
+ | **Architecture**      | - Llama-style transformer (decoder-only)<br>- RMSNorm normalization<br>- RoPE (Rotary Positional Embeddings)<br>- Multi-head attention with KV-cache<br>- SwiGLU activation function |
+ | **Sequence Length**   | 2048 |
+ | **Batch Size**        | 1024 |
+ | **Optimizer**         | AdamW |
+ | **Learning Rate**     | 3e-4 (one-cycle warmup) |
+ | **Gradient Clipping** | 1.0 |
+ | **Precision**         | Mixed precision training |
+ | **Vocabulary Size**   | 50,280 |
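
A quick way to cross-check this table against the published weights is to load one suite member and inspect its size and stored configuration; the sketch below uses only generic `transformers`/PyTorch attributes, and the exact config fields depend on how the checkpoints are packaged.

```python
from transformers import AutoModelForCausalLM

# Load one suite member and compare it with the table above
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-decoder-large")

num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params / 1e6:.1f}M")  # nominal size: 500M
print(model.config)  # vocabulary size, hidden size, etc., as stored on the Hub
```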
+
+ ### **2. Datasets**
  1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
  - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
  - We use this dataset to train our model suite

  All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**
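
For orientation, the pretokenized corpus can be streamed with the `datasets` library and decoded with that tokenizer; the `train` split and the `input_ids` column used below are assumptions to check against the dataset card.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the corpus instead of downloading all 420B tokens up front
stream = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")

example = next(iter(stream))
print(example.keys())                                # inspect the actual schema
print(tokenizer.decode(example["input_ids"][:32]))   # assumes an "input_ids" field
```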

+ ---

  ## 🔍 Citation
+ If you use Pico in academic or professional work, please cite it:

  ```bibtex
  @software{pico2024,
  author = {Diehl Martinez, Richard},
+ title = {Pico: A Lightweight Framework for Studying Learning Dynamics in Language Models},
  year = {2024},
+ url = {https://github.com/pico-lm}
  }
+ ```
+
+ **Thanks for checking out Pico!**
+ Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue; contributions are welcome!