rdiehlmartinez committed
Commit a547025 · verified · 1 Parent(s): c59c115

Update README.md

Files changed (1)
  1. README.md +31 -18
README.md CHANGED
@@ -9,41 +9,51 @@ pinned: false
  # Pico: A Lightweight Framework for Studying Learning Dynamics

- Pico is a lightweight research framework that demystifies how language models learn. Built with simplicity in mind, it provides an efficient way to train and study models of different sizes. Visit our [website](https://www.picolm.io/) for more information.

  Pico consists of two key components:
- 1. **Pre-trained Model Suite** (hosted here on HuggingFace)
- 2. **Training Framework** (available on [GitHub](https://github.com/rdiehlmartinez/pico))

- This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.

  ## πŸ€— HuggingFace Resources (You Are Here)

- > 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!
-
- ### Pre-trained Model Suite (Releasing January 2025)
- Our complete suite of models from 1M to 1B parameters:
  - **pico-tiny** (1M parameters)
  - **pico-small** (10M parameters)
  - **pico-medium** (100M parameters)
  - **pico-large** (500M parameters)
- - **pico-xl** (1B parameters)

  Each model includes:
- - Complete training checkpoints
- - Saved activations and gradients
- - Pre-computed evaluation perplexity scores
  ### Available Datasets
  1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
- - 420B tokens of pre-processed text
- - Cleaned and shuffled DOLMA corpus

  2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
- - Smaller version for quick experiments

- 3. **[pretokenized-eval-batch](https://huggingface.co/datasets/pico-lm/pretokenized-eval-batch)**
- - Batch of eval data for generating model activations

  ## πŸ”§ GitHub Training Framework
@@ -91,7 +101,7 @@ python train.py --config configs/train.yaml
  ## πŸ“Š Model Details

  ### Architecture
- All models (both pre-trained and self-trained) use:
  - LLAMA-style transformer
  - RMSNorm for normalization
  - RoPE positional embeddings
@@ -100,11 +110,14 @@ All models (both pre-trained and self-trained) use:
  ### Training Configuration
  Standard configuration (customizable in GitHub training):
  - Batch size: 1024
  - Learning rate: 1e-3
  - Weight decay: 0.1
  - Gradient clipping: 1.0
  - Mixed precision training

  ## πŸ”¬ Research Applications
  # Pico: A Lightweight Framework for Studying Learning Dynamics

+ Pico is a lightweight research framework that aims to demystify how language models learn. Built with simplicity in mind, it provides an efficient way to train and study models of different sizes. Visit our [website](https://www.picolm.io/) for more information.

  Pico consists of two key components:
+ 1. **Pico Training Framework** (available on [GitHub](https://github.com/pico-lm/pico)): A transparent, lightweight codebase for training language models. We use this framework to train a suite of language models across scales, which we release in this HuggingFace organization.
+ 2. **Pico Analysis Framework** (available on [GitHub](https://github.com/pico-lm/pico-analysis)): A research framework for investigating and probing the learning dynamics of models trained with Pico.

+ This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.

  ## πŸ€— HuggingFace Resources (You Are Here)

+ ### Pre-trained Model Suite
+ Our complete suite of models from 1M to 500M parameters trained with Pico:
  - **pico-tiny** (1M parameters)
  - **pico-small** (10M parameters)
  - **pico-medium** (100M parameters)
  - **pico-large** (500M parameters)
+
+ > 🚧 **Coming Soon!** **pico-xl** (1B parameters). Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!
+
+ All models are trained for 50,000 steps on the **pretokenized-dolma** dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.

  Each model includes:
+ - Rich training checkpoints (stored every 1,000 steps) that contain:
+   - Weights and optimizer states (HuggingFace- and Lightning Fabric-compatible versions; see the loading sketch below)
+   - Model activations and gradients
+   - The batch of training data observed at the given training step
+   - Wandb logs tracking the learning process
+   - Pre-computed perplexity scores on the Paloma evaluation set
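Because the checkpoints are released in a HuggingFace-compatible format, they can be loaded with the standard `transformers` API. The sketch below is illustrative only: the `pico-lm/pico-tiny` repository id is inferred from the model names listed above, and the `step-1000` revision naming for intermediate checkpoints is an assumption, so check the individual model cards for the exact ids and checkpoint layout.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id (model names above, under the pico-lm organization) and an
# assumed revision naming scheme for the per-step checkpoints -- verify on the model card.
repo_id = "pico-lm/pico-tiny"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    revision="step-1000",  # hypothetical branch/tag for the 1,000-step checkpoint
    # trust_remote_code=True,  # may be required if the architecture ships as custom code
)

# Quick sanity check: generate a short continuation from the loaded checkpoint.
inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```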
  ### Available Datasets
  1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
+ - 420B tokens of pre-processed, tokenized, and shuffled text extracted from the [DOLMA](https://allenai.org/dolma) corpus
+ - We use this dataset to train our model suite

  2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
+ - A smaller version of the **pretokenized-dolma** corpus for quick experiments
+
+ 3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
+ - A tokenized and shuffled version of the [Paloma](https://allenai.org/evaluation-frameworks) evaluation corpus
+ - The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
+ - We use this corpus to evaluate the perplexity of our models
+
+ 4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
+ - A sub-sampled version of the **pretokenized-paloma** corpus
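To show how the corpora listed above can be consumed, here is a minimal sketch that streams the pre-tokenized training set with the `datasets` library. The split name and column names are assumptions, so inspect the dataset card and schema before building on them.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading all 420B tokens up front.
train_stream = load_dataset(
    "pico-lm/pretokenized-dolma",
    split="train",       # assumed split name; check the dataset card
    streaming=True,
)

# Peek at a single pre-tokenized example; column names (e.g. "input_ids") are an
# assumption, so inspect the keys before writing a training loop around them.
first_example = next(iter(train_stream))
print(list(first_example.keys()))
```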
  ## πŸ”§ GitHub Training Framework
  ## πŸ“Š Model Details

  ### Architecture
+ All models use:
  - LLAMA-style transformer
  - RMSNorm for normalization (see the sketch after this list)
  - RoPE positional embeddings
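To make the normalization component concrete, below is a minimal, self-contained PyTorch sketch of RMSNorm as used in LLaMA-style transformers. It is illustrative only; Pico's actual implementation lives in the GitHub training framework and may differ in details.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescale by 1 / RMS(x), then apply a learned gain."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each feature vector by its root mean square (no mean subtraction,
        # unlike LayerNorm), then scale by the learned per-dimension weight.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```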
 
  ### Training Configuration
  Standard configuration (customizable in GitHub training; sketched as a config dict after this list):
+ - Sequence length: 2048
  - Batch size: 1024
  - Learning rate: 1e-3
  - Weight decay: 0.1
  - Gradient clipping: 1.0
  - Mixed precision training
+ - Vocab size: 50280 (using the [OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json))
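For illustration, these defaults can be gathered into a single configuration object. The key names below are assumptions chosen for readability; the authoritative settings live in `configs/train.yaml` in the Pico training framework on GitHub.

```python
# Illustrative sketch of the standard training configuration listed above.
# Key names are assumptions; the canonical values live in configs/train.yaml.
TRAIN_CONFIG = {
    "sequence_length": 2048,
    "batch_size": 1024,
    "learning_rate": 1e-3,
    "weight_decay": 0.1,
    "gradient_clip_norm": 1.0,
    "mixed_precision": True,   # the exact precision mode is not specified above
    "vocab_size": 50280,       # OLMo tokenizer
    "training_steps": 50_000,  # from the model suite description above
}
```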
  ## πŸ”¬ Research Applications