---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: true
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/638a2e13f32316c0440f5337/ly9Z0_QnlHzkKQN0TvfZq.png
license: apache-2.0
---
# **Pico: A Lightweight Framework for Studying Language Model Learning Dynamics**
Welcome to the **pico-lm** organization on Hugging Face! Pico is designed to **demystify** how language models learn by:
1. **Training** a family of language models at different scales using a transparent, minimally opinionated codebase.
2. **Analyzing** these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.
For full documentation and code, visit our two main repositories:
- [**pico-train**](https://github.com/pico-lm/pico-train): Minimalist training framework for language models.
- [**pico-analyze**](https://github.com/pico-lm/pico-analyze): Tools for measuring and visualizing model learning dynamics across checkpoints.
This Hugging Face organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.
All code and artifacts are released under the permissive Apache-2.0 license.
> Pro Tip 🚀 :
> To learn more about these libraries and explore detailed tutorials, visit our official website [**picolm.io**](https://www.picolm.io) and get fully acquainted with the Pico ecosystem.
---
## 🤗 HuggingFace Resources (You Are Here)
### **1. Pre-trained Model Suite**
Our complete suite of models trained with Pico, ranging from 11M to 570M parameters (a loading example follows below):
- [**pico-decoder-tiny**](https://huggingface.co/pico-lm/pico-decoder-tiny) (11M parameters)
- [**pico-decoder-small**](https://huggingface.co/pico-lm/pico-decoder-small) (65M parameters)
- [**pico-decoder-medium**](https://huggingface.co/pico-lm/pico-decoder-medium) (181M parameters)
- [**pico-decoder-large**](https://huggingface.co/pico-lm/pico-decoder-large) (570M parameters)
> 🚧 **Disclaimer:** These models are still under construction. The models released in this repository have been trained for 50,000 steps (corresponding to ~100B tokens); training will conclude at 200,000 steps.
>
> 🚧 **Coming Soon!** **pico-decoder-xl** (1B+ parameters). Watch this space or star our [GitHub repository](https://github.com/pico-lm) for updates!
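To try out one of these models, here is a minimal loading sketch using the `transformers` AutoModel API. Whether `trust_remote_code=True` is actually required, and whether the OLMo tokenizer repo is the right place to load a compatible tokenizer from, are assumptions not confirmed by this README; check the individual model cards.

```python
# Minimal sketch: loading a Pico checkpoint from the Hub with transformers.
# Assumptions: the repo is loadable via AutoModelForCausalLM and may need
# trust_remote_code=True; the OLMo tokenizer (used to build our datasets)
# is loaded here from the allenai/OLMo-7B-0724-hf repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-tiny",   # 11M-parameter model from the suite above
    trust_remote_code=True,        # assumption: custom decoder code ships with the repo
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")

inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)        # (batch, sequence_length, vocab_size=50280)
```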
All models are trained on the [**pretokenized-dolma**](https://huggingface.co/datasets/pico-lm/pretokenized-dolma) dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
In each model repository, we version-control checkpoints every 1,000 training steps (see the sketch after this list for fetching a specific revision). Each checkpoint contains:
- Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
- Model activations and gradients
- The batch of training data observed at the given training step
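Because checkpoints are versioned in the model repositories, a specific training step can be pinned via the Hub's revision mechanism. The sketch below assumes the checkpoints are exposed as Git branches that `huggingface_hub` can list; the exact branch naming scheme is not documented here, so inspect the printed names before pinning one.

```python
# Minimal sketch: discovering and loading an intermediate checkpoint.
# Assumption: per-step checkpoints are stored as branches in the model repo.
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM

refs = list_repo_refs("pico-lm/pico-decoder-tiny")
print([branch.name for branch in refs.branches])  # see which revisions exist

model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-tiny",
    revision=refs.branches[0].name,  # pin one of the listed checkpoint branches
    trust_remote_code=True,          # assumption, as above
)
```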
We visualize the learning process in our **[Wandb dashboard](https://wandb.ai/pico-lm/pico-decoder/reports/Pico-Decoder-Models---VmlldzoxMTgzNTQ4Mw)**.
Model Details:
| **Aspect** | **Details** |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Architecture** | - Llama-style transformer (decoder-only)<br>- RMSNorm normalization<br>- RoPE (Rotary Positional Embeddings)<br>- Multi-head attention with KV-cache<br>- SwiGLU activation function |
| **Sequence Length** | 2048 |
| **Batch Size** | 1024 |
| **Optimizer** | AdamW |
| **Learning Rate** | 3e-4 (one-cycle warmup) |
| **Gradient Clipping** | 1.0 |
| **Precision** | Mixed precision training |
| **Vocabulary Size** | 50,280 |
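These hyperparameters also account for the token counts quoted above: each step processes 1024 sequences of 2048 tokens, i.e. roughly 2.1M tokens per step, so 50,000 steps corresponds to ~100B tokens and the full 200,000-step run will cover roughly the 420B-token **pretokenized-dolma** corpus. A quick sanity check in Python:

```python
# Back-of-the-envelope check of the token counts, using only numbers from the table.
batch_size = 1024        # sequences per optimizer step
sequence_length = 2048   # tokens per sequence

tokens_per_step = batch_size * sequence_length
print(f"{tokens_per_step:,} tokens per step")  # 2,097,152

for steps in (50_000, 200_000):
    print(f"{steps:,} steps -> ~{tokens_per_step * steps / 1e9:.0f}B tokens")
    # 50,000 steps  -> ~105B tokens (the "~100B" in the disclaimer above)
    # 200,000 steps -> ~419B tokens (roughly the 420B-token pretokenized-dolma corpus)
```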
### **2. Datasets**
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
- 420B tokens of pre-processed, tokenized, and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
- We use this dataset to train our model suite
2. **[pretokenized-dolma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tinsy)**
- A smaller version of the **pretokenized-dolma** corpus for quick experiments
3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
- A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus
- The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
- We use this corpus to evaluate the perplexity of our models
4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
- A sub-sampled version of the **pretokenized-paloma** corpus
All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**.
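Any of these datasets can be pulled with the `datasets` library; streaming avoids downloading a full corpus. The split name and column layout below are assumptions, so inspect the first record to confirm the actual schema.

```python
# Minimal sketch: streaming a pre-tokenized Pico dataset from the Hub.
# Assumptions: a "train" split exists and token ids live in some column
# (inspect the keys of the first record to find it).
from datasets import load_dataset

dataset = load_dataset(
    "pico-lm/pretokenized-dolma-tinsy",  # the small corpus, handy for quick experiments
    split="train",                       # assumption: split name
    streaming=True,                      # iterate without downloading everything
)

first_example = next(iter(dataset))
print(first_example.keys())              # inspect the actual column names
```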
---
## 🔍 Citation
If you use Pico in academic or professional work, please cite it:
```bibtex
@software{pico2025,
author = {Diehl Martinez, Richard},
title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
year = {2025},
url = {https://github.com/pico-lm}
}
```
**Thanks for checking out Pico!**
Star our [GitHub repositories](https://github.com/pico-lm) or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome!