license: apache-2.0
datasets:
- bigcode/the-stack
- HuggingFaceFW/fineweb
Model Details
The TinyCodeLM family of tiny language models (LMs) is a collection of pretrained and instruction-tuned generative code models in 150M- and 400M-parameter sizes. These models are pretrained on a mixture of open-source web text and Python code. The instruction-tuned TinyCodeLM models are optimized for Python code synthesis and are trained on synthetic edit sequence data generated with the LintSeq algorithm.
Despite being trained on only 72 billion tokens of text, the models outperform many available open-source Python code synthesis models on HumanEval and MBPP. The TinyCodeLM-LintSeqInstruct models are state-of-the-art on Python synthesis for their size.
Model Developers: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus (NYU)
Variations: TinyCodeLM comes in two sizes (150M and 400M parameters), each in pretrained and edit sequence instruction-tuned variants.
Input: Text only.
Output: Models generate text and code. Instruction-tuned models generate code as sequences of "diffs".
Model Architecture: TinyCodeLMs are autoregressive language models with architectures that mimic the two smallest versions of GPT-2 (Radford et al., 2019), while integrating the transformer architecture changes of the OLMo models.
Instruction Tuning Data: TinyCodeLMs are instruction tuned on paired instruction and Python edit sequence data. These edit sequences are generated with the LintSeq algorithm over a source dataset of paired instructions and Python programs drawn from the Magicoder and StarCoder2 OSS-Instruct datasets (Wei et al., 2024).
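To make the idea of generating code as a sequence of "diffs" concrete, here is a minimal sketch that turns a few successive snapshots of a program into line-level diffs and then replays them to recover the final program. It is not LintSeq and does not reproduce the diff format the models actually emit, which this card does not specify. It uses Python's standard `difflib` as a stand-in, and the helper names `make_edit_sequence` and `replay` and the example snapshots are hypothetical.

```python
import difflib


def make_edit_sequence(states):
    """Turn successive program snapshots into a list of line-level diffs.

    Illustrative only: difflib.ndiff is a stand-in for whatever diff
    format the models are actually trained on.
    """
    deltas = []
    previous = []  # start from an empty file
    for state in states:
        current = state.splitlines(keepends=True)
        deltas.append(list(difflib.ndiff(previous, current)))
        previous = current
    return deltas


def replay(deltas):
    """Rebuild the final program by resolving each diff in order."""
    program = []
    for delta in deltas:
        # The "before" side of each diff must match the program so far.
        if list(difflib.restore(delta, 1)) != program:
            raise ValueError("diff does not apply to the current program")
        # The "after" side of the diff becomes the new program state.
        program = list(difflib.restore(delta, 2))
    return "".join(program)


# Hypothetical snapshots of a program being written edit by edit.
states = [
    "def add(a, b):\n    pass\n",
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n\nprint(add(2, 3))\n",
]

deltas = make_edit_sequence(states)
final_program = replay(deltas)
```

Each element of `deltas` is one edit, and joining the lines of an edit with `"".join(delta)` gives a readable view of what changed at that step.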
Benchmarks
Pretrained (Temperature 0)
Benchmark | TinyCodeLM 150M | TinyCodeLM 400M |
---|---|---|
HumanEval, pass@1 | 6.1 | 6.7 |
MBPP(+), pass@1 | 5.4 | 6.8 |
Edit Sequence / Instruction Tuned (Temperature-Tuned)
Benchmark | TinyCodeLM 150M | TinyCodeLM 400M |
---|---|---|
HumanEval, pass@1 | 12.8 | 13.4 |
HumanEval, pass@10 | 20.6 | 20.9 |
MBPP(+), pass@1 | 13.6 | 24.4 |
MBPP(+), pass@10 | 24.4 | 29.9 |
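This card does not state how the pass@k values were computed. A commonly used choice is the unbiased estimator introduced with HumanEval: draw n samples per problem, count the c samples that pass the tests, and estimate pass@k as 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the function name `pass_at_k` and the sample counts are illustrative, not taken from the evaluation behind the tables above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: sample budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# Illustrative numbers: 20 samples for one problem, 3 of them pass.
single_try = pass_at_k(n=20, c=3, k=1)
ten_tries = pass_at_k(n=20, c=3, k=10)
```

Under this assumption, a benchmark-level score is the mean of the per-problem estimates, typically reported as a percentage.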
Citation
@misc{piterbarg2024training,
title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis},
author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
year={2024},
eprint={2410.02749},
archivePrefix={arXiv},
primaryClass={cs.LG}
}