---
license: apache-2.0
datasets:
- bigcode/the-stack
- HuggingFaceFW/fineweb
---

# Model Details

The TinyCodeLM family of tiny language models (LMs) is a collection of pretrained and instruction tuned generative code models in 150M and 400M sizes. These models are pretrained on a mixture of open-source web text and Python code. The instruction tuned TinyCodeLM models are optimized for Python code synthesis, and are trained on [synthetic edit sequence data generated with the LintSeq algorithm](https://arxiv.org/abs/2410.02749).

Despite being trained on only 72 billion tokens of text, the models outperform many of the available open source Python code synthesis models on HumanEval and MBPP. The TinyCodeLM-LintSeqInstruct models are state-of-the-art on Python synthesis for their size.

**Model Developers** Ulyana Piterbarg, Lerrel Pinto, Rob Fergus (NYU)

**Variations** TinyCodeLM comes in two sizes (150M and 400M parameters) in pretrained and edit sequence instruction tuned variants.

**Input** Text only.

**Output** Models generate text and code. Instruction tuned models generate code via sequences of "diffs".

**Model Architecture** TinyCodeLMs are autoregressive language models with architectures that mimic the two smallest versions of GPT-2 (Radford et al., 2019), while integrating the transformer architecture changes of the OLMo models.

**Instruction Tuning Data** TinyCodeLMs are instruction tuned on paired instruction and Python edit sequence data. These edit sequences are generated with the LintSeq algorithm over a source dataset of paired instruction and Python programs drawn from the Magicoder and StarCoder2 OSS-Instruct datasets (Wei et al., 2024).
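
To give a rough sense of what generating code via sequences of "diffs" can look like, the sketch below renders a series of successive program states as insertion-only diffs using Python's standard `difflib`. This is an illustration only: it is not the LintSeq implementation, the helper name `edit_sequence` and the example program are hypothetical, and the exact diff format emitted by the instruction tuned models is not specified in this card.

```python
import difflib


def edit_sequence(states):
    """Return one unified diff per step, starting from an empty program.

    `states` is a list of successive versions of a program (hypothetical
    input format, chosen for illustration).
    """
    diffs = []
    prev = ""
    for cur in states:
        diff = difflib.unified_diff(
            prev.splitlines(keepends=True),
            cur.splitlines(keepends=True),
            n=0,  # no context lines, so each diff shows only the change
        )
        diffs.append("".join(diff))
        prev = cur
    return diffs


# Two successive states of a small program: each step only inserts lines.
states = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n\nprint(add(1, 2))\n",
]

diffs = edit_sequence(states)
for step, d in enumerate(diffs, start=1):
    print(f"# edit {step}")
    print(d)
```

Applying the diffs in order rebuilds the final program, which is the sense in which a sequence of edits can stand in for a single-pass generation.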

# Benchmarks

**Pretrained (Temperature 0)**

| **Benchmark**      | **TinyCodeLM 150M** | **TinyCodeLM 400M** |
| :----------------- | ------------------: | ------------------: |
| HumanEval, pass@1  |                 6.1 |                 6.7 |
| MBPP(+), pass@1    |                 5.4 |                 6.8 |

**Edit Sequence / Instruction Tuned (Temperature-Tuned)**

| **Benchmark**      | **TinyCodeLM 150M** | **TinyCodeLM 400M** |
| :----------------- | ------------------: | ------------------: |
| HumanEval, pass@1  |                12.8 |                13.4 |
| HumanEval, pass@10 |                20.6 |                20.9 |
| MBPP(+), pass@1    |                13.6 |                24.4 |
| MBPP(+), pass@10   |                24.4 |                29.9 |

# Citation

```
@misc{piterbarg2024training,
      title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis},
      author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
      year={2024},
      eprint={2410.02749},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```