---
license: apache-2.0
datasets:
- bigcode/the-stack
- HuggingFaceFW/fineweb
base_model:
- upiter/TinyCodeLM-400M
library_name: transformers
---

# Model Details

The TinyCodeLM family of tiny language models (LMs) is a collection of fully open-source pretrained and instruction-tuned generative code models in 150M and 400M parameter sizes. These models are pretrained on a mixture of open-source web text and Python code. The instruction-tuned TinyCodeLM models are optimized for Python code synthesis and are trained on [synthetic edit sequence data generated with the LintSeq algorithm](https://lintseq.github.io/).

Despite being trained on only 72 billion tokens of text, the models outperform many of the available open-source Python code synthesis models on HumanEval and MBPP. The TinyCodeLM-LintSeqInstruct models are state-of-the-art on Python synthesis for their size.

**Model Developers** Ulyana Piterbarg, Lerrel Pinto, Rob Fergus (NYU)

**Variations** TinyCodeLM comes in two sizes (150M and 400M parameters), each in pretrained and edit sequence instruction-tuned variants.

**Input** Text only.

**Output** Models generate text and code. Instruction-tuned models generate code via sequences of "diffs".

**Model Architecture** TinyCodeLMs are autoregressive language models with architectures that mimic the two smallest versions of GPT-2 (Radford et al., 2019) while integrating the transformer architecture changes of the OLMo models.

**Instruction Tuning Data** TinyCodeLMs are instruction tuned on paired instruction and Python edit sequence data. These edit sequences are generated with the LintSeq algorithm over a source dataset of paired instructions and Python programs drawn from the Magicoder and StarCoder2 OSS-Instruct datasets (Wei et al., 2024).

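
To make the "sequence of diffs" output format more concrete, here is a minimal sketch that renders a small program as a chain of insertion-only edits, each shown as a unified diff produced with Python's standard `difflib`. It is illustrative only: the intermediate program states are hand-picked rather than sampled with LintSeq, and the exact diff format the models emit is not specified in this card, so `program_states` and `edit_sequence` are hypothetical names introduced for this example.

```python
import difflib

# Hypothetical intermediate states of one program; each state only adds
# lines to the previous one (hand-written, not sampled with LintSeq).
program_states = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n\nprint(add(1, 2))\n",
]


def edit_sequence(states):
    """Render consecutive program states as unified diffs, starting from an empty file."""
    diffs = []
    previous = []  # the empty starting program
    for state in states:
        current = state.splitlines(keepends=True)
        # n=0 drops context lines so each diff shows only the changed lines.
        diff = difflib.unified_diff(previous, current, n=0)
        diffs.append("".join(diff))
        previous = current
    return diffs


diffs = edit_sequence(program_states)
for step, diff in enumerate(diffs, start=1):
    print(f"# edit {step}")
    print(diff)
```

Each printed diff contains only `+` lines after its header, so reading the diffs in order reconstructs the final program one insertion at a time.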
# Training Details

TinyCodeLM models were pretrained from scratch on a single H100 node (four GPUs) for two epochs. Pretraining took about two days for the 150M model and about six days for the 400M model. Instruction tuning was conducted on a single H100 GPU using DeepSpeed and took no more than several hours.

# Benchmarks

**Pretrained (Temperature 0)**

| **Benchmark**     | **TinyCodeLM 150M** | **TinyCodeLM 400M** |
| :---------------- | ------------------: | ------------------: |
| HumanEval, pass@1 |                 6.1 |                 6.7 |
| MBPP(+), pass@1   |                 5.4 |                 6.8 |

**Edit Sequence / Instruction Tuned (Temperature-Tuned)**

| **Benchmark**      | **TinyCodeLM 150M** | **TinyCodeLM 400M** |
| :----------------- | ------------------: | ------------------: |
| HumanEval, pass@1  |                12.8 |                13.4 |
| HumanEval, pass@10 |                20.6 |                20.9 |
| MBPP(+), pass@1    |                13.6 |                19.4 |
| MBPP(+), pass@10   |                24.4 |                29.9 |

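
This card does not state how the pass@k scores above were computed. A common choice is the unbiased estimator introduced with HumanEval (Chen et al., 2021): for a problem with `n` sampled completions of which `c` pass the tests, pass@k is estimated as 1 - C(n - c, k) / C(n, k). The sketch below assumes that estimator; the function name `pass_at_k` and the sample counts are made up for illustration and are not taken from the TinyCodeLM evaluation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: completions sampled for the problem
    c: completions that pass all tests
    k: sample budget being scored
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k completions fail,
    # in which case every size-k subset contains a passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only: 20 samples, 3 of which pass.
print(pass_at_k(n=20, c=3, k=1))
print(pass_at_k(n=20, c=3, k=10))
```

The function returns a fraction for one problem. A benchmark score like those in the tables would be the mean over all problems, multiplied by 100.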
# Citation

```
@misc{piterbarg2024editseq,
      title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis},
      author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
      year={2024},
      eprint={2410.02749},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

# Safety

This work explores data-driven mechanisms for improving the quality of language model-generated code. Our synthetic data generation method relies on open-source data, and our experiments leverage open-source software and resources. It is important to acknowledge that all language models for code synthesis have the potential to be misused, whether intentionally or unintentionally, to generate code with vulnerabilities and/or malicious behaviors. Any model-generated code has the potential to be harmful and must not be executed without precautions.