ABOUT_TEXT = """# Context
We believe that there are three main expectations of a good execution-based programming benchmark:
1. The benchmark should be easy to use and efficient in evaluating the fundamental capabilities of LLMs. Repo-level and agent-centric benchmarks (e.g., SWE-bench) are not suitable for this purpose.
2. The benchmark should be practical, covering various programming scenarios. Algorithm-specific benchmarks (e.g., HumanEval and MBPP) and domain-specific benchmarks (e.g., DS-1000) are unsuitable for this purpose.
3. The benchmark should be challenging, with tasks that demand strong compositional reasoning and instruction-following capabilities from LLMs. Benchmarks with simple tasks (e.g., ODEX) are unsuitable.
BigCodeBench is the first benchmark that meets all three expectations. It is an *__easy-to-use__* benchmark that evaluates LLMs with *__practical__* and *__challenging__* programming tasks, accompanied by an end-to-end evaluation framework [`bigcodebench`](https://github.com/bigcode-project/bigcodebench). We aim to assess how well LLMs can solve programming tasks in an open-ended setting, with the following two focuses:
- Diverse Function Calls: This design requires LLMs to utilize diverse function calls.
- Complex Instructions: This design requires LLMs to follow complex instructions.
### Benchmarks & Prompts
The dataset has 2 variants:
1. `BigCodeBench-Complete`: _Code Completion based on the structured docstrings_.
2. `BigCodeBench-Instruct`: _Code Generation based on the NL-oriented instructions_.
The figure below shows an example of a `Complete` prompt versus an `Instruct` prompt. For `Instruct`, we focus only on instruction-tuned LLMs.
The specific prompt template can be found [here](https://github.com/bigcode-project/bigcodebench/blob/main/bigcodebench/model.py).
There are some edge cases:
- Due to training flaws in StarCoder2 and Granite-Code, we additionally strip trailing newlines from the prompt for model inference.
- We have not included the `Instruct` results of Granite-Code-Instruct 8B & 3B, as they consistently produce empty outputs.
### Evaluation Parameters
- All models were evaluated with the [`bigcodebench`](https://github.com/bigcode-project/bigcodebench) framework, which can be installed from the [PyPI package](https://pypi.org/project/bigcodebench/).
To get started, please first set up the environment:
```bash
# Install to use bigcodebench.evaluate
pip install bigcodebench --upgrade
# If you want to run the evaluation locally, you also need to install the requirements
pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt
# Install to use bigcodebench.generate
# We strongly recommend installing the generate dependencies in a separate environment
# (the quotes keep shells such as zsh from expanding the square brackets)
pip install "bigcodebench[generate]" --upgrade
```
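
After installing, it can be useful to confirm which version ended up in the active environment, especially when the evaluate and generate dependencies live in separate environments. The snippet below is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and not part of `bigcodebench`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    # Return the installed version string of a distribution,
    # or None when it is not installed in the current environment.
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("bigcodebench"))
```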
### Scoring and Rankings
- Models are ranked according to Pass@1 using greedy decoding. Setup details can be found here.
- The code to compute the Elo ratings is available [here](https://github.com/bigcode-project/bigcodebench/blob/main/analysis/get_results.py) and is based on the [Chatbot Arena Notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR#scrollTo=JdiJbB6pZB1B&line=2&uniqifier=1). We only compute the Elo rating for the `BigCodeBench-Complete` variant.
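
The two scoring ideas above can be sketched as follows. This is an illustration, not the code in the linked script. `pass_at_k` is the standard unbiased pass@k estimator; with greedy decoding there is one sample per task, so Pass@1 reduces to the fraction of tasks solved. The Elo part is a classic online update; the pairing rule (one battle per shared task, tie when both models pass or both fail) and the `k`, `scale`, and `base` values are assumptions chosen for illustration, and all function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimate for one task: n samples, c of them correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(passed: list[bool]) -> float:
    # Greedy decoding yields one sample per task (n=1, k=1),
    # so the benchmark score is the mean of per-task 0/1 outcomes.
    return sum(pass_at_k(1, int(p), 1) for p in passed) / len(passed)


def battles_from_results(results_a: dict[str, bool],
                         results_b: dict[str, bool]) -> list[float]:
    # Assumed pairing rule: one battle per shared task.
    # Score for model A: 1.0 win, 0.0 loss, 0.5 tie.
    scores = []
    for task_id in sorted(results_a.keys() & results_b.keys()):
        a, b = results_a[task_id], results_b[task_id]
        scores.append(0.5 if a == b else (1.0 if a else 0.0))
    return scores


def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 4.0, scale: float = 400.0,
               base: float = 10.0) -> tuple[float, float]:
    # One online Elo step; the constants are illustrative assumptions.
    expected_a = 1.0 / (1.0 + base ** ((r_b - r_a) / scale))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta


# Hypothetical per-task outcomes for two models.
model_a = {"t1": True, "t2": False, "t3": True}
model_b = {"t1": True, "t2": True, "t3": False}

print(benchmark_pass_at_1(list(model_a.values())))

rating_a, rating_b = 1000.0, 1000.0
for score in battles_from_results(model_a, model_b):
    rating_a, rating_b = elo_update(rating_a, rating_b, score)
print(rating_a, rating_b)
```

Online Elo depends on the order in which battles are processed, which is one reason rating pipelines often average over many shuffled or resampled orderings.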
### Contact
If you have any questions, feel free to reach out to us at [terry.zhuo@monash.edu](mailto:terry.zhuo@monash.edu) or [contact@bigcode-project.org](mailto:contact@bigcode-project.org).
### Citation Information
```bibtex
@article{bigcodebench,
title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
author={Zhuo, Terry Yue and Vu, Min Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and Brunner, Simon and Gong, Chen and Hoang, Thong and Zebaze, Armel Randy and Hong, Xiaoheng and Li, Wen-Ding and Kaddour, Jean and Xu, Ming and Zhang, Zhihan and Yadav, Prateek and Jain, Naman and Gu, Alex and Cheng, Zhoujun and Liu, Jiawei and Liu, Qian and Wang, Zijian and Lo, David and Hui, Binyuan and Muennighoff, Niklas and Fried, Daniel and Du, Xiaoning and de Vries, Harm and Von Werra, Leandro},
year={2024}
}
```
"""
SUBMISSION_TEXT = """