<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Memorization or Generation of Big Code Model Leaderboard</title>
<link rel="stylesheet" href="style.css">
<script src="echarts.min.js"></script>
</head>
<body>
<section class="section_title">
<h1>
<span style="color: rgb(223, 194, 25);">Memorization</span> or
<span style="color: rgb(223, 194, 25);">Generation</span>
of Big
<span style="color: rgb(223, 194, 25);">Code</span>
Models
<span style="color: rgb(223, 194, 25);">Leaderboard</span>
</h1>
<div class="section_title__imgs">
<a href="https://github.com/YihongDong/CDD-TED4LLMs" id="a_github" target="_blank">
<img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white">
</a>
<a href="https://arxiv.org/abs/2402.15938" id="a_arxiv" target="_blank">
<img src="https://img.shields.io/badge/PAPER-ACL'24-ad64d4.svg?style=for-the-badge">
</a>
</div>
<div class="section_title__p">
<p>
Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and
<a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard 🏋️</a>,
we compare the performance of base code generation models on the
<a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
<a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a> benchmarks.
We also report each model's Memorization-Generalization Index (MGI).
We compare both open-source and closed-source pre-trained code LLMs that can serve as base models for further training.
</p>
</div>
</section>
<section class="section_button">
<button id="btn_evalTable">π Evalution Table</button>
<button id="btn_plot">π Performance Plot</button>
<button id="btn_about">π About</button>
<button id="btn_submit">π Submit results</button>
<button id="btn_more">π€ More Leaderboards</button>
</section>
<section class="section_evalTable" id="sec_evalTable">
<div class="section_evalTable__table">
<table id="evalTable">
<colgroup>
<col style="width: 25%">
<col style="width: 15%">
<col style="width: 15%">
<col style="width: 15%">
<col style="width: 15%">
<col style="width: 15%">
</colgroup>
<thead>
<!-- <th rowspan="2">Benchmark</th> -->
<th rowspan="2" id="th_model">Model
<button class="button_sort" data-direction="desc" data-type="name"></button>
</th>
<th data-direction="desc" rowspan="2" data-type="MGI">MGI
<button class="button_sort" data-direction="desc" data-type="MGI"></button>
</th>
<th colspan="2">Pass@1(temp=0)</th>
<th colspan="2">Pass@1(temp=0.8)</th>
<tr>
<th>HumanEval
<button class="button_sort" data-direction="desc" data-type="temp0_HumanEval"></button>
</th>
<th>HumanEval-ET
<button class="button_sort" data-direction="desc" data-type="temp0_HumanEval_ET"></button>
</th>
<th>HumanEval
<button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval"></button>
</th>
<th>HumanEval-ET
<button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval_ET"></button>
</th>
</tr>
</thead>
<tbody>
</tbody>
</table>
<script src="table.js"></script>
</div>
<div class="section_evalTable__notes">
<p><strong>Notes</strong></p>
<ul>
<li>MGI stands for Memorization-Generalization Index, which is derived from Avg. Peak in the original paper. A higher MGI value indicates a greater propensity for a model to engage in memorization as opposed to generalization.</li>
<li>For more details, check the About section.</li>
</ul>
</div>
</section>
<section class="section_plot" id="sec_plot">
<div style="display: flex;">
<div class="section_plot__div" id="sec_plot__div1">
<div class="section_plot__btnGroup" id="sec_plot__btnGroup1">
<button id="btn_temp0_HumanEval"></button>
<span id="span_temp0_HumanEval">Pass@1 (temp = 0)</span>
<button id="btn_temp0_8_HumanEval"></button>
<span id="span_temp0_8_HumanEval">Pass@1 (temp = 0.8)</span>
</div>
<div id="sec_plot__chart1" style="width:716.5px; height:550px;"></div>
</div>
<div class="section_plot__div" id="sec_plot__div2">
<div class="section_plot__btnGroup" id="sec_plot__btnGroup2">
<button id="btn_temp0_HumanEval_ET"></button>
<span id="span_temp0_HumanEval_ET">Pass@1 (temp = 0)</span>
<button id="btn_temp0_8_HumanEval_ET"></button>
<span id="span_temp0_8_HumanEval_ET">Pass@1 (temp = 0.8)</span>
</div>
<div id="sec_plot__chart2" style="width:716.5px; height:550px;"></div>
</div>
</div>
<script src="chart.js"></script>
</section>
<section class="section_about" id="sec_about">
<h3>Benchmarking and Prompts</h3>
<!-- <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
reliably benchmark their capabilities.
Similar to the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">π€ Open LLM Leaderboard</a>,
we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p> -->
<ul>
<li><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a>: Used to measure the functional correctness of programs generated from docstrings. It includes 164 Python programming problems.
</li>
<li><a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>: The extended version of HumanEval benchmark, where each task includes more than 100 test cases.
</li>
</ul>
<p>
For all models (except the Starcoder family), we used the original benchmark prompts from HumanEval and prepended a `&lt;bos&gt;` token to each prompt.
The maximum generation length was set to the length of the original prompt plus 300 tokens.
</p>
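<p>
As a rough illustration of this setup (a sketch under assumptions, not the exact evaluation script; the model name below is a placeholder), the prompt preparation can be written in Python with the transformers and datasets libraries:
</p>
<pre><code># Sketch of the prompt setup described above; the model name is a placeholder.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")  # placeholder model
problems = load_dataset("openai_humaneval", split="test")

task = problems[0]
# Use the original HumanEval prompt and prepend the BOS token.
prompt = tokenizer.bos_token + task["prompt"]

# Generation budget: length of the original prompt plus 300 tokens.
prompt_len = len(tokenizer(task["prompt"])["input_ids"])
max_length = prompt_len + 300
</code></pre>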
<p>
For the Starcoder family models (such as <a href="https://huggingface.co/bigcode/starcoder2-7b" target="_blank">Starcoder2-7b</a> and <a href="https://huggingface.co/bigcode/starcoder2-15b" target="_blank">Starcoder2-15b</a>),
we used the official bigcode-evaluation-harness for generation.
More details can be found <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/" target="_blank">here</a>.
</p>
<h3>Evaluation Parameters</h3>
<p>
For all models, we generated 1 sample at temperature 0 and 50 samples at temperature 0.8 for the subsequent result calculations.
The parameters are set as follows (a code sketch follows the list):
</p>
<ul>
<li>top-p=1.0 (default parameter in the transformers library)</li>
<li>top-k=50 (default parameter in the transformers library)</li>
<li>max_length_generation=len(prompt)+300</li>
<li>temperature=0 or temperature=0.8</li>
<li>n_samples=50</li>
</ul>
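<p>
A minimal sketch of how these parameters map onto a transformers generate call (an illustration under assumptions, not the authors' exact code; the model name and prompt below are placeholders):
</p>
<pre><code># Sketch of the sampling configuration listed above; model and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n"  # stand-in for a HumanEval prompt
inputs = tokenizer(tokenizer.bos_token + prompt, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[1]

# temperature = 0: greedy decoding, a single sample per problem
greedy = model.generate(**inputs, do_sample=False, max_length=prompt_len + 300)

# temperature = 0.8: 50 samples per problem with the transformers defaults top_p=1.0, top_k=50
sampled = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=1.0,
    top_k=50,
    num_return_sequences=50,
    max_length=prompt_len + 300,
)
completions = tokenizer.batch_decode(sampled[:, prompt_len:], skip_special_tokens=True)
</code></pre>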
<h3>Performance Metrics</h3>
<ul>
<li>pass@k: The probability that the model successfully solves a test problem at least once in `k` attempts (see the estimator sketch below the list).</li>
<li>MGI: The average peakedness of the edit-distance distribution constructed from the model's samples.</li>
</ul>
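<p>
For reference, pass@k is typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021); the sketch below illustrates that estimator and is not necessarily the exact evaluation script used here. MGI is computed as described in the linked paper.
</p>
<pre><code># Unbiased pass@k estimator (Chen et al., 2021): probability that at least one of
# k draws from n generated samples (of which c pass the tests) is correct.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if k > n - c:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 50 samples at temperature 0.8, 12 of which pass all tests.
print(pass_at_k(n=50, c=12, k=1))  # 0.24, i.e. the fraction of passing samples when k = 1
</code></pre>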
</section>
<section class="section_submit" id="sec_submit">
<h2>How to submit models/results to the leaderboard?</h2>
<div>
<p>We welcome the community to submit evaluation results for new models.
These results will be added as non-verified; however, authors are required to upload their generations so that other members can check them.
</p>
<p>
To submit your results, create a <span style="font-weight: bold;">Pull Request</span> in the community tab to add them under the
<a href="https://github.com/YihongDong/CDD-TED4LLMs" target="_blank">folder</a> <span class="span_">community_results</span> in the repository:
</p>
<ul>
<li>Create a folder called <span class="span_">ORG_MODELNAME_USERNAME</span> for example <span class="span_">meta_CodeLlama_xxx</span>.</li>
<li>Put the generation outputs of your model in it.</li>
</ul>
<p>The title of the PR should be <span class="span_">[Community Submission] Model: org/model, Username: your_username</span>, where org and model are replaced with those of the model you evaluated.</p>
</div>
</section>
<section class="section_more" id="sec_more">
<h2>Context</h2>
<p>In addition to the Memorization or Generation of Big Code Models Leaderboard, we recommend building a comprehensive
picture of LLM coding ability through a diverse set of benchmarks and leaderboards, such as:
</p>
<ul>
<li><a href="https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard" target="_blank">Big Code Models Leaderboard</a></li>
<li><a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard" target="_blank">Chatbot Arena Leaderboard</a></li>
<li><a href="https://fudanselab-classeval.github.io/leaderboard.html" target="_blank">ClassEval</a></li>
<li><a href="https://bigcode-bench.github.io" target="_blank">Code Lingua</a></li>
<li><a href="https://github.com/amazon-science/cceval" target="_blank">CrossCodeEval</a></li>
<li><a href="https://crux-eval.github.io/leaderboard.html" target="_blank">CRUXEval</a></li>
<li><a href="https://evalplus.github.io/leaderboard.html" target="_blank">EvalPlus Leaderboard</a></li>
<li><a href="https://evo-eval.github.io" target="_blank">Evo-Eval</a></li>
<li><a href="https://github.com/01-ai/HumanEval.jl" target="_blank">HumanEval.jl - Julia version HumanEval with EvalPlus test cases</a></li>
<li><a href="https://infi-coder.github.io/infibench/" target="_blank">InfiBench</a></li>
<li><a href="https://livecodebench.github.io/leaderboard.html" target="_blank">LiveCodeBench</a></li>
<li><a href="https://github.com/THUDM/NaturalCodeBench" target="_blank">NaturalCodeBench</a></li>
<li><a href="https://www.swebench.com" target="_blank">SWE-bench</a></li>
<li><a href="https://leaderboard.tabbyml.com" target="_blank">TabbyML Leaderboard</a></li>
<li><a href="https://github.com/Leolty/repobench" target="_blank">RepoBench</a></li>
<li><a href="https://github.com/alphadl/OOP-eval" target="_blank">OOP</a></li>
</ul>
</section>
<footer>
</footer>
<script src="button.js"></script>
</body>
</html> |