<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Memorization or Generation of Big Code Model Leaderboard</title>
    <link rel="stylesheet" href="style.css">
    <script src="echarts.min.js"></script>
</head>

<body>

    <section class="section_title">
        <h1>
            ⭐ <span style="color: rgb(223, 194, 25);">Memorization</span> or 
            <span style="color: rgb(223, 194, 25);">Generation</span>
             of Big 
             <span style="color: rgb(223, 194, 25);">Code</span>
              Models 
              <span style="color: rgb(223, 194, 25);">Leaderboard</span>
        </h1>

        <div class="section_title__imgs">
            <a href="https://github.com/YihongDong/CDD-TED4LLMs" id="a_github" target="_blank">
                <img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white">
            </a>
            <a href="https://arxiv.org/abs/2402.15938" id="a_arxiv" target="_blank">
                <img src="https://img.shields.io/badge/PAPER-ACL'24-ad64d4.svg?style=for-the-badge">
            </a>
        </div>

        <div class="section_title__p">
            <p>
                Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and
                <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard 🏋️</a>,
                we compare the performance of base code generation models on the
                <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
                <a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a> benchmarks.
                We also measure the Memorization-Generalization Index (MGI) and report it for each model.
                We compare both open-source and closed-source pre-trained code LLMs that can serve as base models for further training.
            </p>
        </div>
    </section>

    <section class="section_button">
        <button id="btn_evalTable">πŸ” Evalution Table</button>
        <button id="btn_plot">πŸ“Š Performance Plot</button>
        <button id="btn_about">πŸ“ About</button>
        <button id="btn_submit">πŸš€ Submit results</button>
        <button id="btn_more">πŸ€— More Leaderboards</button>
    </section>

    <section class="section_evalTable" id="sec_evalTable">
        <div class="section_evalTable__table">
            <table id="evalTable">
                <colgroup>
                    <col style="width: 25%">
                    <col style="width: 15%">
                    <col style="width: 15%">
                    <col style="width: 15%">
                    <col style="width: 15%">
                    <col style="width: 15%">
                </colgroup>

                <thead>
                    <tr>
                        <!-- <th rowspan="2">Benchmark</th> -->
                        <th rowspan="2" id="th_model">Model
                            <button class="button_sort" data-direction="desc" data-type="name"></button>
                        </th>
                        <th data-direction="desc" rowspan="2" data-type="MGI">MGI
                            <button class="button_sort" data-direction="desc" data-type="MGI"></button>
                        </th>
                        <th colspan="2">Pass@1 (temp=0)</th>
                        <th colspan="2">Pass@1 (temp=0.8)</th>
                    </tr>
                    <tr>
                        <th>HumanEval
                            <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval"></button>
                        </th>
                        <th>HumanEval-ET
                            <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval_ET"></button>
                        </th>
                        <th>HumanEval
                            <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval"></button>
                        </th>
                        <th>HumanEval-ET
                            <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval_ET"></button>
                        </th>
                    </tr>  
                </thead>
    
                <tbody>
                    
                </tbody>
            </table>
            <script src="table.js"></script>
        </div>

        <div class="section_evalTable__notes">
            <p><strong>Notes</strong></p>
            <ul>
                <li>MGI stands for Memorization-Generalization Index, which is derived from Avg. Peak in the original paper.&ensp;A higher MGI value indicates a greater propensity for a model to engage in memorization as opposed to generalization.</li>
                <li>For more details, check the 📝 About section.</li>
            </ul>
        </div>
    </section>

    <section class="section_plot" id="sec_plot">
        <div style="display: flex;">
            <div class="section_plot__div" id="sec_plot__div1">
                <div class="section_plot__btnGroup" id="sec_plot__btnGroup1">
                    <button id="btn_temp0_HumanEval"></button>
                    <span id="span_temp0_HumanEval">Pass@1 (temp = 0)</span>
                    <button id="btn_temp0_8_HumanEval"></button>
                    <span id="span_temp0_8_HumanEval">Pass@1 (temp = 0.8)</span>
                </div>
                <div id="sec_plot__chart1" style="width:716.5px; height:550px;"></div>
            </div>
            
            <div class="section_plot__div" id="sec_plot__div2">
                <div class="section_plot__btnGroup" id="sec_plot__btnGroup2">
                    <button id="btn_temp0_HumanEval_ET"></button>
                    <span id="span_temp0_HumanEval_ET">Pass@1 (temp = 0)</span>
                    <button id="btn_temp0_8_HumanEval_ET"></button>
                    <span id="span_temp0_8_HumanEval_ET">Pass@1 (temp = 0.8)</span>
                </div>
                <div id="sec_plot__chart2" style="width:716.5px; height:550px;"></div>
            </div>
        </div>
        <script src="chart.js"></script>
    </section>


    <section class="section_about" id="sec_about">
        <h3>Benchmarking and Prompts</h3>
            <!-- <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
                reliably benchmark their capabilities.
                Similar to the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a>, 
                we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p> -->
        <ul>
            <li><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a>:&ensp;Used to measure the functional correctness of programs generated from docstrings. It includes 164 Python programming problems.
            </li>
            <li><a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>:&ensp;The extended version of HumanEval benchmark, where each task includes more than 100 test cases.
            </li>
        </ul>
        <p>
            For all models (except for the StarCoder family), we used the original benchmark prompts from HumanEval and added a <span class="span_">&lt;bos&gt;</span> token before the provided prompt.
            The maximum generation length was set to the length of the original prompt plus 300 tokens.
        </p>
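        <p>
            As a minimal sketch (not the exact evaluation code; the model id and prompt below are illustrative placeholders), this setup maps onto the 🤗 transformers API roughly as follows:
        </p>
        <pre><code># Illustrative sketch of the prompt setup described above; the model id and
# prompt are placeholders, not the exact evaluation code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen-350M-mono"  # any causal code LM, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'  # stands in for a HumanEval prompt
bos = tokenizer.bos_token or ""  # prepend the &lt;bos&gt; token when the tokenizer defines one
inputs = tokenizer(bos + prompt, return_tensors="pt", add_special_tokens=False)
prompt_len = inputs["input_ids"].shape[1]

outputs = model.generate(
    **inputs,
    max_length=prompt_len + 300,  # original prompt length plus 300 tokens
    do_sample=False,              # greedy decoding, i.e. temperature 0
)
completion = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
</code></pre>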
        <p>
            For the StarCoder family models (such as <a href="https://huggingface.co/bigcode/starcoder2-7b" target="_blank">StarCoder2-7B</a> and <a href="https://huggingface.co/bigcode/starcoder2-15b" target="_blank">StarCoder2-15B</a>),
            we used the official bigcode-evaluation-harness for generation.
            More details can be found <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/" target="_blank">here</a>.
        </p>
        <h3>Evaluation Parameters</h3>
        <p>
            For all models, we drew 1 sample at temperature 0 and 50 samples at temperature 0.8
            for the subsequent result calculations. The parameters are set as follows (a sketch of the corresponding generation call appears after the list):
        </p>
        <ul>
            <li>top-p=1.0 (default parameter in the transformers library)</li>
            <li>top-k=50 (default parameter in the transformers library)</li>
            <li>max_length_generation=len(prompt)+300</li>
            <li>temperature=0 or temperature=0.8</li>
            <li>n_samples=50</li>
        </ul>
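        <p>
            Continuing the sketch above, these settings correspond roughly to the following generation call (illustrative only; the StarCoder-family runs went through bigcode-evaluation-harness instead):
        </p>
        <pre><code># Sampling configuration for the temperature-0.8 runs; a sketch under the
# parameters listed above, not the exact harness invocation.
samples = model.generate(
    **inputs,
    do_sample=True,           # use do_sample=False for the temperature-0 run
    temperature=0.8,
    top_p=1.0,                # transformers default
    top_k=50,                 # transformers default
    max_length=prompt_len + 300,
    num_return_sequences=50,  # n_samples=50
)
</code></pre>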
        <h3>Performance Metrics</h3>
        <ul>
            <li>pass@k:&ensp;The probability that the model solves a test problem at least once within <span class="span_">k</span> sampled attempts.</li>
            <li>MGI:&ensp;The average peakedness of the edit distance distribution constructed from the model's samples.</li>
        </ul>
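        <p>
            For reference, pass@k is typically computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021); a minimal sketch:
        </p>
        <pre><code>import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c of them correct."""
    if k > n - c:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 12 of 50 temperature-0.8 samples pass all tests:
print(pass_at_k(n=50, c=12, k=1))  # 0.24, i.e. c/n when k=1
</code></pre>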
    </section>

    <section class="section_submit" id="sec_submit">
        <h2>How to submit models/results to the leaderboard?</h2>
        <div>
            <p>We welcome the community to submit evaluation results for new models.
                These results will be added as non-verified; however, the authors are required to upload their generations so that other members can check them.
            </p>
            <p>
                To submit your results, create a <span style="font-weight: bold;">Pull Request</span> in the community tab to add them under the
                <a href="https://github.com/YihongDong/CDD-TED4LLMs" target="_blank">folder</a> <span class="span_">community_results</span> in the repository:
            </p>
            <ul>
                <li>Create a folder called <span class="span_">ORG_MODELNAME_USERNAME</span>, for example <span class="span_">meta_CodeLlama_xxx</span>.</li>
                <li>Put the generation outputs of your model in it.</li>
            </ul>
            <p>The title of the PR should be <span class="span_">[Community Submission] Model: org/model, Username: your_username</span>, replacing org and model with those of the model you evaluated.</p>
        </div>
    </section>

    <section class="section_more" id="sec_more">
        <h2>Context</h2>
        <p>In addition to the Memorization or Generation of Big Code Models Leaderboard, we recommend using a diverse set of benchmarks and leaderboards to comprehensively
            understand LLM coding ability, such as:
        </p>
        <ul>
            <li><a href="https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard" target="_blank">Big Code Models Leaderboard</a></li>
            <li><a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard" target="_blank">Chatbot Arena Leaderboard</a></li>
            <li><a href="https://fudanselab-classeval.github.io/leaderboard.html" target="_blank">ClassEval</a></li>
            <li><a href="https://bigcode-bench.github.io" target="_blank">Code Lingua</a></li>
            <li><a href="https://github.com/amazon-science/cceval" target="_blank">CrossCodeEval</a></li>
            <li><a href="https://crux-eval.github.io/leaderboard.html" target="_blank">CRUXEval</a></li>
            <li><a href="https://evalplus.github.io/leaderboard.html" target="_blank">EvalPlus Leaderboard</a></li>
            <li><a href="https://evo-eval.github.io" target="_blank">Evo-Eval</a></li>
            <li><a href="https://github.com/01-ai/HumanEval.jl" target="_blank">HumanEval.jl - Julia version HumanEval with EvalPlus test cases</a></li>
            <li><a href="https://infi-coder.github.io/infibench/" target="_blank">InfiBench</a></li>
            <li><a href="https://livecodebench.github.io/leaderboard.html" target="_blank">LiveCodeBench</a></li>
            <li><a href="https://github.com/THUDM/NaturalCodeBench" target="_blank">NaturalCodeBench</a></li>
            <li><a href="https://www.swebench.com" target="_blank">SWE-bench</a></li>
            <li><a href="https://leaderboard.tabbyml.com" target="_blank">TabbyML Leaderboard</a></li>
            <li><a href="https://github.com/Leolty/repobench" target="_blank">RepoBench</a></li>
            <li><a href="https://github.com/alphadl/OOP-eval" target="_blank">OOP</a></li>
        </ul>
    </section>



    <footer>
    </footer>

    <script src="button.js"></script>
</body>

</html>