---
title: README
emoji: πŸ‘€
colorFrom: yellow
colorTo: purple
sdk: static
pinned: false
---
<p>
    <img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/codeparrot_logo.png" alt="drawing" width="440"/>
</p>

<p>This organization is dedicated to language models for code generation. In particular, CodeParrot is a GPT-2 model trained to generate Python code. Here you can find:</p>
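
<p>As a quick start, the CodeParrot checkpoints can be used with the πŸ€— Transformers library. Below is a minimal sketch, assuming <code>transformers</code> and a PyTorch backend are installed; the prompt and generation settings are illustrative, not tuned:</p>

<pre><code class="language-python"># Minimal sketch: generate Python code with CodeParrot via transformers.
from transformers import pipeline

# "codeparrot/codeparrot-small" is the 110M checkpoint; swap in
# "codeparrot/codeparrot" for the full 1.5B model.
pipe = pipeline("text-generation", model="codeparrot/codeparrot-small")

prompt = "def fibonacci(n):"
out = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.2)
print(out[0]["generated_text"])
</code></pre>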

<ul>
<li>
<p><b>Interactive blog:</b> <a href="https://huggingface.co/spaces/loubnabnl/code-generation-models" class="underline">Code generation with πŸ€—</a>, where we compare different code models and explain how they are trained and evaluated.</p>
</li>

<li>
<p><b>Spaces:</b></p>
<ul>
<li>Code generation with <a href="https://huggingface.co/codeparrot/codeparrot" class="underline">CodeParrot (1.5B)</a>, <a href="https://huggingface.co/facebook/incoder-6B" class="underline">InCoder (6B)</a> and <a href="https://github.com/salesforce/CodeGen" class="underline">CodeGen (6B)</a>.</li>
<li>Spaces for some code downstream tasks: algorithmic complexity prediction (Big O), code explanation, and code generation from English text.</li>
</ul>
</li>

<li><b>Models:</b> CodeParrot (1.5B) and CodeParrot-small (110M); each model repository hosts different ongoing experiments in its branches.</li>

<li><b>Metrics:</b> the <a href="https://huggingface.co/spaces/codeparrot/apps_metric" class="underline">APPS metric</a> for evaluating code models on the <a href="https://huggingface.co/datasets/codeparrot/apps" class="underline">APPS</a> benchmark (see the usage sketch after this list).</li>

<li><b>Datasets</b> (a loading sketch follows this list):<ol>
<li><a href="https://huggingface.co/datasets/codeparrot/codeparrot-clean" class="underline">codeparrot-clean</a>, the dataset on which we trained and evaluated CodeParrot; the splits are available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-clean-train" class="underline">codeparrot-clean-train</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid" class="underline">codeparrot-clean-valid</a>.</li>
<li>A more filtered version of codeparrot-clean, available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering" class="underline">codeparrot-train-more-filtering</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering" class="underline">codeparrot-valid-more-filtering</a>.</li>
<li>The CodeParrot dataset after near deduplication (initially only exact-match deduplication was performed), available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-near-deduplication" class="underline">codeparrot-train-near-deduplication</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-near-deduplication" class="underline">codeparrot-valid-near-deduplication</a>.</li>
<li>The CodeParrot dataset after both near deduplication and the additional filtering, available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-v2-near-dedup" class="underline">codeparrot-train-v2-near-dedup</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-v2-near-dedup" class="underline">codeparrot-valid-v2-near-dedup</a>.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/github-code" class="underline">GitHub-Code</a>, a 1TB dataset of GitHub files in 32 programming languages.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/github-code-clean" class="underline">GitHub-Code-Clean</a>, a cleaner version of the GitHub-Code dataset.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/github-jupyter" class="underline">GitHub-Jupyter</a>, a 16.3GB dataset of Jupyter notebooks from BigQuery GitHub.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/github-jupyter-text-code-pairs" class="underline">github-jupyter-text-code-pairs</a>, a dataset of text and code pairs extracted from Jupyter notebooks; it is a parsed version of the github-jupyter dataset.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/apps" class="underline">APPS</a>, a benchmark for code generation with 10,000 problems.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/codecomplex" class="underline">CodeComplex</a>, an annotated dataset of 4,200 Java programs and their time complexities.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/xlcost-text-to-code" class="underline">XLCOST-text-to-code</a>, a subset of the XLCoST benchmark for text-to-code generation at snippet and program level in 7 programming languages: Python, C, C#, C++, Java, JavaScript and PHP.</li>
</ol>
</li>
</ul>
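
<p>Most of these datasets are large, so streaming mode avoids downloading them in full. A minimal loading sketch, assuming the πŸ€— <code>datasets</code> library is installed; the field name used below follows the dataset card:</p>

<pre><code class="language-python"># Minimal sketch: stream the CodeParrot training data instead of
# downloading the full dataset to disk.
from datasets import load_dataset

ds = load_dataset(
    "codeparrot/codeparrot-clean-train", split="train", streaming=True
)

# Each record is one source file; "content" holds the file text
# (field name per the dataset card).
first = next(iter(ds))
print(first["content"][:200])
</code></pre>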
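
<p>The APPS metric can likewise be loaded through the πŸ€— <code>evaluate</code> library directly from its Space. A minimal sketch with toy inputs; the exact <code>compute</code> arguments follow the metric card, so treat this as an assumption to check there:</p>

<pre><code class="language-python"># Minimal sketch: score model generations on the APPS benchmark.
import evaluate

# Load the metric module hosted in the codeparrot/apps_metric Space.
apps_metric = evaluate.load("codeparrot/apps_metric")

# One list of candidate solutions (strings) per APPS problem;
# two toy problems shown here for illustration only.
generations = [["def solve():\n    return 0"], ["print('hello')"]]

# "level" selects the difficulty split to evaluate on.
results = apps_metric.compute(predictions=generations, level="all")
print(results)
</code></pre>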