---
title: README
emoji: 👀
colorFrom: yellow
colorTo: purple
sdk: static
pinned: false
---
<p>
    <img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/codeparrot_logo.png" alt="drawing" width="440"/>
</p>
			
<p>This organization is dedicated to language models for code generation. In particular, CodeParrot is a GPT-2 model trained to generate Python code.</p>
<h2>Table of contents:</h2>
<br>

<ul>

<li>
<p>
    Interactive blog, where we compare different code models and explain how they are trained and evaluated: <a
    href="https://huggingface.co/spaces/loubnabnl/code-generation-models"
    class="underline">Code generation with 🤗</a
    >
</p>
</li>
<br>
<li>
<p>
Spaces: code generation with <a href="https://huggingface.co/codeparrot/codeparrot" class="underline">CodeParrot</a> (1.5B), <a href="https://huggingface.co/facebook/incoder-6B" class="underline">InCoder</a> (6B) and <a href="https://github.com/salesforce/CodeGen" class="underline">CodeGen</a> (6B)
</p>
</li>
<br>
<li>Models: CodeParrot (1.5B) and CodeParrot-small (110M). Each repo has different ongoing experiments in its branches.</li>
<br>
<li>Datasets:<ul>
<li><a href="https://huggingface.co/datasets/codeparrot/codeparrot-clean" class="underline">codeparrot-clean</a>, the dataset on which we trained and evaluated CodeParrot. The splits are available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-clean-train" class="underline">codeparrot-clean-train</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid" class="underline">codeparrot-clean-valid</a>.</li>
<li>A more heavily filtered version of codeparrot-clean, available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering" class="underline">codeparrot-train-more-filtering</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering" class="underline">codeparrot-valid-more-filtering</a>.</li>
<li>The CodeParrot dataset after near deduplication (initially only exact-match deduplication was performed), available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-near-deduplication" class="underline">codeparrot-train-near-deduplication</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-near-deduplication" class="underline">codeparrot-valid-near-deduplication</a>.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/github-code" class="underline">GitHub-Code</a>, a 1TB dataset of GitHub files in 32 programming languages, covering 60 file extensions.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/github-jupyter" class="underline">GitHub-Jupyter</a>, a 16.3GB dataset of Jupyter Notebooks from BigQuery GitHub.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/apps" class="underline">APPS</a>, a benchmark for code generation with 10,000 problems.</li>
</ul>
</li>
</ul>
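
<p>As a rough illustration of how the dataset splits listed above might be inspected, here is a minimal sketch that streams a few examples with the <code>datasets</code> library. The helper name <code>peek</code> and the <code>SPLITS</code> mapping are hypothetical, and the sketch assumes each dataset repo exposes a single <code>train</code> split; check the dataset cards before relying on it.</p>

```python
from itertools import islice

# Repository ids copied from the dataset links above.
SPLITS = {
    "train": "codeparrot/codeparrot-clean-train",
    "valid": "codeparrot/codeparrot-clean-valid",
}


def peek(split="valid", n=2):
    """Return the first n examples of a split, streamed from the Hub.

    Hypothetical helper for illustration; it needs network access and the
    `datasets` package installed.
    """
    # Imported lazily so SPLITS can be used without the library installed.
    from datasets import load_dataset

    # Assumption: each repo stores its data under a split named "train".
    ds = load_dataset(SPLITS[split], split="train", streaming=True)
    # Streaming yields examples one by one, so nothing is fully downloaded.
    return list(islice(ds, n))
```

<p>Calling <code>peek("valid", 2)</code> should return a list of two dictionaries, one per source file. Streaming is used here because the training split is large, so a full download is rarely needed for a quick look.</p>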