---

language: 
- zh
- en
tags:
- code
- autocomplete
- pytorch
- en
license: "apache-2.0"
---


# GPT2 for Code AutoComplete Model
code-autocomplete is a code completion plugin for Python.

**code-autocomplete** can automatically complete code at both line and block granularity.

## Usage

Open source repo: [code-autocomplete](https://github.com/shibing624/code-autocomplete). It supports the GPT2 model. Usage:

```python
from autocomplete.gpt2 import Infer

m = Infer(model_name="gpt2", model_dir="shibing624/code-autocomplete-gpt2-base", use_cuda=False)
i = m.predict('import torch.nn as')
print(i)
```
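
The same `Infer.predict` call also handles multi-line prompts for block-level completion; a minimal sketch reusing the API shown above (the prompt text itself is illustrative):

```python
from autocomplete.gpt2 import Infer

m = Infer(model_name="gpt2", model_dir="shibing624/code-autocomplete-gpt2-base", use_cuda=False)

# A multi-line prompt exercises block-granularity completion.
block_prompt = """import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self,"""
print(m.predict(block_prompt))
```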

Alternatively, load the model with huggingface/transformers:

*Please use the `GPT2`-related classes to load this model.*

```python
import os

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2Tokenizer.from_pretrained("shibing624/code-autocomplete-gpt2-base")
model = GPT2LMHeadModel.from_pretrained("shibing624/code-autocomplete-gpt2-base")
model.to(device)

prompts = [
    """from torch import nn
    class LSTM(Module):
        def __init__(self, *,
                     n_tokens: int,
                     embedding_size: int,
                     hidden_size: int,
                     n_layers: int):""",
    """import numpy as np
    import torch
    import torch.nn as""",
    "import java.util.ArrayList",
    "def factorial(n):",
]
for prompt in prompts:
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt').to(device)
    # Sample one completion per prompt with top-k / nucleus (top-p) sampling.
    outputs = model.generate(input_ids=input_ids,
                             max_length=64 + len(prompt),
                             temperature=1.0,
                             top_k=50,
                             top_p=0.95,
                             repetition_penalty=1.0,
                             do_sample=True,
                             num_return_sequences=1,
                             length_penalty=2.0,
                             early_stopping=True)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(decoded)
    print("=" * 20)
```

Output:
```shell
from torch import nn
    class LSTM(Module):
        def __init__(self, *,
                     n_tokens: int,
                     embedding_size: int,
                     hidden_size: int,
                     n_layers: int):
            self.embedding_size = embedding_size
====================
import numpy as np
    import torch
    import torch.nn as np
from onmt import nnumpy as np


class PredicterDNN(nn.Module):
     @classmethod
    @parameterized.expand([0.5, 2.5] + (10, 10))
    @classmethod
    @static
    def add(self, sample_rate, max_iters=self.max_iters,     mask_fre
====================
import java.util.ArrayList[Tuple[Int]],

====================
def factorial(n): number of elements per dimension,
        assert len(n) > 1
        n.append(self.n_iters)
        n = n_iter(self.n_norm)

     def _score(
====================

Process finished with exit code 0
```
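
The model also works with the high-level transformers `pipeline` API; a minimal sketch (the generation settings mirror the script above and are illustrative):

```python
from transformers import pipeline

# Build a text-generation pipeline backed by the code-autocomplete model.
generator = pipeline("text-generation", model="shibing624/code-autocomplete-gpt2-base")

completions = generator(
    "import torch.nn as",
    max_length=64,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
)
print(completions[0]["generated_text"])
```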

Model files:
```
code-autocomplete-gpt2-base
├── config.json
├── merges.txt
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.json
```
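
To inspect or cache these files locally, a minimal sketch using the `huggingface_hub` package (assuming it is installed; `snapshot_download` is its standard download helper):

```python
import os

from huggingface_hub import snapshot_download

# Download the full repository snapshot into the local Hugging Face cache.
local_dir = snapshot_download("shibing624/code-autocomplete-gpt2-base")
print(sorted(os.listdir(local_dir)))  # config.json, merges.txt, vocab.json, ...
```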

### Training data
#### Source code from the pytorch_awesome project list



Download [code-autocomplete](https://github.com/shibing624/code-autocomplete) and build the dataset:

```shell
cd autocomplete
python create_dataset.py
```
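
`create_dataset.py` turns the collected project sources into a plain-text training corpus. A rough, hypothetical sketch of the general idea (not the actual script; paths and names are illustrative):

```python
from pathlib import Path

# Hypothetical sketch: concatenate all .py files under a source tree
# into a single plain-text corpus for causal LM training.
def build_corpus(src_dir: str, out_file: str) -> None:
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).rglob("*.py")):
            out.write(path.read_text(encoding="utf-8", errors="ignore"))
            out.write("\n\n")

build_corpus("pytorch_projects", "train.txt")
```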



If you want to train the code-autocomplete GPT2 model yourself, refer to [https://github.com/shibing624/code-autocomplete/blob/main/autocomplete/gpt2.py](https://github.com/shibing624/code-autocomplete/blob/main/autocomplete/gpt2.py).
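
That script covers the full training loop; for orientation, a generic causal-LM fine-tuning recipe with transformers looks roughly like this (a hedged sketch, not the actual `gpt2.py`; hyperparameters are illustrative):

```python
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, TextDataset, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# TextDataset chunks the corpus into fixed-length blocks for CLM training.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
# mlm=False selects the causal (left-to-right) objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
```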





### About GPT2



You can test the model's full generation capabilities here: https://transformer.huggingface.co/doc/gpt2-large



GPT2 is pretrained on English text with a causal language modeling (CLM) objective. It was introduced in
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
and first released at [this page](https://openai.com/blog/better-language-models/).



Disclaimer: The team releasing GPT-2 also wrote a
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md) for their model. Content from this model card
has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.





## Citation



```latex
@misc{code-autocomplete,
  author = {Xu Ming},
  title = {code-autocomplete: Code AutoComplete with GPT model},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/shibing624/code-autocomplete},
}
```