|
--- |
|
license: mit |
|
datasets: |
|
- flytech/python-codes-25k |
|
tags: |
|
- code |
|
language: |
|
- en |
|
library_name: transformers |
|
--- |
|
# GPT2 PyCode |
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
This model is a fine-tuned version of the GPT-2 124M model, adapted for experimental Python code generation. It was trained on a small corpus of 25,000 Python code samples. |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
This project features a GPT-2 (Generative Pre-trained Transformer) language model with 124 million parameters that has been fine-tuned for Python code generation. Unlike larger GPT-2 variants or models such as GPT-3, this is a small-scale model intended primarily for testing and experimentation. |
|
|
|
- **Developed by:** Maharnab Saikia |
|
- **Model type:** Language model |
|
- **Language(s) (NLP):** English |
|
- **License:** MIT |
|
- **Finetuned from model:** GPT2 124M |
|
|
|
## Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
|
|
- **Research:** Studying the behavior of small-scale language models in code generation tasks |
|
- **Benchmarking:** Providing a baseline for comparing different model architectures or training strategies |
|
- **Rapid Prototyping:** Quick tests of code generation ideas without the overhead of larger models (see the `pipeline` sketch after this list) |
|
- **Education:** Demonstrating the principles of fine-tuning language models for specific tasks |
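
For quick experiments, the model can also be driven through the `transformers` text-generation pipeline. The snippet below is a minimal sketch that assumes the same `<sos><user>…</user><assistant>` prompt template shown in the Getting Started section; see that section for the full generation settings and output post-processing.

```python
from transformers import pipeline

# High-level pipeline for quick experiments (loads model and tokenizer together).
generator = pipeline("text-generation", model="maharnab/gpt2_pycode")

prompt = "How to reverse a string in Python."
# Wrap the prompt in the chat-style template the model was fine-tuned on.
result = generator(
    f"<sos><user>{prompt}</user><assistant>",
    max_length=256,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```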
|
|
|
## Bias, Risks, and Limitations |
|
|
|
<!-- This section is meant to convey both technical and sociotechnical limitations. --> |
|
|
|
It's crucial to understand the limitations of this model: |
|
|
|
- Limited knowledge base due to the small training corpus |
|
- May struggle with complex or specialized Python code |
|
- Not suitable for production-level code generation tasks |
|
- Performance will likely be significantly lower than that of larger, more comprehensively trained models |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python |
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import re

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned tokenizer and model.
tokenizer = GPT2Tokenizer.from_pretrained('maharnab/gpt2_pycode')
model = GPT2LMHeadModel.from_pretrained('maharnab/gpt2_pycode')
model.to(device)

prompt = "How to reverse a string in Python."

# Wrap the prompt in the chat-style template used during fine-tuning.
encoded_input = tokenizer.encode_plus(
    f"<sos><user>{prompt}</user><assistant>",
    max_length=20,
    truncation=True,
    return_tensors="pt",
).to(device)

input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']

# Sample a completion with nucleus sampling.
output = model.generate(
    input_ids,
    max_length=512,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id
)

generated_code = tokenizer.decode(output[0])

# Extract the assistant's reply from between the template tags;
# fall back to the raw decoded text if the closing tag was not generated.
match = re.search(r'<assistant>(.*?)</assistant>', generated_code, re.DOTALL)
generated_code = match.group(1) if match else generated_code

print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
``` |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
- **Model:** GPT with 124 million parameters |
|
- **Training Data:** 25,000 Python code samples from the `flytech/python-codes-25k` dataset (see the preparation sketch below) |
|
- **Fine-tuning:** Adapted specifically for Python code generation tasks |
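
To illustrate how such a corpus can be prepared, the sketch below loads `flytech/python-codes-25k` and wraps each example in the same `<sos><user>…</user><assistant>…</assistant>` template used at inference time. The column names `instruction` and `output` are assumptions, not guaranteed by this card; check the dataset page for the exact schema.

```python
from datasets import load_dataset

# Load the fine-tuning corpus (25,000 Python code samples).
dataset = load_dataset("flytech/python-codes-25k", split="train")

def format_example(example):
    # NOTE: 'instruction' and 'output' are assumed field names;
    # adjust to the actual columns listed on the dataset card.
    text = (
        f"<sos><user>{example['instruction']}</user>"
        f"<assistant>{example['output']}</assistant>"
    )
    return {"text": text}

formatted = dataset.map(format_example)
print(formatted[0]["text"][:200])
```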
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
- **Epochs:** 5 |
|
- **Batch Size:** 8 |
|
- **Learning Rate:** 5e-5 |
|
- **Context Window:** 512 |
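
For reference, a fine-tuning run with these hyperparameters could look roughly like the sketch below. This is a minimal outline using the `transformers` `Trainer`, not the exact script used to train this model; the `instruction`/`output` column names and the exact set of added special tokens are assumptions inferred from the prompt template.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default
# Assumed special tokens, inferred from the inference-time prompt template.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<sos>", "<user>", "</user>", "<assistant>", "</assistant>"]}
)

model = GPT2LMHeadModel.from_pretrained("gpt2")  # 124M-parameter base model
model.resize_token_embeddings(len(tokenizer))

# Templated corpus as in the Training Data sketch ('instruction'/'output' are assumed columns).
dataset = load_dataset("flytech/python-codes-25k", split="train")
dataset = dataset.map(lambda ex: {
    "text": f"<sos><user>{ex['instruction']}</user><assistant>{ex['output']}</assistant>"
})

def tokenize(example):
    # Truncate to the 512-token context window listed above.
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="gpt2_pycode",
    num_train_epochs=5,             # Epochs: 5
    per_device_train_batch_size=8,  # Batch size: 8
    learning_rate=5e-5,             # Learning rate: 5e-5
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```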
|
|
|
## Environmental Impact |
|
|
|
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly --> |
|
|
|
Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** P100 GPU |
|
- **Hours used:** 5 |
|
- **Cloud Provider:** Kaggle |
|
- **Compute Region:** South Asia |
|
- **Carbon Emitted:** 1.15 kg CO₂eq |
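
As a rough, illustrative sanity check (not part of the original measurement), the figure is consistent with a simple power × time × grid-intensity estimate. The GPU power draw and grid carbon intensity used below are assumptions, not reported values.

```python
# Back-of-envelope emissions estimate; the power draw and grid intensity
# below are illustrative assumptions, not reported values.
gpu_power_kw = 0.25       # assumed ~250 W average draw for a P100
hours = 5                 # training time reported above
grid_kg_per_kwh = 0.92    # assumed grid carbon intensity for the compute region

energy_kwh = gpu_power_kw * hours            # 1.25 kWh
emissions_kg = energy_kwh * grid_kg_per_kwh  # ~1.15 kg CO2eq
print(f"~{emissions_kg:.2f} kg CO2eq")
```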
|
|
|
## Acknowledgements |
|
|
|
This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing. |