	---
	pipeline_tag: text-generation
	inference: true
	widget:
	- text: '<commit_before>def has_close_elements(numbers: List[float], threshold: float) -> bool:\n for idx, elem in enumerate(numbers):\n for idx2, elem2 in enumerate(numbers):\n if idx != idx2:\n distance = elem - elem2\n if distance < threshold:\n return True\n\n return False<commit_msg>Fix bugs in has_close_elements.<commit_after>'
	  example_title: Fix has_close_elements
	  group: Python
	license: bigcode-openrail-m
	datasets:
	- bigcode/commits-8129-v2
	metrics:
	- code_eval
	library_name: transformers
	tags:
	- code
	model-index:
	- name: SantaCoderPack
	  results:
	  - task:
	      type: text-generation
	    dataset:
	      type: bigcode/humanevalpack
	      name: HumanEvalFix Python
	    metrics:
	    - name: pass@1
	      type: pass@1
	      value: 3.2
	      verified: false
	  - task:
	      type: text-generation
	    dataset:
	      type: bigcode/humanevalpack
	      name: HumanEvalFix JavaScript
	    metrics:
	    - name: pass@1
	      type: pass@1
	      value: 4.9
	      verified: false
	  - task:
	      type: text-generation
	    dataset:
	      type: bigcode/humanevalpack
	      name: HumanEvalFix Java
	    metrics:
	    - name: pass@1
	      type: pass@1
	      value: 1.8
	      verified: false
	  - task:
	      type: text-generation
	    dataset:
	      type: bigcode/humanevalpack
	      name: HumanEvalFix Go
	    metrics:
	    - name: pass@1
	      type: pass@1
	      value: 3.6
	      verified: false
	  - task:
	      type: text-generation
	    dataset:
	      type: bigcode/humanevalpack
	      name: HumanEvalFix C++
	    metrics:
	    - name: pass@1
	      type: pass@1
	      value: 4.2
	      verified: false
	  - task:
	      type: text-generation
	    dataset:
	      type: bigcode/humanevalpack
	      name: HumanEvalFix Rust
	    metrics:
	    - name: pass@1
	      type: pass@1
	      value: 1.7
	      verified: false
	  - task:
	      type: text-generation
	    dataset:
	      type: bigcode/humanevalpack
	      name: HumanEvalFix Average
	    metrics:
	    - name: pass@1
	      type: pass@1
	      value: 3.3
	      verified: false
	---
	![Octopack](https://github.com/bigcode-project/octopack/blob/31f3320f098703c7910e43492c39366eeea68d83/banner.png?raw=true)

	# Table of Contents

	1. [Model Summary](#model-summary)
	2. [Use](#use)
	3. [Training](#training)
	4. [Citation](#citation)

	# Model Summary

	SantaCoderPack is a pre-trained model with the same architecture as SantaCoder, trained on
	[CommitPack](https://huggingface.co/datasets/bigcode/commitpack) using this format:
	```html
	<commit_before>code_before<commit_msg>message<commit_after>
	```
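A minimal sketch of assembling a prompt in this format. The helper name `build_commit_prompt` is hypothetical and not part of any library; it only concatenates the three tokens shown above with the code and the commit message.

```python
# Hypothetical helper: not part of transformers or the OctoPack repository.
COMMIT_BEFORE = "<commit_before>"
COMMIT_MSG = "<commit_msg>"
COMMIT_AFTER = "<commit_after>"


def build_commit_prompt(code_before: str, message: str) -> str:
    """Return a prompt that ends at <commit_after> so the model writes the new code."""
    return f"{COMMIT_BEFORE}{code_before}{COMMIT_MSG}{message}{COMMIT_AFTER}"


prompt = build_commit_prompt(
    "def add(a, b):\n    return a - b\n",
    "Fix add to return the sum.",
)
print(prompt)
```

Ending the prompt at `<commit_after>` leaves the "after" side of the commit for the model to generate.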

	- Repository: [bigcode/octopack](https://github.com/bigcode-project/octopack)
	- Paper: [TODO]()
	- Languages: Python, JavaScript, Java, C++, Go, Rust
	- SantaCoderPack:
	<table>
	<tr>
	<th>Data</th>
	<th><a href=https://huggingface.co/datasets/bigcode/commitpack>CommitPack</a></th>
	<td>4TB of GitHub commits across 350 programming languages</td>
	</tr>
	<tr>
	<th>Model</th>
	<th><a href=https://huggingface.co/bigcode/santacoderpack>SantaCoderPack</a></th>
	<td>SantaCoderPack (1.1B parameters) pre-trained on CommitPack</td>
	</tr>
	<tr>
	<th>Evaluation</th>
	<th><a href=https://huggingface.co/datasets/bigcode/humanevalpack>HumanEvalPack/HumanEvalFix</a></th>
	<td>Extension of OpenAI's HumanEval to HumanEvalFix</td>
	</tr>
	</table>
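The model is evaluated with pass@1 on HumanEvalFix. This card does not say how the scores were computed; as an illustration, a common way to estimate pass@k is the unbiased estimator 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them pass the tests. The function name and the sample counts below are made up for the sketch.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up (n, c) pairs for three problems, 20 samples each.
results = [(20, 1), (20, 0), (20, 3)]
score = 100 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.1f}")
```

For k = 1 the estimator reduces to the fraction of samples that pass, averaged over problems.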


	# Use

	## Intended use

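	The model follows instructions provided in the input when they are phrased as a commit: the code to change goes after `<commit_before>`, the instruction goes after `<commit_msg>` as a commit message, and the model generates the revised code after `<commit_after>`. We recommend formatting your input like this: "<commit_before>def has_close_elements(numbers: List[float], threshold: float) -> bool:\n for idx, elem in enumerate(numbers):\n for idx2, elem2 in enumerate(numbers):\n if idx != idx2:\n distance = elem - elem2\n if distance < threshold:\n return True\n\n return False<commit_msg>Fix bugs in has_close_elements.<commit_after>"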

	Feel free to share your generations in the Community tab!

	## Generation
	```python
	# pip install -q transformers
	from transformers import AutoModelForCausalLM, AutoTokenizer
	checkpoint = "bigcode/santacoderpack"
	device = "cuda"  # use "cpu" to run without a GPU
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
	# Prompt in commit format: the buggy code, then the instruction as a commit message.
	prompt = "<commit_before>def has_close_elements(numbers: List[float], threshold: float) -> bool:\n for idx, elem in enumerate(numbers):\n for idx2, elem2 in enumerate(numbers):\n if idx != idx2:\n distance = elem - elem2\n if distance < threshold:\n return True\n\n return False<commit_msg>Fix bugs in has_close_elements.<commit_after>"
	inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
	# Without an explicit limit, generate() may fall back to a short default length; 128 is an arbitrary choice.
	outputs = model.generate(inputs, max_new_tokens=128)
	print(tokenizer.decode(outputs[0]))
	```
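The decoded output repeats the prompt before the generated code. A minimal sketch for keeping only the text after `<commit_after>`: the helper name `extract_code_after` is hypothetical, and the end-of-text marker `<|endoftext|>` is an assumption about the tokenizer rather than something stated on this card.

```python
def extract_code_after(decoded: str, marker: str = "<commit_after>",
                       eos: str = "<|endoftext|>") -> str:
    """Return the text following the last `marker`, cut at `eos` if present."""
    _, found, tail = decoded.rpartition(marker)
    if not found:
        raise ValueError(f"{marker!r} not found in decoded output")
    return tail.split(eos, 1)[0]


# Stand-in for tokenizer.decode(outputs[0]); the content is made up.
decoded = "<commit_before>x = 1<commit_msg>Set x to 2.<commit_after>x = 2<|endoftext|>"
print(extract_code_after(decoded))
```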

	# Training

	## Model

	- Architecture: GPT-2 model with multi-query attention
	- Steps: 250k pretraining
	- Pretraining tokens: 131B
	- Precision: bfloat16
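Multi-query attention keeps separate query heads but shares a single key/value head across them, which shrinks the key/value cache kept during generation. A rough sketch of the saving, assuming 24 layers, 16 attention heads, and a hidden size of 2048; these dimensions are assumptions for illustration and are not stated on this card.

```python
def kv_cache_values_per_token(n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Cached values per token: one key and one value vector per KV head per layer."""
    return 2 * n_layers * n_kv_heads * head_dim


# Assumed dimensions (not taken from this card).
n_layers, n_heads, hidden_size = 24, 16, 2048
head_dim = hidden_size // n_heads

multi_head = kv_cache_values_per_token(n_layers, n_heads, head_dim)  # one KV head per query head
multi_query = kv_cache_values_per_token(n_layers, 1, head_dim)       # a single shared KV head

print(multi_head, multi_query, multi_head // multi_query)
```

Whatever the real dimensions are, the cache shrinks by a factor equal to the number of query heads.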

	## Hardware

	- Pretraining:
	  - GPUs: 32 Tesla A100
	  - Training time: 15 days
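The training figures on this card allow a back-of-the-envelope check. The sketch below assumes that the 15 days are wall-clock time with all 32 GPUs in use throughout, and that the 131B tokens are spread evenly over the 250k steps; neither assumption is stated on this card.

```python
pretraining_tokens = 131e9  # from "Pretraining tokens: 131B"
steps = 250_000             # from "Steps: 250k"
gpus = 32
days = 15

tokens_per_step = pretraining_tokens / steps
gpu_hours = gpus * days * 24
tokens_per_second = pretraining_tokens / (days * 24 * 3600)
tokens_per_gpu_second = tokens_per_second / gpus

print(f"tokens per step:     {tokens_per_step:,.0f}")
print(f"GPU-hours:           {gpu_hours:,}")
print(f"tokens/s (all GPUs): {tokens_per_second:,.0f}")
print(f"tokens/s per GPU:    {tokens_per_gpu_second:,.0f}")
```

Under these assumptions this comes to 524,000 tokens per step and 11,520 GPU-hours.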

	## Software

	- Orchestration: [Megatron-LM/Transformers](https://github.com/bigcode-project/santacoderpack#training)
	- Neural networks: [PyTorch](https://github.com/pytorch/pytorch)

	# Citation

	TODO