Tokenizer for Python Code (Trained on CodeSearchNet)

Model Description

This is a custom Byte-Pair Encoding (BPE) tokenizer trained on the Python subset of the CodeSearchNet dataset, reusing the configuration of the pre-trained gpt2 byte-level BPE tokenizer. It is designed to tokenize Python code efficiently, which is useful for downstream tasks such as code generation, code completion, and code analysis.

Training Data

The tokenizer was trained on the whole_func_string column of the train split of the claudios/code_search_net dataset, restricted to Python code examples. The training corpus consisted of 412,178 Python function strings.
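
For reference, the corpus can be rebuilt with the datasets library. The sketch below assumes the claudios/code_search_net mirror exposes the same "python" configuration as the original code_search_net dataset; the batching helper and its name are illustrative:

from datasets import load_dataset

# Load the Python subset of CodeSearchNet (config name assumed to match the original dataset)
raw_dataset = load_dataset("claudios/code_search_net", "python", split="train")

# Yield the raw function strings in batches, as expected by tokenizer training
def get_training_corpus(batch_size=1000):
    for start in range(0, len(raw_dataset), batch_size):
        yield raw_dataset[start : start + batch_size]["whole_func_string"]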

Training Procedure

  1. Base Tokenizer: Started with a pre-trained gpt2 tokenizer.
  2. Training: The train_new_from_iterator method of transformers.PreTrainedTokenizerFast was used to learn a new vocabulary and merge rules from the CodeSearchNet Python corpus, with the vocabulary size set to 52,000 tokens (see the sketch below).
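
A minimal sketch of this procedure, reusing the get_training_corpus helper from above (the exact training script is not part of this card):

from transformers import AutoTokenizer

# Start from the pre-trained gpt2 tokenizer; its byte-level BPE configuration is reused
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Learn a new vocabulary and merge rules from the Python corpus
new_tokenizer = base_tokenizer.train_new_from_iterator(
    get_training_corpus(), vocab_size=52000
)

new_tokenizer.save_pretrained("python-codesearchnet-tokenizer")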

How to Use

You can load and use this tokenizer with the transformers library:

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rajaykumar12959/python-codesearchnet-tokenizer")

# Example usage
example_code = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weight + self.bias
    """

tokens = tokenizer.tokenize(example_code)
print(tokens)
# Output will be similar to:
# ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weight', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']
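# Note: 'Ġ' marks a leading space and 'Ċ' a newline in GPT-2-style byte-level BPE.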

encoded_input = tokenizer(example_code, return_tensors="pt")
print(encoded_input)
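
As a rough sanity check, you can compare token counts against the base gpt2 tokenizer; a tokenizer trained on Python code should generally split the same snippet into fewer tokens (exact counts will vary):

base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(len(base_tokenizer.tokenize(example_code)))  # generic gpt2 tokenizer
print(len(tokenizer.tokenize(example_code)))       # this Python-specific tokenizer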

License

This tokenizer is licensed under the MIT License.

Author

rajaykumar12959
