
# TinyStack Tokenizer

A byte-level BPE tokenizer trained on the fhswf/tiny-stack dataset.

## Usage

```python
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

# Load the trained vocabulary and merge rules shipped with this repository.
tokenizer = ByteLevelBPETokenizer("./vocab.json", "./merges.txt")

# Wrap every encoded sequence in <s> ... </s> special tokens (RoBERTa-style).
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

# Truncate encoded sequences to at most 512 tokens.
tokenizer.enable_truncation(max_length=512)
```
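
As a quick check (a minimal sketch; the sample sentence is arbitrary), encoding a string returns an `Encoding` whose token sequence is wrapped in the special tokens configured above:

```python
# Encode a sample sentence; the post-processor adds <s> ... </s> around it.
encoded = tokenizer.encode("Once upon a time, there was a tiny stack.")
print(encoded.tokens)  # token strings, beginning with <s> and ending with </s>
print(encoded.ids)     # corresponding token IDs from vocab.json
```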

Vocab size: 52000