# TinyStack Tokenizer
A byte-level BPE tokenizer trained on the fhswf/tiny-stack dataset.
## Usage
```python
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
# Load the trained vocabulary and merge rules from local files
tokenizer = ByteLevelBPETokenizer("./vocab.json", "./merges.txt")

# Wrap each encoded sequence with <s> ... </s>, RoBERTa-style
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

# Truncate sequences longer than 512 tokens
tokenizer.enable_truncation(max_length=512)
```
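Continuing from the snippet above, a minimal encoding sketch (the sample string is illustrative, not from the original card):

```python
# Encode a string; the post-processor wraps it with <s> ... </s>
output = tokenizer.encode("def add(a, b):\n    return a + b")
print(output.tokens)  # byte-level BPE tokens, including special tokens
print(output.ids)     # corresponding vocabulary ids
```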
Vocab size: 52,000