---
library_name: transformers
datasets:
- HuggingFaceTB/smollm-corpus
---

# Doge-tokenizer

Tokenizer for training models on [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus). It also supports R1-style reasoning fine-tuning.

This tokenizer was trained on 2M samples from:

- FineWeb-Edu 70%
- Cosmopedia v2 20%
- Python-Edu 5%
- FineMath 5%
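
For a rough sense of scale, the mix above can be turned into approximate per-source sample counts. This is only a sketch: it assumes the percentages are shares of the 2M samples (rather than of tokens or bytes), and the names `TOTAL_SAMPLES`, `MIX_PERCENT`, and `samples_per_source` are illustrative, not part of any released code.

```python
# Sketch: approximate per-source sample counts.
# Assumption: the listed percentages are shares of the 2M training samples.
TOTAL_SAMPLES = 2_000_000

# Mix percentages as listed in this card (integers avoid float rounding).
MIX_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 20,
    "Python-Edu": 5,
    "FineMath": 5,
}


def samples_per_source(total, mix_percent):
    """Return the approximate number of samples drawn from each source."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


counts = samples_per_source(TOTAL_SAMPLES, MIX_PERCENT)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Under that assumption, FineWeb-Edu contributes about 1.4M samples, Cosmopedia v2 about 400K, and Python-Edu and FineMath about 100K each.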