---
title: Hindi BPE Tokenizer
emoji: 🔤
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
---

# Hindi BPE Tokenizer

A Byte-Pair Encoding (BPE) tokenizer for Hindi text, presented as a Streamlit app.

## Features

- Tokenizes Hindi text using the BPE algorithm
- Visualizes the tokenization process
- Supports custom vocabulary

## Dataset

The final dataset used to train the tokenizer is in `text_file.txt` in the repo. It combines two primary sources:

- https://www.kaggle.com/datasets/disisbig/hindi-text-short-summarization-corpus (test dataset)
- https://hindi.newslaundry.com/report

Size of the combined text:

- Length of text (characters): 5933269
- Length of text (words): 1150937

## Results

- Length of text (words): 1150937
- Length of tokens (regex): 1354962

---

- Total bytes before: 14659421
- Total bytes after: 1889786
- Compression ratio: 7.76X
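
The compression ratio reported in the results can be checked from the two totals. The following is a minimal sketch, not the code used in this repo: the helper names `utf8_len` and `compression_ratio` are hypothetical, and it assumes that "bytes before" is the UTF-8 byte length of the text and "bytes after" is the number of units left once the BPE merges are applied.

```python
def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text (hypothetical helper)."""
    return len(text.encode("utf-8"))


def compression_ratio(bytes_before: int, units_after: int) -> float:
    """Ratio of raw size to encoded size (hypothetical helper).

    Assumes units_after counts the units remaining after BPE merges.
    """
    if units_after <= 0:
        raise ValueError("units_after must be positive")
    return bytes_before / units_after


# Totals copied from the Results section.
ratio = compression_ratio(14_659_421, 1_889_786)
print(f"Compression ratio: {ratio:.2f}X")
```

Devanagari code points take three bytes each in UTF-8, which is why the byte total is much larger than the character count. Under the same assumptions, `utf8_len(open("text_file.txt", encoding="utf-8").read())` would give the "before" figure for the training file.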