---
title: Hindi BPE Tokenizer
emoji: 🔤
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
---
# Hindi BPE Tokenizer
A Byte-Pair Encoding (BPE) tokenizer for Hindi text, with a Streamlit interface.
## Features
- Tokenizes Hindi text using the BPE algorithm
- Visualizes the tokenization process
- Supports custom vocabulary
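
For readers new to BPE, here is a minimal sketch of how byte-level merges can be learned and reversed. It is an illustration only, not the code in `app.py`: the function names (`train_bpe`, `merge`, `decode`), the choice to start from raw UTF-8 bytes, and the sample string are all assumptions made for this example.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Rebuild the text by expanding merged ids back into bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dicts keep merge order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


sample = "नमस्ते नमस्ते"  # hypothetical sample text
ids, merges = train_bpe(sample, num_merges=5)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges) == sample)
```

Each Devanagari character takes three bytes in UTF-8, so the first merges tend to recover frequent byte prefixes before whole characters and syllables start to appear as tokens.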
## Dataset
The final dataset used to train the tokenizer is in `text_file.txt` in the repo.
Two primary sources were combined into that file:
- https://www.kaggle.com/datasets/disisbig/hindi-text-short-summarization-corpus (test dataset)
- https://hindi.newslaundry.com/report

Corpus size:
- Length of text (characters): 5933269
- Length of text (words): 1150937
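
Counts like these can be reproduced with a few lines of Python. The sketch below assumes that "characters" means Unicode code points and "words" means whitespace-separated chunks; the README does not say how the figures were actually computed, and `text_stats` is a hypothetical helper name.

```python
def text_stats(text):
    """Return basic size statistics for a piece of text."""
    return {
        "characters": len(text),                  # Unicode code points
        "words": len(text.split()),               # whitespace-separated chunks (assumption)
        "utf8_bytes": len(text.encode("utf-8")),  # size when encoded as UTF-8
    }


stats = text_stats("यह एक वाक्य है")  # hypothetical sample sentence
print(stats)
```

To run it on the corpus, read the file first, for example with `pathlib.Path("text_file.txt").read_text(encoding="utf-8")`, and pass the result to `text_stats`.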
## Results
- Length of text (words): 1150937
- Length of tokens (regex): 1354962

---

- Total bytes before: 14659421
- Total bytes after: 1889786
- Compression ratio: 7.76X
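
The compression ratio follows directly from the two totals reported above. This short check uses only those figures; the variable names are illustrative.

```python
bytes_before = 14_659_421  # "Total bytes before" from the results
bytes_after = 1_889_786    # "Total bytes after" from the results

# Ratio of the original size to the size after BPE merges.
compression_ratio = bytes_before / bytes_after
print(f"Compression ratio: {compression_ratio:.2f}X")
```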