	---
	title: Hindi BPE Tokenizer
	emoji: 🔤
	colorFrom: blue
	colorTo: green
	sdk: streamlit
	sdk_version: 1.32.0
	app_file: app.py
	pinned: false
	---

	# Hindi BPE Tokenizer
	A Byte-Pair Encoding (BPE) tokenizer for Hindi text, with an interactive demo built in Streamlit.

	## Features
	- Tokenizes Hindi text using the BPE algorithm
	- Visualizes the tokenization process
	- Supports custom vocabulary
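As a rough illustration of the BPE algorithm named above, here is a minimal byte-level sketch. It is not taken from this repo's code: the function names (`get_pair_counts`, `merge`, `train_bpe`) are hypothetical, and it skips the regex pre-splitting step and any special tokens.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # nothing left to merge
        pair = counts.most_common(1)[0][0]  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


sample = "नमस्ते नमस्ते नमस्ते"
ids, merges = train_bpe(sample, num_merges=5)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
```

Starting from bytes rather than characters means every Devanagari code point begins as three tokens, so the earliest merges tend to rebuild single characters before they capture syllables or whole words.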

	## Dataset

	The final dataset used to train the tokenizer is in `text_file.txt` in the repo.

	Two primary sources were combined in that file:

	- https://www.kaggle.com/datasets/disisbig/hindi-text-short-summarization-corpus (test dataset)
	- https://hindi.newslaundry.com/report

	- Length of text (characters): 5933269
	- Length of text (words): 1150937
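A minimal sketch of how counts like these could be computed. The helper `text_stats` is hypothetical, and it assumes characters are Unicode code points and words are whitespace-separated chunks; the README does not say how the original figures were produced.

```python
def text_stats(text):
    """Return basic size statistics for a text (hypothetical helper)."""
    return {
        "characters": len(text),               # Unicode code points
        "words": len(text.split()),            # whitespace-separated chunks
        "utf8_bytes": len(text.encode("utf-8")),
    }


# To run on the training file instead of a sample string:
# from pathlib import Path
# text = Path("text_file.txt").read_text(encoding="utf-8")
print(text_stats("नमस्ते दुनिया"))
```

Devanagari code points take three bytes each in UTF-8, which is why the byte total for Hindi text is much larger than the character count.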


	## Results

	- Length of text (words): 1150937
	- Length of tokens (regex): 1354962
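The regex count is higher than the word count, which is what one would expect if a pre-tokenization pattern splits punctuation and digits away from words. The pattern below is an illustrative assumption, not the one used in this repo. It spells out Devanagari ranges explicitly because Python's built-in `re` does not treat combining vowel signs (matras) as `\w` characters.

```python
import re

# Assumed character classes for illustration only.
LETTERS = r"\u0900-\u0963\u0970-\u097F"   # Devanagari letters and marks (excludes danda and digits)
DIGITS = r"0-9\u0966-\u096F"              # ASCII and Devanagari digits

SPLIT_PATTERN = re.compile(
    rf" ?[{LETTERS}]+"                    # a Hindi word, with an optional leading space
    rf"| ?[A-Za-z]+"                      # a Latin-script word
    rf"| ?[{DIGITS}]+"                    # a run of digits
    rf"| ?[^\s{LETTERS}A-Za-z{DIGITS}]+"  # punctuation, including the danda
    rf"|\s+"                              # any remaining whitespace
)


def pre_tokenize(text):
    """Split text into chunks; BPE merges would then be applied within each chunk."""
    return SPLIT_PATTERN.findall(text)


chunks = pre_tokenize("नमस्ते दुनिया। यह 2024 है।")
print(len(chunks), chunks)
```

Because every character falls into exactly one branch, joining the chunks reproduces the input, so the split loses no information.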

	---
	- Total bytes before: 14659421
	- Total bytes after: 1889786
	- Compression ratio: 7.76x
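The ratio follows directly from the two totals above. A short check, where the function name `compression_ratio` is only illustrative:

```python
def compression_ratio(size_before, size_after):
    """Ratio of the input size to the encoded size."""
    return size_before / size_after


ratio = compression_ratio(14_659_421, 1_889_786)
print(f"{ratio:.2f}x")  # 7.76x
```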