---
title: Telugu Tokenizer App
emoji: 🚀
colorFrom: indigo
colorTo: blue
sdk: docker
sdk_version: "1.0"
app_file: app:app
pinned: false
description: >-
  A tokenizer app for Telugu text. It uses BPE (Byte Pair Encoding)
  with a vocabulary size of 5k.
tags:
  - telugu
  - tokenizer
  - NLP
  - transformers
license: apache-2.0
model: telugu-tokenizer-model
datasets:
  - telugu-dataset
isPrivate: false
---

# Telugu Tokenizer

This repository provides a tokenizer implementation for processing Telugu text, designed to handle both Telugu Unicode characters and ASCII characters. It uses a Byte Pair Encoding (BPE) approach to efficiently tokenize text and create a vocabulary optimized for Telugu language processing.

## Features

- **Comprehensive Telugu Support**: Includes all Telugu Unicode characters (U+0C00-U+0C7F), common ligatures, and valid consonant combinations.
- **Base Vocabulary Creation**: Generates a base vocabulary containing ASCII, Extended ASCII, and Telugu characters.
- **Byte Pair Encoding (BPE)**: Trains the tokenizer to merge frequently occurring token pairs, creating an optimized vocabulary.
- **Parallel Processing**: Utilizes multiprocessing for efficient tokenization of large text datasets.
- **Persistence**: Supports saving and loading the vocabulary to/from JSON files.

## Requirements

The tokenizer requires the following dependencies:

- Python 3.7+
- tqdm
- pandas
- datasets

Install the required packages using pip:

```bash
pip install tqdm pandas datasets
```

## Usage

### 1. Base Vocabulary Creation

The tokenizer first generates a base vocabulary containing ASCII, Extended ASCII, and Telugu characters.

```python
from telugu_tokenizer import create_base_vocab, save_base_vocab

base_vocab = create_base_vocab()
save_base_vocab(base_vocab, path='telugu_base_vocab.json')
```

### 2. Loading an Existing Vocabulary

You can load an existing base vocabulary from a JSON file:

```python
from telugu_tokenizer import load_base_vocab

vocab = load_base_vocab('telugu_base_vocab.json')
```

### 3. Training the Tokenizer

The `BPETokenizer` class can be used to train a tokenizer on a given text input:

```python
from telugu_tokenizer import BPETokenizer

text = "మీరు ఎలా ఉన్నారు?"  # Sample Telugu text
tokenizer = BPETokenizer(vocab_size=5000)
tokenizer.fit(text)
```

### 4. Saving and Loading the Tokenizer

After training, save the tokenizer's vocabulary and merges:

```python
tokenizer.save('telugu_tokenizer')
```

Load the trained tokenizer:

```python
tokenizer.load('telugu_tokenizer')
```

## Telugu Unicode Support

The tokenizer covers the full range of Telugu Unicode characters, including vowels, consonants, vowel signs, digits, and fraction symbols. Additionally, it supports:

- Common ligatures formed with Telugu consonants and vowel signs.
- Valid consonant combinations such as `క్క` and `క్జ`.

## File Structure

- **`bpe_tokenizer.py`**: Contains the implementation of the Telugu tokenizer.
- **`telugu_base_vocab.json`**: JSON file storing the base vocabulary.
- **`telugu_tokenizer_vocab.json`**: JSON file storing the trained vocabulary and merges (generated after training).

## Results

- **Final vocabulary size**: 4,999
- **Final compression ratio**: 8.63x

## Logs

- [View Training Logs](./training_logs.log)

## Performance

The tokenizer uses multiprocessing to handle large datasets efficiently. It processes text in chunks and merges token pairs iteratively to optimize the vocabulary size. This is a simple implementation that leaves room for optimization on large-scale datasets.
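
As an illustration of the chunked, parallel approach described above, the sketch below shows one way the pair-frequency counting step of BPE training could be distributed across worker processes with Python's `multiprocessing`. The names `count_pairs` and `parallel_pair_counts` are hypothetical and not part of `bpe_tokenizer.py`, and for simplicity the sketch ignores pairs that straddle chunk boundaries.

```python
# Minimal sketch (not the actual bpe_tokenizer.py implementation): count
# adjacent token-pair frequencies over chunks of a token stream in parallel.
from collections import Counter
from multiprocessing import Pool


def count_pairs(chunk):
    """Count adjacent token pairs within a single chunk of token IDs."""
    counts = Counter()
    for a, b in zip(chunk, chunk[1:]):
        counts[(a, b)] += 1
    return counts


def parallel_pair_counts(token_ids, num_workers=4, chunk_size=100_000):
    """Split the token stream into chunks and merge per-chunk pair counts.

    Pairs that straddle a chunk boundary are not counted in this
    simplified sketch.
    """
    chunks = [token_ids[i:i + chunk_size]
              for i in range(0, len(token_ids), chunk_size)]
    with Pool(num_workers) as pool:
        partial = pool.map(count_pairs, chunks)
    total = Counter()
    for c in partial:
        total.update(c)
    return total


if __name__ == "__main__":
    # Hypothetical usage: token_ids would come from the base-vocabulary encoding.
    token_ids = [1, 2, 3, 1, 2, 3, 1, 2]
    print(parallel_pair_counts(token_ids, num_workers=2, chunk_size=4).most_common(3))
```

In a full BPE training loop, the most frequent pair from the merged counts would be replaced by a new token, and the count-and-merge step repeated until the target vocabulary size is reached.
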
## Future Enhancements

- Extend support for additional Telugu ligatures and symbols.
- Optimize BPE training for large-scale datasets.
- Provide pre-trained models for common Telugu NLP tasks.

## License

This project is licensed under the MIT License. See the LICENSE file for more details.

## Contributing

Contributions are welcome! Feel free to submit a pull request or open an issue if you encounter bugs or have suggestions for improvement.

## Acknowledgments

- Unicode Consortium for Telugu Unicode character information.
- Community contributions to Telugu NLP development.

---

Feel free to explore the tokenizer and adapt it for your Telugu language processing needs. Happy coding!