Chaitanya Sagar Gurujula committed on
Commit · 496ac89
1 Parent(s): bc28434

Add application file
Browse files
- Dockerfile +12 -0
- README.md +139 -6
- requirements.txt +9 -0
- src/__pycache__/bpe_tokenizer.cpython-312.pyc +0 -0
- src/app.py +123 -0
- src/bpe_tokenizer.py +660 -0
- src/templates/index.html +134 -0
- telugu_base_vocab.json +0 -0
- telugu_tokenizer_merges.json +0 -0
- telugu_tokenizer_vocab.json +0 -0
- training_logs.log +376 -0
Dockerfile
ADDED
@@ -0,0 +1,12 @@
+FROM python:3.9-slim
+
+WORKDIR /app
+
+COPY requirements.txt .
+RUN pip install -r requirements.txt
+
+COPY src/ .
+COPY telugu_tokenizer_vocab.json .
+COPY telugu_tokenizer_merges.json .
+
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md
CHANGED
@@ -1,11 +1,144 @@
 ---
-title: Telugu Tokenizer
-emoji:
-colorFrom:
-colorTo:
+title: Telugu Tokenizer App
+emoji: เฐ
+colorFrom: indigo
+colorTo: blue
 sdk: docker
+sdk_version: "1.0"
+app_file: app:app
 pinned: false
-
+description: A tokenizer app for Telugu text. It uses BPE (Byte Pair Encoding) to tokenize Telugu text with a 5K vocabulary.
+tags:
+- telugu
+- tokenizer
+- NLP
+- transformers
+license: apache-2.0
+model: telugu-tokenizer-model
+datasets:
+- telugu-dataset
+isPrivate: false
 ---
 
-
+# Telugu Tokenizer
+
+This repository provides a tokenizer implementation for processing Telugu text, designed to handle both Telugu Unicode characters and ASCII characters. It uses a Byte Pair Encoding (BPE) approach to efficiently tokenize text and build a vocabulary optimized for Telugu language processing.
+
+## Features
+
+- **Comprehensive Telugu Support**: Includes all Telugu Unicode characters (0C00-0C7F), common ligatures, and valid consonant combinations.
+- **Base Vocabulary Creation**: Generates a base vocabulary containing ASCII, Extended ASCII, and Telugu characters.
+- **Byte Pair Encoding (BPE)**: Trains the tokenizer to merge frequently occurring token pairs, creating an optimized vocabulary.
+- **Parallel Processing**: Uses multiprocessing for efficient tokenization of large text datasets.
+- **Persistence**: Supports saving and loading the vocabulary to/from JSON files.
+
+## Requirements
+
+The tokenizer requires the following dependencies:
+
+- Python 3.7+
+- tqdm
+- pandas
+- datasets
+
+Install the required packages using pip:
+```bash
+pip install tqdm pandas datasets
+```
+
+## Usage
+
+### 1. Base Vocabulary Creation
+
+The tokenizer first generates a base vocabulary containing ASCII, Extended ASCII, and Telugu characters.
+
+```python
+from telugu_tokenizer import create_base_vocab, save_base_vocab
+
+base_vocab = create_base_vocab()
+save_base_vocab(base_vocab, path='telugu_base_vocab.json')
+```
+
+### 2. Loading an Existing Vocabulary
+
+You can load an existing base vocabulary from a JSON file:
+
+```python
+from telugu_tokenizer import load_base_vocab
+
+vocab = load_base_vocab('telugu_base_vocab.json')
+```
+
+### 3. Training the Tokenizer
+
+The `BPETokenizer` class can be used to train a tokenizer on a given text input:
+
+```python
+from telugu_tokenizer import BPETokenizer
+
+text = "మీరు ఎలా ఉన్నారు?"  # Sample Telugu text
+tokenizer = BPETokenizer(vocab_size=5000)
+tokenizer.fit(text)
+```
+
+### 4. Saving and Loading the Tokenizer
+
+After training, save the tokenizer's vocabulary and merges:
+
+```python
+tokenizer.save('telugu_tokenizer')
+```
+
+Load the trained tokenizer:
+
+```python
+tokenizer.load('telugu_tokenizer')
+```
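Once loaded, the tokenizer round-trips text through `encode` and `decode`, both methods of `BPETokenizer`. A minimal sketch (import path as in the examples above), assuming the saved `telugu_tokenizer_*.json` files from the previous step are present; the token IDs you get back depend on the trained vocabulary:

```python
from telugu_tokenizer import BPETokenizer

tokenizer = BPETokenizer(vocab_size=5000)
tokenizer.load('telugu_tokenizer')

text = "తెలుగు భాష"
token_ids = tokenizer.encode(text)    # list of integer token IDs
decoded = tokenizer.decode(token_ids)
assert decoded == text
```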
+
+## Telugu Unicode Support
+
+The tokenizer covers the full range of Telugu Unicode characters, including vowels, consonants, vowel signs, digits, and fraction symbols. Additionally, it supports:
+
+- Common ligatures formed with Telugu consonants and vowel signs.
+- Valid consonant combinations (consonant + virama + consonant clusters), generated as sketched below.
+
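The consonant combinations referred to above are clusters of the form consonant + virama (్) + consonant. A small illustrative sketch of how such clusters can be enumerated; the consonant list here is a short sample, not the full set used by the base vocabulary:

```python
# Illustrative only: enumerate consonant + virama + consonant clusters.
consonants = ['క', 'త', 'ర', 'ల']   # sample consonants, not the full set
virama = '\u0c4d'                    # ్
clusters = [c1 + virama + c2 for c1 in consonants for c2 in consonants]
print(clusters[:4])                  # ['క్క', 'క్త', 'క్ర', 'క్ల']
```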
+## File Structure
+
+- **`bpe_tokenizer.py`**: Contains the implementation of the Telugu tokenizer.
+- **`telugu_base_vocab.json`**: JSON file storing the base vocabulary.
+- **`telugu_tokenizer_vocab.json`**: JSON file storing the trained vocabulary and merges (generated after training).
+
+## Results
+
+- **Final vocabulary size**: 4,999
+- **Final compression ratio**: 8.63x (UTF-8 bytes of the training text divided by the number of tokens after merging)
+
+## Logs
+- [View Training Logs](./training_logs.log)
+
+## Performance
+
+The tokenizer uses multiprocessing to handle large datasets efficiently. It processes text in chunks and merges token pairs iteratively to grow the vocabulary up to the target size. This is a simple implementation and can be improved for large-scale datasets.
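The chunk-and-pool pattern described above looks roughly like the following sketch (simplified from `fit()` in `src/bpe_tokenizer.py`; `tokenize_chunk` is a stand-in for the real per-chunk worker, which maps byte sequences to base-vocabulary token IDs):

```python
import os
from multiprocessing import Pool

def tokenize_chunk(chunk):
    # Stand-in worker: the real implementation looks up byte sequences
    # in the base vocabulary and returns their token IDs.
    return list(chunk)

def tokenize_parallel(data: bytes):
    num_cores = os.cpu_count() or 1
    # Larger chunks keep per-process overhead low.
    chunk_size = max(64 * 1024, len(data) // (num_cores * 4))
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Run under `if __name__ == "__main__":` on spawn-based platforms.
    with Pool(num_cores) as pool:
        results = pool.map(tokenize_chunk, chunks)
    return [tok for part in results for tok in part]
```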
+## Future Enhancements
+
+- Extend support for additional Telugu ligatures and symbols.
+- Optimize BPE training for large-scale datasets.
+- Provide pre-trained models for common Telugu NLP tasks.
+
+## License
+
+This project is licensed under the MIT License. See the LICENSE file for more details.
+
+## Contributing
+
+Contributions are welcome! Feel free to submit a pull request or open an issue if you encounter bugs or have suggestions for improvement.
+
+## Acknowledgments
+
+- Unicode Consortium for Telugu Unicode character information.
+- Community contributions to Telugu NLP development.
+
+---
+
+Feel free to explore the tokenizer and adapt it for your Telugu language processing needs. Happy coding!
+
requirements.txt
ADDED
@@ -0,0 +1,9 @@
+fastapi==0.68.0
+uvicorn==0.15.0
+jinja2==3.0.1
+python-multipart==0.0.5
+datasets==2.12.0
+tqdm==4.65.0
+aiofiles==0.8.0
+python-multipart==0.0.5
+pandas==2.2.3
src/__pycache__/bpe_tokenizer.cpython-312.pyc
ADDED
Binary file (42.6 kB).
src/app.py
ADDED
@@ -0,0 +1,123 @@
+from fastapi import FastAPI, Request
+from fastapi.responses import HTMLResponse
+from fastapi.templating import Jinja2Templates
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+from bpe_tokenizer import BPETokenizer, create_base_vocab
+import os
+import json
+
+# Get the absolute path to the templates directory
+TEMPLATES_DIR = os.path.join(os.path.dirname(__file__), "templates")
+
+app = FastAPI(title="Telugu BPE Tokenizer")
+
+# Add CORS middleware
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+# Templates with absolute path
+templates = Jinja2Templates(directory=TEMPLATES_DIR)
+
+# Initialize tokenizer
+tokenizer = BPETokenizer(vocab_size=5000)
+
+# Load the vocabulary file directly
+print("Loading vocabulary...")
+vocab_file = 'telugu_tokenizer_vocab.json'
+with open(vocab_file, 'r', encoding='utf-8') as f:
+    vocab_data = json.load(f)
+
+class TokenizeRequest(BaseModel):
+    text: str
+
+@app.get("/", response_class=HTMLResponse)
+async def home(request: Request):
+    return templates.TemplateResponse(
+        "index.html",
+        {"request": request, "title": "Telugu BPE Tokenizer"}
+    )
+
+@app.post("/tokenize")
+async def tokenize(request: TokenizeRequest):
+    text = request.text
+    try:
+        tokens = tokenizer.encode(text)
+        decoded = tokenizer.decode(tokens)
+
+        # Get token details from vocabulary for display
+        token_details = []
+        current_position = 0
+        current_byte_position = 0
+        text_bytes = text.encode('utf-8')
+
+        while current_position < len(tokens):
+            # Skip leading spaces in original text
+            while current_byte_position < len(text_bytes) and text_bytes[current_byte_position] == 32:
+                current_byte_position += 1
+
+            # Get next word from original text
+            word_start = current_byte_position
+            word_end = word_start
+            while word_end < len(text_bytes) and text_bytes[word_end] != 32:
+                word_end += 1
+
+            word_bytes = text_bytes[word_start:word_end]
+            word = word_bytes.decode('utf-8')
+
+            # Collect tokens for this word
+            word_tokens = []
+            decoded_bytes = b''
+
+            while current_position < len(tokens):
+                token = tokens[current_position]
+                token_bytes = tokenizer.vocab[token]
+
+                # If we've collected enough bytes for the word (plus possible space)
+                if len(decoded_bytes) >= len(word_bytes):
+                    break
+
+                word_tokens.append(token)
+                decoded_bytes += token_bytes
+                current_position += 1
+
+            # Update byte position for next word
+            current_byte_position = word_end
+
+            # Add word and its tokens to details
+            token_details.append({
+                "word": word,
+                "type": "subword_tokens",
+                "tokens": [{
+                    "id": t,
+                    "text": vocab_data.get(str(t), {}).get('text', '[UNKNOWN]')
+                } for t in word_tokens]
+            })
+
+        return {
+            "original": text,
+            "tokens": tokens,
+            "token_details": token_details,
+            "decoded": decoded,
+            "matches": text == decoded
+        }
+    except Exception as e:
+        print(f"Error: {str(e)}")
+        return {"error": str(e)}
+
+@app.get("/vocab")
+async def get_vocab():
+    return {
+        "vocab_size": len(vocab_data),
+        "base_vocab_size": sum(1 for info in vocab_data.values() if info.get('is_base', False)),
+        "num_merges": len(getattr(tokenizer, 'merges', {}))
+    }
+
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="127.0.0.1", port=8001)
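With the container from the Dockerfile above running (uvicorn serves the app on port 7860), the endpoints defined in this file can be exercised from the standard library alone. A minimal sketch; the host and port are assumptions about your local setup:

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # assumed local address of the running Space

payload = json.dumps({"text": "తెలుగు భాష"}).encode("utf-8")
req = urllib.request.Request(
    f"{BASE}/tokenize",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
print(result["tokens"], result["matches"])

with urllib.request.urlopen(f"{BASE}/vocab") as resp:
    print(json.load(resp))  # {"vocab_size": ..., "base_vocab_size": ..., "num_merges": ...}
```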
src/bpe_tokenizer.py
ADDED
@@ -0,0 +1,660 @@
+from tqdm import tqdm
+from collections import Counter
+import json
+from datasets import load_dataset
+import time
+import os
+import re
+import pandas as pd
+from multiprocessing import Pool
+import array
+
+def get_telugu_char_info():
+    """
+    Returns a dictionary of Telugu Unicode ranges with their descriptions.
+    Based on Unicode 13.0 Telugu block (0C00-0C7F).
+    """
+    return {
+        (0x0C00, 0x0C03): "Various forms of Telugu anusvara and visarga",
+        (0x0C05, 0x0C14): "Telugu vowels (అ to ఔ)",
+        (0x0C15, 0x0C39): "Telugu consonants (క to హ)",
+        (0x0C3D, 0x0C44): "Telugu vowel signs (ఽ to ౄ)",
+        (0x0C46, 0x0C48): "Telugu vowel signs (ె to ై)",
+        (0x0C4A, 0x0C4D): "Telugu vowel signs and virama (ొ to ్)",
+        (0x0C55, 0x0C56): "Telugu length marks",
+        (0x0C58, 0x0C5A): "Additional Telugu consonants",
+        (0x0C60, 0x0C63): "Telugu vocalic letters",
+        (0x0C66, 0x0C6F): "Telugu digits (౦ to ౯)",
+        (0x0C78, 0x0C7F): "Telugu fraction symbols"
+    }
+
+def create_base_vocab():
+    """Create a base vocabulary with ASCII, Telugu characters, and common ligatures."""
+    vocab = {}
+    token_id = 0
+    existing_tokens = set()  # Set to track existing tokens
+
+    # Add ASCII characters (0-127)
+    print("Adding ASCII characters...")
+    for i in range(128):
+        char_bytes = bytes([i])
+        try:
+            char = char_bytes.decode('utf-8', errors='strict')
+            vocab[token_id] = {
+                'text': char,
+                'bytes': list(char_bytes),
+                'type': 'ASCII',
+                'description': f"ASCII character: {repr(char)}"
+            }
+            token_id += 1
+        except UnicodeDecodeError:
+            continue
+
+    # Add Extended ASCII characters (128-255)
+    print("Adding Extended ASCII characters...")
+    for i in range(128, 256):
+        char_bytes = bytes([i])
+        try:
+            # Try to decode as UTF-8 first
+            char = char_bytes.decode('utf-8', errors='strict')
+            vocab[token_id] = {
+                'text': char if char.isprintable() else f"<{hex(i)[2:].upper()}>",
+                'bytes': list(char_bytes),
+                'type': 'Extended ASCII',
+                'description': f"Extended ASCII character: {char} ({hex(i)})"
+            }
+        except UnicodeDecodeError:
+            # If not valid UTF-8, store as bytes representation
+            vocab[token_id] = {
+                'text': f"[Bytes: {list(char_bytes)}]",
+                'bytes': list(char_bytes),
+                'type': 'Extended ASCII',
+                'description': f"Extended ASCII byte: {hex(i)}"
+            }
+        token_id += 1
+
+    # Add Telugu Unicode characters (0C00-0C7F)
+    print("Adding Telugu characters...")
+    telugu_info = get_telugu_char_info()
+
+    for i in range(0x0C00, 0x0C7F + 1):
+        try:
+            char = chr(i)
+            char_bytes = char.encode('utf-8')
+            # Only add if it's a valid character
+            char.encode('utf-8').decode('utf-8')
+
+            # Find the character's category
+            char_type = "Other Telugu Character"
+            char_description = "Telugu character"
+            for (start, end), desc in telugu_info.items():
+                if start <= i <= end:
+                    char_type = desc
+                    char_description = f"Telugu character: {char} ({hex(i)})"
+                    break
+
+            vocab[token_id] = {
+                'text': char,
+                'bytes': list(char_bytes),
+                'type': char_type,
+                'description': char_description
+            }
+            token_id += 1
+        except UnicodeEncodeError:
+            continue
+
+    # Define Telugu consonants and vowel signs
+    consonants = [
+        'క', 'ఖ', 'గ', 'ఘ', 'ఙ', 'చ', 'ఛ', 'జ', 'ఝ', 'ఞ',
+        'ట', 'ఠ', 'డ', 'ఢ', 'ణ', 'త', 'థ', 'ద', 'ధ', 'న',
+        'ప', 'ఫ', 'బ', 'భ', 'మ', 'య', 'ర', 'ల', 'వ', 'శ',
+        'ష', 'స', 'హ', 'ళ', 'క్ష', 'ఱ'
+    ]
+
+    vowel_signs = [
+        '', 'ా', 'ి', 'ీ', 'ు', 'ూ', 'ృ', 'ౄ', 'ౢ', 'ౣ', 'ె', 'ే', 'ై', 'ొ', 'ో', 'ౌ', 'ం', 'ః', 'ఁ', '్'
+    ]
+
+
+    # Add common Telugu ligatures with existing vowel signs
+    print("Adding common Telugu ligatures with existing vowel signs...")
+    for consonant in consonants:
+        for vowel_sign in vowel_signs:
+            ligature = consonant + vowel_sign
+            if ligature not in existing_tokens:  # Check for duplicates
+                char_bytes = ligature.encode('utf-8')
+                vocab[token_id] = {
+                    'text': ligature,
+                    'bytes': list(char_bytes),
+                    'type': 'Ligature',
+                    'description': f"Telugu ligature: {ligature}"
+                }
+                existing_tokens.add(ligature)  # Add to the set
+                token_id += 1
+
+    # Add valid consonant combinations
+    print("Adding valid consonant combinations...")
+    # Consonant + virama (్) + consonant clusters. The original file spelled out
+    # the full cross product as a literal list; it is generated here instead.
+    valid_consonant_combinations = [
+        c1 + '్' + c2 for c1 in consonants for c2 in consonants
+        # Add more valid combinations as needed
+    ]
+
+    for combination in valid_consonant_combinations:
+        if combination not in existing_tokens:  # Check for duplicates
+            char_bytes = combination.encode('utf-8')
+            vocab[token_id] = {
+                'text': combination,
+                'bytes': list(char_bytes),
+                'type': 'Ligature',
+                'description': f"Telugu ligature: {combination}"
+            }
+            existing_tokens.add(combination)  # Add to the set
+            token_id += 1
+
+    print(f"Created base vocabulary with {len(vocab)} tokens")
+    return vocab
+
+def save_base_vocab(vocab, path='telugu_base_vocab.json'):
+    """Save the base vocabulary with character information."""
+    # Sort by character type for better readability
+    sorted_vocab = {}
+    for k, v in sorted(vocab.items(), key=lambda x: (x[1]['type'], x[0])):
+        sorted_vocab[str(k)] = v
+
+    with open(path, 'w', encoding='utf-8') as f:
+        json.dump(sorted_vocab, f, ensure_ascii=False, indent=2)
+    print(f"Base vocabulary saved to {path}")
+
+def load_base_vocab(path='telugu_base_vocab.json'):
+    """Load the base vocabulary."""
+    with open(path, 'r', encoding='utf-8') as f:
+        vocab = json.load(f)
+    return {int(k): bytes(v['bytes']) for k, v in vocab.items()}
+
+class BPETokenizer:
+    def __init__(self, vocab_size=5000, sample_size=None):
+        self.vocab_size = vocab_size
+        self.sample_size = sample_size
+
+        # First try to load trained vocabulary
+        trained_vocab_path = 'telugu_tokenizer_vocab.json'
+        if os.path.exists(trained_vocab_path):
+            print("Loading trained vocabulary...")
+            self.load('telugu_tokenizer')  # This loads both vocab and merges
+            return
+
+        # If no trained vocab exists, fall back to base vocabulary
+        base_vocab_path = 'telugu_base_vocab.json'
+        if os.path.exists(base_vocab_path):
+            print("Loading existing base vocabulary...")
+            self.vocab = load_base_vocab(base_vocab_path)
+        else:
+            print("Creating new base vocabulary...")
+            base_vocab = create_base_vocab()
+            save_base_vocab(base_vocab)
+            self.vocab = load_base_vocab(base_vocab_path)
+
+        self.base_vocab_size = len(self.vocab)
+        self.merges = {}
+
+    def get_stats(self, ids):
+        """Count token pair frequencies."""
+        counts = {}
+        for pair in zip(ids, ids[1:]):
+            counts[pair] = counts.get(pair, 0) + 1
+        return counts
+
+    def merge(self, ids, pair, idx):
+        """Merge all occurrences of a token pair."""
+        # Create the merged token
+        merged_token = self.vocab[pair[0]] + self.vocab[pair[1]]
+
+        # Check if the merged token already exists in the vocabulary
+        for existing_id, existing_token in self.vocab.items():
+            if existing_token == merged_token:
+                # Instead of skipping, use the existing token ID for merging
+                print(f"Merge for {pair} already exists in the vocabulary.")
+                newids = []
+                i = 0
+                while i < len(ids):
+                    if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
+                        newids.append(existing_id)
+                        i += 2
+                    else:
+                        newids.append(ids[i])
+                        i += 1
+                return newids
+
+        # If we get here, the merged token doesn't exist yet
+        newids = []
+        i = 0
+        while i < len(ids):
+            if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
+                newids.append(idx)
+                i += 2
+            else:
+                newids.append(ids[i])
+                i += 1
+        return newids
+
+    def _process_chunk(self, args):
+        """Process a chunk of text for parallel processing."""
+        chunk, byte_to_token = args
+        ids = array.array('I')  # Unsigned int array
+        j = 0
+        while j < len(chunk):
+            if chunk[j] == 32:  # Space
+                ids.append(32)
+                j += 1
+                continue
+
+            found = False
+            for length in [3, 2, 1]:
+                if j + length <= len(chunk):
+                    char_bytes = bytes(chunk[j:j+length])
+                    if char_bytes in byte_to_token:
+                        ids.append(byte_to_token[char_bytes])
+                        j += length
+                        found = True
+                        break
+            if not found:
+                j += 1
+        return ids
+
+    def fit(self, text):
+        """Train the BPE tokenizer."""
+        print("Converting text to token IDs using base vocabulary...")
+
+        original_bytes = text.encode('utf-8')
+        original_length = len(original_bytes)
+        print(f"\nBefore training: text bytes length: {original_length:,}")
+
+        # Pre-compute byte sequences for faster lookup
+        byte_to_token = {token_bytes: token_id for token_id, token_bytes in self.vocab.items()}
+
+        # Parallel processing of chunks
+        num_cores = os.cpu_count() or 1
+        chunk_size = max(1024 * 64, len(original_bytes) // (num_cores * 4))  # Larger chunks
+        chunks = [original_bytes[i:i + chunk_size] for i in range(0, len(original_bytes), chunk_size)]
+
+        print(f"Processing {len(chunks)} chunks using {num_cores} cores...")
+
+        # Process chunks in parallel
+        with Pool(num_cores) as pool:
+            chunk_results = list(tqdm(
+                pool.imap(self._process_chunk, [(chunk, byte_to_token) for chunk in chunks]),
+                total=len(chunks),
+                desc="Initial tokenization"
+            ))
+
+        # Combine results
+        ids = array.array('I')
+        for result in chunk_results:
+            ids.extend(result)
+
+        print(f"\nBase vocabulary size: {self.base_vocab_size}")
+        print(f"Initial sequence length: {len(ids)}")
+
+        # Keep training until we reach the target vocab size
+        target_vocab_size = self.vocab_size
+        pbar = tqdm(total=target_vocab_size - self.base_vocab_size, desc="Training BPE")
+        last_vocab_size = len(self.vocab)
+
+        while len(self.vocab) < target_vocab_size:
+            stats = self.get_stats(ids)
+            if not stats:
+                print("No more pairs to merge.")
+                break
+
+            pair = max(stats, key=stats.get)
+            idx = len(self.vocab)
+            ids = self.merge(ids, pair, idx)
+
+            # Only update progress when vocabulary actually grows
+            if len(self.vocab) > last_vocab_size:
+                pbar.update(len(self.vocab) - last_vocab_size)
+                last_vocab_size = len(self.vocab)
+
+            # Add the merged token to the vocabulary
+            if pair not in self.merges:  # Ensure we don't overwrite existing merges
+                self.merges[pair] = idx
+                self.vocab[idx] = self.vocab[pair[0]] + self.vocab[pair[1]]
+
+            # Print progress periodically
+            if len(self.vocab) % 100 == 0:
+                try:
+                    text0 = self.vocab[pair[0]].decode('utf-8')
+                    text1 = self.vocab[pair[1]].decode('utf-8')
+                    merged = self.vocab[idx].decode('utf-8')
+                    print(f"\nVocab size: {len(self.vocab)}: {text0} + {text1} = {merged}")
+                except UnicodeDecodeError:
+                    continue
+
+        pbar.close()
+        print("\nFinal statistics:")
+        print(f"Final vocabulary size: {len(self.vocab):,}")
+        print(f"Number of merges: {len(self.merges):,}")
+        print(f"Final compression ratio: {original_length / len(ids):.2f}x")
+
+    def encode(self, text):
+        """Encode text to token IDs."""
+        final_tokens = []
+        i = 0
+        text_bytes = text.encode('utf-8')
+
+        while i < len(text_bytes):
+            # If we're at a leading space, encode it separately
+            if text_bytes[i] == 32:  # ASCII space
+                final_tokens.append(32)  # Space token
+                i += 1
+                continue
+
+            # Try to find the longest matching sequence (including potential trailing spaces)
+            longest_match = None
+            longest_length = 0
+            matched_token = None
+
+            # Sort vocab items by length (longest first)
+            for token_id, token_bytes in sorted(self.vocab.items(),
+                                                key=lambda x: len(x[1]),
+                                                reverse=True):
+                if (i + len(token_bytes) <= len(text_bytes) and
+                    text_bytes[i:i+len(token_bytes)] == token_bytes):
+                    longest_length = len(token_bytes)
+                    longest_match = token_bytes
+                    matched_token = token_id
+                    break
+
+            if longest_match:
+                final_tokens.append(matched_token)
+                i += longest_length
+            else:
+                # If no match found, fall back to single byte
+                for token_id, token_bytes in self.vocab.items():
+                    if token_bytes == bytes([text_bytes[i]]):
+                        final_tokens.append(token_id)
+                        break
+                i += 1
+
+        return final_tokens
+
+    def decode(self, tokens):
+        """Decode token IDs back to text."""
+        bytes_tokens = b''.join(self.vocab[idx] for idx in tokens)
+        return bytes_tokens.decode('utf-8')
+
+    def save(self, path):
+        """Save the tokenizer mappings to files."""
+        base_path = path.rsplit('.', 1)[0]
+
+        # Save vocabulary with human-readable form
+        vocab_mapping = {}
+        for token_id, byte_seq in self.vocab.items():
+            try:
+                text = byte_seq.decode('utf-8')
+                vocab_mapping[token_id] = {
+                    'text': text,
+                    'bytes': list(byte_seq),
+                    'is_base': token_id < self.base_vocab_size
+                }
+            except UnicodeDecodeError:
+                vocab_mapping[token_id] = {
+                    'text': f"[Bytes: {list(byte_seq)}]",
+                    'bytes': list(byte_seq),
+                    'is_base': token_id < self.base_vocab_size
+                }
+
+        # Save merge patterns with human-readable form
+        merge_patterns = {}
+        for (p0, p1), idx in self.merges.items():
+            try:
+                text0 = self.vocab[p0].decode('utf-8')
+                text1 = self.vocab[p1].decode('utf-8')
+                merged = self.vocab[idx].decode('utf-8')
+                merge_patterns[idx] = {
+                    'parts': [text0, text1],
+                    'result': merged,
+                    'token_ids': [p0, p1]
+                }
+            except UnicodeDecodeError:
+                merge_patterns[idx] = {
+                    'parts': [f"Token_{p0}", f"Token_{p1}"],
+                    'result': f"Token_{idx}",
+                    'token_ids': [p0, p1]
+                }
+
+        with open(f"{base_path}_vocab.json", 'w', encoding='utf-8') as f:
+            json.dump(vocab_mapping, f, ensure_ascii=False, indent=2)
+
+        with open(f"{base_path}_merges.json", 'w', encoding='utf-8') as f:
+            json.dump(merge_patterns, f, ensure_ascii=False, indent=2)
+
+        print(f"\nTokenizer mappings saved to {base_path}_vocab.json and {base_path}_merges.json")
+
+    def load(self, path):
+        """Load the tokenizer from mapping files."""
+        base_path = path.rsplit('.', 1)[0]
+
+        with open(f"{base_path}_vocab.json", 'r', encoding='utf-8') as f:
+            vocab_mapping = json.load(f)
+            self.vocab = {
+                int(k): bytes(v['bytes'])
+                for k, v in vocab_mapping.items()
+            }
+            # Find base vocabulary size
+            self.base_vocab_size = sum(1 for k, v in vocab_mapping.items() if v['is_base'])
+
+        with open(f"{base_path}_merges.json", 'r', encoding='utf-8') as f:
+            merge_patterns = json.load(f)
+            self.merges = {
+                tuple(v['token_ids']): int(k)
+                for k, v in merge_patterns.items()
+            }
+
+        self.vocab_size = len(self.vocab)
+        print(f"Loaded tokenizer from {base_path}_*.json files")
+
+    def train_on_dataset(self):
+        """Train tokenizer on the Telugu news dataset."""
+        print("Loading dataset...")
+        try:
+            # Load the local parquet file
+            dataset = pd.read_parquet('telugu_news_dataset.parquet')
+
+            print("Preparing training text...")
+            training_text = []
+
+            for _, row in tqdm(dataset.iterrows(), desc="Loading documents", total=len(dataset)):
+                if not pd.isna(row["headline"]): training_text.append(row["headline"])
+                if not pd.isna(row["article"]): training_text.append(row["article"])
+
+                if self.sample_size and len(training_text) >= self.sample_size:
+                    print(f"Using first {self.sample_size} documents for training")
+                    break
+
+            full_text = "\n".join(training_text)
+            print(f"\nTraining on {len(training_text)} documents...")
+            print(f"Total characters in training data: {len(full_text):,}")
+
+            start_time = time.time()
+            self.fit(full_text)
+            print(f"Training time: {time.time() - start_time:.2f} seconds")
+
+        except Exception as e:
+            print(f"Error loading dataset: {str(e)}")
+            print("Falling back to sample text...")
+            sample_text = """
+            తెలుగు భాష దక్షిణ భారతదేశంలోని ద్రావిడ భాషలలో ఒకటి.
+            ఆంధ్ర ప్రదేశ్ మరియు తెలంగాణ రాష్ట్రాల అధికార భాష.
+            """
+            self.fit(sample_text)
+
+
+if __name__ == "__main__":
+    # For quick testing, use a small sample
+    tokenizer = BPETokenizer(vocab_size=4999, sample_size=None)
+
+    vocab_file = 'telugu_tokenizer_vocab.json'
+    merges_file = 'telugu_tokenizer_merges.json'
+
+    if os.path.exists(vocab_file) and os.path.exists(merges_file):
+        print("Loading pre-trained tokenizer...")
+        tokenizer.load('telugu_tokenizer')
+    else:
+        print("Training new tokenizer...")
+        tokenizer.train_on_dataset()
+        tokenizer.save('telugu_tokenizer')
+
+    # Test the tokenizer
+    test_text = "తెలుగు భాష"
+    encoded = tokenizer.encode(test_text)
+    decoded = tokenizer.decode(encoded)
+
+    print("\nTest Results:")
+    print(f"Original: {test_text}")
+    print(f"Encoded: {encoded}")
+    print(f"Decoded: {decoded}")
+    print(f"Matches original: {test_text == decoded}")
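To make the training loop above concrete, here is a toy walkthrough of one merge step using the same pair-counting and pair-replacement logic as `get_stats()` / `merge()`; the token IDs are made up for illustration, and the duplicate-token check from the real `merge()` is omitted:

```python
def get_stats(ids):
    # Count frequencies of adjacent token pairs, as in BPETokenizer.get_stats.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # Replace every occurrence of `pair` with the new token ID `idx`.
    newids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

ids = [10, 11, 10, 11, 12, 10, 11]   # hypothetical token IDs
stats = get_stats(ids)                # {(10, 11): 3, (11, 10): 1, (11, 12): 1, (12, 10): 1}
best = max(stats, key=stats.get)      # (10, 11), the most frequent adjacent pair
print(merge(ids, best, 99))           # [99, 99, 12, 99]
```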
src/templates/index.html
ADDED
@@ -0,0 +1,134 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
<!DOCTYPE html>
|
2 |
+
<html>
|
3 |
+
<head>
|
4 |
+
<title>{{ title }}</title>
|
5 |
+
<script src="https://cdn.tailwindcss.com"></script>
|
6 |
+
</head>
|
7 |
+
<body class="bg-gray-100">
|
8 |
+
<div class="container mx-auto px-4 py-8">
|
9 |
+
<h1 class="text-3xl font-bold mb-8">Telugu BPE Tokenizer</h1>
|
10 |
+
|
11 |
+
<div class="bg-white rounded-lg shadow p-6">
|
12 |
+
<textarea
|
13 |
+
id="input-text"
|
14 |
+
class="w-full p-2 border rounded mb-4"
|
15 |
+
rows="4"
|
16 |
+
placeholder="Enter Telugu text here..."></textarea>
|
17 |
+
|
18 |
+
<button
|
19 |
+
onclick="tokenize()"
|
20 |
+
class="bg-blue-500 text-white px-4 py-2 rounded hover:bg-blue-600">
|
21 |
+
Tokenize
|
22 |
+
</button>
|
23 |
+
|
24 |
+
<div id="result" class="mt-6 hidden">
|
25 |
+
<h2 class="text-xl font-semibold mb-2">Results:</h2>
|
26 |
+
<div class="space-y-4">
|
27 |
+
<div>
|
28 |
+
<span class="font-medium">Tokens:</span>
|
29 |
+
<pre id="tokens" class="bg-gray-100 p-2 rounded mt-1"></pre>
|
30 |
+
</div>
|
31 |
+
<div>
|
32 |
+
<span class="font-medium">Decoded:</span>
|
33 |
+
<pre id="decoded" class="bg-gray-100 p-2 rounded mt-1"></pre>
|
34 |
+
</div>
|
35 |
+
<div>
|
36 |
+
<span class="font-medium">Token Details:</span>
|
37 |
+
<div id="token-details" class="bg-gray-100 p-2 rounded mt-1 overflow-x-auto">
|
38 |
+
<table class="min-w-full bg-white border rounded-lg overflow-hidden table-fixed">
|
39 |
+
<thead class="bg-gray-100">
|
40 |
+
<tr>
|
41 |
+
<th class="px-4 py-2 text-left w-1/4">Word</th>
|
42 |
+
<th class="px-4 py-2 text-left w-1/4">Type</th>
|
43 |
+
<th class="px-4 py-2 text-left w-2/4">Token Details</th>
|
44 |
+
</tr>
|
45 |
+
</thead>
|
46 |
+
<tbody id="token-details-body">
|
47 |
+
<!-- Token details will be inserted here -->
|
48 |
+
</tbody>
|
49 |
+
</table>
|
50 |
+
</div>
|
51 |
+
</div>
|
52 |
+
<div id="match-result"></div>
|
53 |
+
</div>
|
54 |
+
</div>
|
55 |
+
</div>
|
56 |
+
</div>
|
57 |
+
|
58 |
+
<script>
|
59 |
+
async function tokenize() {
|
60 |
+
const text = document.getElementById('input-text').value;
|
61 |
+
try {
|
62 |
+
const response = await fetch('/tokenize', {
|
63 |
+
method: 'POST',
|
64 |
+
headers: {
|
65 |
+
'Content-Type': 'application/json',
|
66 |
+
},
|
67 |
+
body: JSON.stringify({ text }),
|
68 |
+
});
|
69 |
+
|
70 |
+
const data = await response.json();
|
71 |
+
|
72 |
+
document.getElementById('result').classList.remove('hidden');
|
73 |
+
document.getElementById('tokens').textContent = JSON.stringify(data.tokens, null, 2);
|
74 |
+
document.getElementById('decoded').textContent = data.decoded;
|
75 |
+
|
76 |
+
// Display token details
|
77 |
+
const detailsBody = document.getElementById('token-details-body');
|
78 |
+
detailsBody.innerHTML = '';
|
79 |
+
|
80 |
+
data.token_details.forEach(detail => {
|
81 |
+
const row = document.createElement('tr');
|
82 |
+
row.className = 'border-b hover:bg-gray-50';
|
83 |
+
|
84 |
+
// Create table cells
|
85 |
+
const wordCell = document.createElement('td');
|
86 |
+
const typeCell = document.createElement('td');
|
87 |
+
const tokenCell = document.createElement('td');
|
88 |
+
|
89 |
+
// Set cell classes for vertical alignment and wrapping
|
90 |
+
wordCell.className = 'px-4 py-2 align-top font-mono border-r';
|
91 |
+
typeCell.className = 'px-4 py-2 align-top border-r';
|
92 |
+
tokenCell.className = 'px-4 py-2 align-top font-mono';
|
93 |
+
|
94 |
+
// Set content
|
95 |
+
wordCell.textContent = detail.word;
|
96 |
+
typeCell.textContent = detail.type;
|
97 |
+
|
98 |
+
// Create a container for token details to ensure proper spacing
|
99 |
+
const tokenList = document.createElement('div');
|
100 |
+
tokenList.className = 'space-y-1';
|
101 |
+
|
102 |
+
if (detail.type === 'complete_word') {
|
103 |
+
const tokenDiv = document.createElement('div');
|
104 |
+
tokenDiv.textContent = `ID ${detail.token_id}: "${detail.text}"`;
|
105 |
+
tokenList.appendChild(tokenDiv);
|
106 |
+
} else if (detail.type === 'subword_tokens') {
|
107 |
+
detail.tokens.forEach(t => {
|
108 |
+
const tokenDiv = document.createElement('div');
|
109 |
+
tokenDiv.textContent = `ID ${t.id}: "${t.text}"`;
|
110 |
+
tokenList.appendChild(tokenDiv);
|
111 |
+
});
|
112 |
+
}
|
113 |
+
|
114 |
+
tokenCell.appendChild(tokenList);
|
115 |
+
|
116 |
+
// Add cells to row
|
117 |
+
row.appendChild(wordCell);
|
118 |
+
row.appendChild(typeCell);
|
119 |
+
row.appendChild(tokenCell);
|
120 |
+
|
121 |
+
detailsBody.appendChild(row);
|
122 |
+
});
|
123 |
+
|
            const matchEl = document.getElementById('match-result');
            matchEl.textContent = data.matches ? '✅ Perfect match!' : '❌ Mismatch';
            matchEl.className = data.matches ? 'text-green-600' : 'text-red-600';
        } catch (error) {
            console.error('Error:', error);
            alert('Error tokenizing text: ' + error.message);
        }
    }
</script>
</body>
</html>
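
As a reference for the request/response contract this script relies on, here is a minimal sketch of calling the /tokenize endpoint outside the browser. It assumes the Space's container is running locally on port 7860 and that the `requests` package is available; the field names (`tokens`, `decoded`, `token_details`, `matches`) are taken from what the JavaScript above reads, so the real response from src/app.py may differ in detail.

# Hedged sketch: exercise the /tokenize endpoint the page script calls.
# Assumes the app is reachable at http://localhost:7860 and that the
# `requests` package is installed; adjust the host/port as needed.
import requests

resp = requests.post(
    "http://localhost:7860/tokenize",
    json={"text": "..."},  # put Telugu input text here
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

print(data["tokens"])    # token IDs produced by the BPE tokenizer
print(data["decoded"])   # text reconstructed from those IDs
print(data["matches"])   # True when the decoded text equals the input
for detail in data["token_details"]:
    # each entry describes one word: either a single complete-word token
    # or a list of subword tokens, mirroring the table rendered above
    print(detail["word"], detail["type"])
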
telugu_base_vocab.json
ADDED
The diff for this file is too large to render. See raw diff

telugu_tokenizer_merges.json
ADDED
The diff for this file is too large to render. See raw diff

telugu_tokenizer_vocab.json
ADDED
The diff for this file is too large to render. See raw diff

training_logs.log
ADDED
@@ -0,0 +1,376 @@
1 |
+
(session10) (base) Chaitanyas-MacBook-Pro:telugu-tokenizer chaitanyasagargurujula$ python src/bpe_tokenizer.py
|
2 |
+
Loading existing base vocabulary...
|
3 |
+
Training new tokenizer...
|
4 |
+
Loading dataset...
|
5 |
+
Preparing training text...
|
6 |
+
Loading documents: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 83866/83866 [00:00<00:00, 88094.70it/s]
|
7 |
+
|
8 |
+
Training on 167732 documents...
|
9 |
+
Total characters in training data: 105,279,512
|
10 |
+
Converting text to token IDs using base vocabulary...
|
11 |
+
|
12 |
+
Before training: text bytes length: 283,496,279
|
13 |
+
Processing 45 chunks using 11 cores...
|
14 |
+
Initial tokenization: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 45/45 [00:04<00:00, 9.95it/s]
|
15 |
+
|
16 |
+
Base vocabulary size: 2400
|
17 |
+
Initial sequence length: 105836015
|
18 |
+
Training BPE: 0%| | 1/2599 [00:37<26:47:26, 37.12s/it]Merge for (304, 333) already exists in the vocabulary.
|
19 |
+
Training BPE: 0%| | 4/2599 [01:26<13:45:51, 19.10s/it]Merge for (296, 333) already exists in the vocabulary.
|
20 |
+
Training BPE: 0%|โ | 6/2599 [01:57<12:20:12, 17.13s/it]Merge for (312, 333) already exists in the vocabulary.
|
21 |
+
Training BPE: 1%|โ | 16/2599 [04:29<10:44:21, 14.97s/it]Merge for (783, 296) already exists in the vocabulary.
|
22 |
+
Training BPE: 1%|โ | 19/2599 [05:13<10:29:00, 14.63s/it]Merge for (296, 319) already exists in the vocabulary.
|
23 |
+
Training BPE: 1%|โ | 23/2599 [06:10<10:13:44, 14.30s/it]Merge for (277, 333) already exists in the vocabulary.
|
24 |
+
Training BPE: 1%|โ | 27/2599 [07:06<10:01:51, 14.04s/it]Merge for (309, 319) already exists in the vocabulary.
|
25 |
+
Training BPE: 1%|โ | 29/2599 [07:33<9:54:13, 13.87s/it]Merge for (282, 327) already exists in the vocabulary.
|
26 |
+
Training BPE: 1%|โ | 34/2599 [08:41<9:39:29, 13.56s/it]Merge for (302, 318) already exists in the vocabulary.
|
27 |
+
Training BPE: 1%|โโ | 38/2599 [09:35<9:36:41, 13.51s/it]Merge for (304, 318) already exists in the vocabulary.
|
28 |
+
Training BPE: 2%|โโ | 39/2599 [09:48<9:34:36, 13.47s/it]Merge for (298, 2403) already exists in the vocabulary.
|
29 |
+
Training BPE: 2%|โโ | 41/2599 [10:15<9:31:03, 13.39s/it]Merge for (1023, 292) already exists in the vocabulary.
|
30 |
+
Training BPE: 2%|โโ | 43/2599 [10:41<9:25:50, 13.28s/it]Merge for (292, 333) already exists in the vocabulary.
|
31 |
+
Training BPE: 2%|โโ | 48/2599 [11:46<9:13:28, 13.02s/it]Merge for (277, 321) already exists in the vocabulary.
|
32 |
+
Training BPE: 2%|โโ | 50/2599 [12:12<9:08:03, 12.90s/it]Merge for (304, 319) already exists in the vocabulary.
|
33 |
+
Training BPE: 2%|โโ | 55/2599 [13:16<9:04:11, 12.83s/it]Merge for (309, 318) already exists in the vocabulary.
|
34 |
+
Training BPE: 2%|โโ | 58/2599 [13:54<8:59:41, 12.74s/it]Merge for (294, 333) already exists in the vocabulary.
|
35 |
+
Training BPE: 2%|โโ | 61/2599 [14:32<8:56:47, 12.69s/it]Merge for (306, 2412) already exists in the vocabulary.
|
36 |
+
Training BPE: 3%|โโ | 66/2599 [15:34<8:46:39, 12.47s/it]Merge for (292, 319) already exists in the vocabulary.
|
37 |
+
Training BPE: 3%|โโ | 68/2599 [15:59<8:43:38, 12.41s/it]Merge for (287, 2412) already exists in the vocabulary.
|
38 |
+
Training BPE: 3%|โโ | 69/2599 [16:12<8:43:06, 12.41s/it]Merge for (304, 321) already exists in the vocabulary.
|
39 |
+
Training BPE: 3%|โโโ | 70/2599 [16:24<8:41:57, 12.38s/it]Merge for (287, 2438) already exists in the vocabulary.
|
40 |
+
Training BPE: 3%|โโโ | 72/2599 [16:48<8:38:32, 12.31s/it]Merge for (403, 311) already exists in the vocabulary.
|
41 |
+
Training BPE: 3%|โโโ | 75/2599 [17:25<8:35:33, 12.26s/it]Merge for (296, 321) already exists in the vocabulary.
|
42 |
+
Training BPE: 3%|โโโ | 76/2599 [17:37<8:34:11, 12.23s/it]Merge for (289, 319) already exists in the vocabulary.
|
43 |
+
Training BPE: 3%|โโโ | 77/2599 [17:49<8:30:31, 12.15s/it]Merge for (309, 327) already exists in the vocabulary.
|
44 |
+
Training BPE: 3%|โโโ | 78/2599 [18:01<8:30:18, 12.15s/it]Merge for (298, 2457) already exists in the vocabulary.
|
45 |
+
Training BPE: 3%|โโโ | 80/2599 [18:26<8:28:40, 12.12s/it]Merge for (277, 318) already exists in the vocabulary.
|
46 |
+
Training BPE: 3%|โโโ | 83/2599 [19:02<8:32:24, 12.22s/it]Merge for (282, 333) already exists in the vocabulary.
|
47 |
+
Training BPE: 3%|โโโ | 84/2599 [19:15<8:33:27, 12.25s/it]Merge for (277, 331) already exists in the vocabulary.
|
48 |
+
Training BPE: 3%|โโโ | 86/2599 [19:39<8:31:13, 12.21s/it]Merge for (289, 333) already exists in the vocabulary.
|
49 |
+
Training BPE: 3%|โโโ | 90/2599 [20:27<8:25:58, 12.10s/it]Merge for (277, 330) already exists in the vocabulary.
|
50 |
+
Training BPE: 4%|โโโ | 91/2599 [20:39<8:25:16, 12.09s/it]Merge for (300, 318) already exists in the vocabulary.
|
51 |
+
Training BPE: 4%|โโโ | 94/2599 [21:15<8:23:51, 12.07s/it]Merge for (298, 328) already exists in the vocabulary.
|
52 |
+
Training BPE: 4%|โโโ | 96/2599 [21:39<8:21:06, 12.01s/it]Merge for (1023, 287) already exists in the vocabulary.
|
53 |
+
Training BPE: 4%|โโโ | 99/2599 [22:15<8:13:43, 11.85s/it]
|
54 |
+
Vocab size: 2500: เฐ + เฐฌ = เฐเฐฌ
|
55 |
+
Merge for (298, 318) already exists in the vocabulary.
|
56 |
+
Training BPE: 4%|โโโ | 100/2599 [22:27<8:15:33, 11.90s/it]Merge for (306, 331) already exists in the vocabulary.
|
57 |
+
Training BPE: 4%|โโโ | 104/2599 [23:14<8:14:28, 11.89s/it]Merge for (298, 331) already exists in the vocabulary.
|
58 |
+
Training BPE: 4%|โโโโ | 106/2599 [23:38<8:11:43, 11.83s/it]Merge for (307, 2412) already exists in the vocabulary.
|
59 |
+
Training BPE: 4%|โโโโ | 110/2599 [24:24<8:05:58, 11.71s/it]Merge for (1023, 293) already exists in the vocabulary.
|
60 |
+
Training BPE: 4%|โโโโ | 111/2599 [24:36<8:04:27, 11.68s/it]Merge for (503, 282) already exists in the vocabulary.
|
61 |
+
Training BPE: 4%|โโโโ | 112/2599 [24:47<7:59:29, 11.57s/it]Merge for (311, 2438) already exists in the vocabulary.
|
62 |
+
Training BPE: 4%|โโโโ | 113/2599 [24:59<7:58:49, 11.56s/it]Merge for (279, 321) already exists in the vocabulary.
|
63 |
+
Training BPE: 4%|โโโโ | 115/2599 [25:22<7:56:27, 11.51s/it]Merge for (303, 318) already exists in the vocabulary.
|
64 |
+
Training BPE: 4%|โโโโ | 116/2599 [25:33<7:56:28, 11.51s/it]Merge for (312, 320) already exists in the vocabulary.
|
65 |
+
Training BPE: 5%|โโโโ | 117/2599 [25:45<7:55:38, 11.50s/it]Merge for (306, 327) already exists in the vocabulary.
|
66 |
+
Training BPE: 5%|โโโโ | 118/2599 [25:56<7:54:21, 11.47s/it]Merge for (296, 327) already exists in the vocabulary.
|
67 |
+
Training BPE: 5%|โโโโ | 121/2599 [26:32<8:06:55, 11.79s/it]Merge for (282, 326) already exists in the vocabulary.
|
68 |
+
Training BPE: 5%|โโโโ | 122/2599 [26:44<8:02:47, 11.69s/it]Merge for (298, 326) already exists in the vocabulary.
|
69 |
+
Training BPE: 5%|โโโโ | 124/2599 [27:06<7:55:34, 11.53s/it]Merge for (287, 320) already exists in the vocabulary.
|
70 |
+
Training BPE: 5%|โโโโ | 126/2599 [27:29<7:50:27, 11.41s/it]Merge for (304, 326) already exists in the vocabulary.
|
71 |
+
Training BPE: 5%|โโโโ | 127/2599 [27:40<7:47:41, 11.35s/it]Merge for (294, 327) already exists in the vocabulary.
|
72 |
+
Training BPE: 5%|โโโโ | 129/2599 [28:03<7:45:52, 11.32s/it]Merge for (312, 319) already exists in the vocabulary.
|
73 |
+
Training BPE: 5%|โโโโ | 133/2599 [28:49<7:50:25, 11.45s/it]Merge for (304, 331) already exists in the vocabulary.
|
74 |
+
Training BPE: 5%|โโโโ | 134/2599 [29:00<7:45:20, 11.33s/it]Merge for (703, 292) already exists in the vocabulary.
|
75 |
+
Training BPE: 5%|โโโโ | 137/2599 [29:33<7:42:08, 11.26s/it]Merge for (277, 327) already exists in the vocabulary.
|
76 |
+
Training BPE: 5%|โโโโโ | 142/2599 [30:29<7:37:31, 11.17s/it]Merge for (306, 333) already exists in the vocabulary.
|
77 |
+
Training BPE: 6%|โโโโโ | 144/2599 [30:51<7:34:32, 11.11s/it]Merge for (302, 319) already exists in the vocabulary.
|
78 |
+
Training BPE: 6%|โโโโโ | 145/2599 [31:03<7:34:21, 11.11s/it]Merge for (310, 318) already exists in the vocabulary.
|
79 |
+
Training BPE: 6%|โโโโโ | 148/2599 [31:36<7:29:56, 11.01s/it]Merge for (277, 2403) already exists in the vocabulary.
|
80 |
+
Training BPE: 6%|โโโโโ | 149/2599 [31:47<7:29:45, 11.01s/it]Merge for (304, 322) already exists in the vocabulary.
|
81 |
+
Training BPE: 6%|โโโโโ | 150/2599 [31:58<7:29:13, 11.01s/it]Merge for (302, 321) already exists in the vocabulary.
|
82 |
+
Training BPE: 6%|โโโโโ | 152/2599 [32:20<7:29:14, 11.02s/it]Merge for (743, 294) already exists in the vocabulary.
|
83 |
+
Training BPE: 6%|โโโโโ | 156/2599 [33:03<7:21:49, 10.85s/it]Merge for (294, 2414) already exists in the vocabulary.
|
84 |
+
Training BPE: 6%|โโโโโ | 157/2599 [33:14<7:23:35, 10.90s/it]Merge for (403, 277) already exists in the vocabulary.
|
85 |
+
Training BPE: 6%|โโโโโ | 158/2599 [33:25<7:23:33, 10.90s/it]Merge for (643, 289) already exists in the vocabulary.
|
86 |
+
Training BPE: 6%|โโโโโ | 159/2599 [33:35<7:20:09, 10.82s/it]Merge for (306, 319) already exists in the vocabulary.
|
87 |
+
Training BPE: 6%|โโโโโ | 162/2599 [34:08<7:19:58, 10.83s/it]Merge for (277, 322) already exists in the vocabulary.
|
88 |
+
Training BPE: 6%|โโโโโ | 164/2599 [34:30<7:17:59, 10.79s/it]Merge for (703, 309) already exists in the vocabulary.
|
89 |
+
Training BPE: 6%|โโโโโ | 166/2599 [34:51<7:16:49, 10.77s/it]Merge for (292, 2403) already exists in the vocabulary.
|
90 |
+
Training BPE: 6%|โโโโโ | 168/2599 [35:13<7:15:41, 10.75s/it]Merge for (304, 327) already exists in the vocabulary.
|
91 |
+
Training BPE: 7%|โโโโโ | 170/2599 [35:34<7:16:07, 10.77s/it]Merge for (403, 287) already exists in the vocabulary.
|
92 |
+
Training BPE: 7%|โโโโโโ | 174/2599 [36:17<7:13:15, 10.72s/it]Merge for (309, 326) already exists in the vocabulary.
|
93 |
+
Training BPE: 7%|โโโโโโ | 175/2599 [36:28<7:13:08, 10.72s/it]Merge for (301, 321) already exists in the vocabulary.
|
94 |
+
Training BPE: 7%|โโโโโโ | 179/2599 [37:10<7:10:36, 10.68s/it]Merge for (294, 319) already exists in the vocabulary.
|
95 |
+
Training BPE: 7%|โโโโโโ | 181/2599 [37:32<7:11:27, 10.71s/it]Merge for (284, 320) already exists in the vocabulary.
|
96 |
+
Training BPE: 7%|โโโโโโ | 189/2599 [38:58<7:19:21, 10.94s/it]Merge for (296, 318) already exists in the vocabulary.
|
97 |
+
Training BPE: 7%|โโโโโโ | 191/2599 [39:20<7:14:50, 10.83s/it]Merge for (302, 2537) already exists in the vocabulary.
|
98 |
+
Training BPE: 7%|โโโโโโ | 192/2599 [39:30<7:09:39, 10.71s/it]Merge for (302, 326) already exists in the vocabulary.
|
99 |
+
Training BPE: 7%|โโโโโโ | 193/2599 [39:41<7:07:00, 10.65s/it]Merge for (306, 321) already exists in the vocabulary.
|
100 |
+
Training BPE: 7%|โโโโโโ | 194/2599 [39:51<7:04:38, 10.59s/it]Merge for (279, 318) already exists in the vocabulary.
|
101 |
+
Training BPE: 8%|โโโโโโ | 195/2599 [40:02<7:03:07, 10.56s/it]Merge for (279, 2403) already exists in the vocabulary.
|
102 |
+
Training BPE: 8%|โโโโโโ | 196/2599 [40:12<7:03:23, 10.57s/it]Merge for (294, 318) already exists in the vocabulary.
|
103 |
+
Training BPE: 8%|โโโโโโ | 197/2599 [40:23<7:02:25, 10.55s/it]Merge for (284, 2414) already exists in the vocabulary.
|
104 |
+
Training BPE: 8%|โโโโโโ | 199/2599 [40:43<6:56:46, 10.42s/it]
|
105 |
+
Vocab size: 2600: เฐทเฑเฐ + เฑเฐฐ = เฐทเฑเฐเฑเฐฐ
|
106 |
+
Training BPE: 8%|โโโโโโ | 200/2599 [40:54<6:56:33, 10.42s/it]Merge for (294, 321) already exists in the vocabulary.
|
107 |
+
Training BPE: 8%|โโโโโโ | 202/2599 [41:15<6:55:18, 10.40s/it]Merge for (312, 326) already exists in the vocabulary.
|
108 |
+
Training BPE: 8%|โโโโโโ | 204/2599 [41:35<6:54:53, 10.39s/it]Merge for (313, 328) already exists in the vocabulary.
|
109 |
+
Training BPE: 8%|โโโโโโโ | 205/2599 [41:46<6:54:28, 10.39s/it]Merge for (289, 318) already exists in the vocabulary.
|
110 |
+
Training BPE: 8%|โโโโโโโ | 208/2599 [42:17<6:55:27, 10.43s/it]Merge for (292, 320) already exists in the vocabulary.
|
111 |
+
Training BPE: 8%|โโโโโโโ | 214/2599 [43:19<6:49:01, 10.29s/it]Merge for (296, 320) already exists in the vocabulary.
|
112 |
+
Training BPE: 8%|โโโโโโโ | 215/2599 [43:29<6:49:19, 10.30s/it]Merge for (294, 320) already exists in the vocabulary.
|
113 |
+
Training BPE: 8%|โโโโโโโ | 216/2599 [43:40<6:49:14, 10.30s/it]Merge for (287, 319) already exists in the vocabulary.
|
114 |
+
Training BPE: 8%|โโโโโโโ | 220/2599 [44:21<6:43:30, 10.18s/it]Merge for (309, 320) already exists in the vocabulary.
|
115 |
+
Training BPE: 9%|โโโโโโโ | 222/2599 [44:41<6:43:40, 10.19s/it]Merge for (295, 2414) already exists in the vocabulary.
|
116 |
+
Training BPE: 9%|โโโโโโโ | 230/2599 [46:03<6:43:24, 10.22s/it]Merge for (300, 320) already exists in the vocabulary.
|
117 |
+
Training BPE: 9%|โโโโโโโ | 231/2599 [46:13<6:43:10, 10.22s/it]Merge for (310, 2403) already exists in the vocabulary.
|
118 |
+
Training BPE: 9%|โโโโโโโ | 234/2599 [46:43<6:42:05, 10.20s/it]Merge for (783, 303) already exists in the vocabulary.
|
119 |
+
Training BPE: 9%|โโโโโโโโ | 240/2599 [47:45<6:38:23, 10.13s/it]Merge for (298, 327) already exists in the vocabulary.
|
120 |
+
Training BPE: 9%|โโโโโโโโ | 243/2599 [48:15<6:35:26, 10.07s/it]Merge for (310, 333) already exists in the vocabulary.
|
121 |
+
Training BPE: 9%|โโโโโโโโ | 246/2599 [48:45<6:33:10, 10.03s/it]Merge for (312, 318) already exists in the vocabulary.
|
122 |
+
Training BPE: 10%|โโโโโโโโ | 250/2599 [49:25<6:30:14, 9.97s/it]Merge for (306, 318) already exists in the vocabulary.
|
123 |
+
Training BPE: 10%|โโโโโโโโ | 251/2599 [49:35<6:31:08, 10.00s/it]Merge for (302, 328) already exists in the vocabulary.
|
124 |
+
Training BPE: 10%|โโโโโโโโ | 252/2599 [49:45<6:31:56, 10.02s/it]Merge for (309, 2414) already exists in the vocabulary.
|
125 |
+
Training BPE: 10%|โโโโโโโโ | 257/2599 [50:35<6:30:24, 10.00s/it]Merge for (298, 320) already exists in the vocabulary.
|
126 |
+
Training BPE: 10%|โโโโโโโโ | 258/2599 [50:45<6:30:24, 10.01s/it]Merge for (289, 321) already exists in the vocabulary.
|
127 |
+
Training BPE: 10%|โโโโโโโโ | 260/2599 [51:04<6:28:06, 9.96s/it]Merge for (300, 333) already exists in the vocabulary.
|
128 |
+
Training BPE: 10%|โโโโโโโโ | 263/2599 [51:34<6:26:37, 9.93s/it]Merge for (312, 321) already exists in the vocabulary.
|
129 |
+
Training BPE: 10%|โโโโโโโโ | 266/2599 [52:04<6:25:29, 9.91s/it]Merge for (311, 333) already exists in the vocabulary.
|
130 |
+
Training BPE: 10%|โโโโโโโโ | 268/2599 [52:24<6:24:38, 9.90s/it]Merge for (298, 321) already exists in the vocabulary.
|
131 |
+
Training BPE: 10%|โโโโโโโโ | 269/2599 [52:34<6:24:45, 9.91s/it]Merge for (312, 258) already exists in the vocabulary.
|
132 |
+
Training BPE: 10%|โโโโโโโโโ | 271/2599 [52:54<6:31:20, 10.09s/it]Merge for (284, 318) already exists in the vocabulary.
|
133 |
+
Training BPE: 11%|โโโโโโโโโ | 275/2599 [53:35<6:30:29, 10.08s/it]Merge for (302, 331) already exists in the vocabulary.
|
134 |
+
Training BPE: 11%|โโโโโโโโโ | 278/2599 [54:04<6:19:09, 9.80s/it]Merge for (923, 310) already exists in the vocabulary.
|
135 |
+
Training BPE: 11%|โโโโโโโโโ | 283/2599 [54:53<6:17:56, 9.79s/it]Merge for (743, 295) already exists in the vocabulary.
|
136 |
+
Training BPE: 11%|โโโโโโโโโ | 286/2599 [55:22<6:16:20, 9.76s/it]Merge for (304, 320) already exists in the vocabulary.
|
137 |
+
Training BPE: 11%|โโโโโโโโโ | 289/2599 [55:52<6:13:59, 9.71s/it]Merge for (309, 328) already exists in the vocabulary.
|
138 |
+
Training BPE: 11%|โโโโโโโโโ | 291/2599 [56:11<6:13:59, 9.72s/it]Merge for (282, 319) already exists in the vocabulary.
|
139 |
+
Training BPE: 11%|โโโโโโโโโ | 293/2599 [56:30<6:13:50, 9.73s/it]Merge for (279, 333) already exists in the vocabulary.
|
140 |
+
Training BPE: 11%|โโโโโโโโโ | 297/2599 [57:09<6:09:01, 9.62s/it]Merge for (292, 2414) already exists in the vocabulary.
|
141 |
+
Training BPE: 12%|โโโโโโโโโ | 299/2599 [57:28<6:09:03, 9.63s/it]
|
142 |
+
Vocab size: 2700: เฐ + เฐพเฐฐ = เฐเฐพเฐฐ
|
143 |
+
Training BPE: 12%|โโโโโโโโโ | 302/2599 [57:57<6:07:00, 9.59s/it]Merge for (302, 320) already exists in the vocabulary.
|
144 |
+
Training BPE: 12%|โโโโโโโโโโ | 308/2599 [58:54<6:05:50, 9.58s/it]Merge for (302, 327) already exists in the vocabulary.
|
145 |
+
Training BPE: 12%|โโโโโโโโโโ | 310/2599 [59:13<6:03:35, 9.53s/it]Merge for (304, 328) already exists in the vocabulary.
|
146 |
+
Training BPE: 12%|โโโโโโโโโโ | 321/2599 [1:00:58<5:58:43, 9.45s/it]Merge for (303, 2414) already exists in the vocabulary.
|
147 |
+
Training BPE: 12%|โโโโโโโโโโ | 322/2599 [1:01:07<5:58:57, 9.46s/it]Merge for (292, 318) already exists in the vocabulary.
|
148 |
+
Training BPE: 13%|โโโโโโโโโโ | 326/2599 [1:01:45<5:55:14, 9.38s/it]Merge for (289, 320) already exists in the vocabulary.
|
149 |
+
Training BPE: 13%|โโโโโโโโโโ | 331/2599 [1:02:32<5:54:40, 9.38s/it]Merge for (287, 333) already exists in the vocabulary.
|
150 |
+
Training BPE: 13%|โโโโโโโโโโ | 332/2599 [1:02:41<5:54:26, 9.38s/it]Merge for (287, 321) already exists in the vocabulary.
|
151 |
+
Training BPE: 13%|โโโโโโโโโโ | 341/2599 [1:04:05<5:50:36, 9.32s/it]Merge for (284, 327) already exists in the vocabulary.
|
152 |
+
Training BPE: 14%|โโโโโโโโโโโ | 351/2599 [1:05:38<5:50:00, 9.34s/it]Merge for (277, 326) already exists in the vocabulary.
|
153 |
+
Training BPE: 14%|โโโโโโโโโโโ | 353/2599 [1:05:57<5:48:10, 9.30s/it]Merge for (1023, 298) already exists in the vocabulary.
|
154 |
+
Training BPE: 14%|โโโโโโโโโโโ | 361/2599 [1:07:11<5:45:45, 9.27s/it]Merge for (302, 322) already exists in the vocabulary.
|
155 |
+
Training BPE: 14%|โโโโโโโโโโโ | 363/2599 [1:07:30<5:44:06, 9.23s/it]Merge for (287, 2403) already exists in the vocabulary.
|
156 |
+
Training BPE: 14%|โโโโโโโโโโโ | 372/2599 [1:08:52<5:41:05, 9.19s/it]Merge for (295, 318) already exists in the vocabulary.
|
157 |
+
Training BPE: 15%|โโโโโโโโโโโ | 377/2599 [1:09:38<5:38:03, 9.13s/it]Merge for (279, 330) already exists in the vocabulary.
|
158 |
+
Training BPE: 15%|โโโโโโโโโโโ | 380/2599 [1:10:06<5:37:53, 9.14s/it]Merge for (298, 333) already exists in the vocabulary.
|
159 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 385/2599 [1:10:51<5:34:22, 9.06s/it]Merge for (309, 333) already exists in the vocabulary.
|
160 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 388/2599 [1:11:18<5:34:17, 9.07s/it]Merge for (302, 330) already exists in the vocabulary.
|
161 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 391/2599 [1:11:45<5:32:07, 9.03s/it]Merge for (278, 2414) already exists in the vocabulary.
|
162 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 392/2599 [1:11:54<5:30:20, 8.98s/it]Merge for (301, 318) already exists in the vocabulary.
|
163 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 399/2599 [1:12:57<5:30:04, 9.00s/it]
|
164 |
+
Vocab size: 2800: เฐฒ + เฑเฐธ = เฐฒเฑเฐธ
|
165 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 400/2599 [1:13:06<5:30:41, 9.02s/it]Merge for (843, 300) already exists in the vocabulary.
|
166 |
+
Training BPE: 16%|โโโโโโโโโโโโ | 405/2599 [1:13:52<5:30:04, 9.03s/it]Merge for (298, 330) already exists in the vocabulary.
|
167 |
+
Training BPE: 16%|โโโโโโโโโโโโโ | 420/2599 [1:16:05<5:24:36, 8.94s/it]Merge for (282, 322) already exists in the vocabulary.
|
168 |
+
Training BPE: 16%|โโโโโโโโโโโโโ | 422/2599 [1:16:23<5:23:56, 8.93s/it]Merge for (923, 292) already exists in the vocabulary.
|
169 |
+
Training BPE: 16%|โโโโโโโโโโโโโ | 423/2599 [1:16:32<5:23:55, 8.93s/it]Merge for (301, 2414) already exists in the vocabulary.
|
170 |
+
Training BPE: 16%|โโโโโโโโโโโโโ | 425/2599 [1:16:50<5:22:38, 8.90s/it]Merge for (279, 331) already exists in the vocabulary.
|
171 |
+
Training BPE: 16%|โโโโโโโโโโโโโ | 428/2599 [1:17:17<5:24:58, 8.98s/it]Merge for (303, 319) already exists in the vocabulary.
|
172 |
+
Training BPE: 17%|โโโโโโโโโโโโโ | 446/2599 [1:19:56<5:15:42, 8.80s/it]Merge for (313, 318) already exists in the vocabulary.
|
173 |
+
Training BPE: 17%|โโโโโโโโโโโโโโ | 449/2599 [1:20:23<5:17:26, 8.86s/it]Merge for (301, 319) already exists in the vocabulary.
|
174 |
+
Training BPE: 17%|โโโโโโโโโโโโโโ | 450/2599 [1:20:31<5:16:23, 8.83s/it]Merge for (277, 319) already exists in the vocabulary.
|
175 |
+
Training BPE: 17%|โโโโโโโโโโโโโโ | 452/2599 [1:20:49<5:15:23, 8.81s/it]Merge for (312, 331) already exists in the vocabulary.
|
176 |
+
Training BPE: 17%|โโโโโโโโโโโโโโ | 453/2599 [1:20:58<5:15:21, 8.82s/it]Merge for (284, 319) already exists in the vocabulary.
|
177 |
+
Training BPE: 17%|โโโโโโโโโโโโโโ | 454/2599 [1:21:07<5:14:49, 8.81s/it]Merge for (312, 327) already exists in the vocabulary.
|
178 |
+
Training BPE: 18%|โโโโโโโโโโโโโโ | 460/2599 [1:21:59<5:12:18, 8.76s/it]Merge for (287, 326) already exists in the vocabulary.
|
179 |
+
Training BPE: 18%|โโโโโโโโโโโโโโ | 462/2599 [1:22:17<5:13:12, 8.79s/it]Merge for (313, 326) already exists in the vocabulary.
|
180 |
+
Training BPE: 18%|โโโโโโโโโโโโโโ | 465/2599 [1:22:43<5:13:17, 8.81s/it]Merge for (284, 326) already exists in the vocabulary.
|
181 |
+
Training BPE: 18%|โโโโโโโโโโโโโโ | 471/2599 [1:23:36<5:09:57, 8.74s/it]Merge for (277, 323) already exists in the vocabulary.
|
182 |
+
Training BPE: 18%|โโโโโโโโโโโโโโ | 475/2599 [1:24:11<5:05:57, 8.64s/it]Merge for (298, 319) already exists in the vocabulary.
|
183 |
+
Training BPE: 19%|โโโโโโโโโโโโโโโ | 485/2599 [1:25:37<5:03:47, 8.62s/it]Merge for (310, 319) already exists in the vocabulary.
|
184 |
+
Training BPE: 19%|โโโโโโโโโโโโโโโ | 489/2599 [1:26:11<5:03:59, 8.64s/it]Merge for (312, 322) already exists in the vocabulary.
|
185 |
+
Training BPE: 19%|โโโโโโโโโโโโโโโ | 494/2599 [1:26:55<5:02:17, 8.62s/it]Merge for (301, 322) already exists in the vocabulary.
|
186 |
+
Training BPE: 19%|โโโโโโโโโโโโโโโ | 499/2599 [1:27:38<5:01:00, 8.60s/it]
|
187 |
+
Vocab size: 2900: เฐจ + เฑเฐจเฑเฐจ = เฐจเฑเฐจเฑเฐจ
|
188 |
+
Training BPE: 19%|โโโโโโโโโโโโโโโ | 503/2599 [1:28:12<4:57:10, 8.51s/it]Merge for (279, 319) already exists in the vocabulary.
|
189 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโ | 515/2599 [1:29:54<4:55:09, 8.50s/it]Merge for (300, 321) already exists in the vocabulary.
|
190 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโโ | 520/2599 [1:30:37<4:55:10, 8.52s/it]Merge for (312, 328) already exists in the vocabulary.
|
191 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโโ | 524/2599 [1:31:11<4:57:02, 8.59s/it]Merge for (303, 322) already exists in the vocabulary.
|
192 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโโ | 525/2599 [1:31:20<4:57:28, 8.61s/it]Merge for (963, 309) already exists in the vocabulary.
|
193 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโโ | 526/2599 [1:31:28<4:56:30, 8.58s/it]Merge for (299, 319) already exists in the vocabulary.
|
194 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโโ | 527/2599 [1:31:37<4:56:00, 8.57s/it]Merge for (300, 326) already exists in the vocabulary.
|
195 |
+
Training BPE: 21%|โโโโโโโโโโโโโโโโ | 535/2599 [1:32:44<4:51:41, 8.48s/it]Merge for (443, 279) already exists in the vocabulary.
|
196 |
+
Training BPE: 21%|โโโโโโโโโโโโโโโโ | 536/2599 [1:32:53<4:51:43, 8.48s/it]Merge for (300, 331) already exists in the vocabulary.
|
197 |
+
Training BPE: 21%|โโโโโโโโโโโโโโโโ | 537/2599 [1:33:01<4:52:00, 8.50s/it]Merge for (306, 320) already exists in the vocabulary.
|
198 |
+
Training BPE: 21%|โโโโโโโโโโโโโโโโ | 539/2599 [1:33:19<4:53:04, 8.54s/it]Merge for (703, 312) already exists in the vocabulary.
|
199 |
+
Training BPE: 22%|โโโโโโโโโโโโโโโโโ | 563/2599 [1:36:40<4:45:12, 8.41s/it]Merge for (1023, 303) already exists in the vocabulary.
|
200 |
+
Training BPE: 22%|โโโโโโโโโโโโโโโโโ | 566/2599 [1:37:06<4:44:06, 8.38s/it]Merge for (292, 330) already exists in the vocabulary.
|
201 |
+
Training BPE: 22%|โโโโโโโโโโโโโโโโโ | 568/2599 [1:37:22<4:44:37, 8.41s/it]Merge for (294, 2403) already exists in the vocabulary.
|
202 |
+
Training BPE: 22%|โโโโโโโโโโโโโโโโโ | 579/2599 [1:38:55<4:43:24, 8.42s/it]Merge for (306, 328) already exists in the vocabulary.
|
203 |
+
Training BPE: 22%|โโโโโโโโโโโโโโโโโ | 581/2599 [1:39:12<4:43:46, 8.44s/it]Merge for (923, 282) already exists in the vocabulary.
|
204 |
+
Training BPE: 23%|โโโโโโโโโโโโโโโโโโ | 597/2599 [1:41:26<4:38:58, 8.36s/it]Merge for (309, 323) already exists in the vocabulary.
|
205 |
+
Training BPE: 23%|โโโโโโโโโโโโโโโโโโ | 599/2599 [1:41:43<4:39:47, 8.39s/it]
|
206 |
+
Vocab size: 3000: (เฐเฐเฐงเฑเฐฐเฐเฑเฐฏเฑเฐคเฐฟ) + : = (เฐเฐเฐงเฑเฐฐเฐเฑเฐฏเฑเฐคเฐฟ):
|
207 |
+
Training BPE: 23%|โโโโโโโโโโโโโโโโโโ | 601/2599 [1:42:00<4:38:34, 8.37s/it]Merge for (923, 302) already exists in the vocabulary.
|
208 |
+
Training BPE: 23%|โโโโโโโโโโโโโโโโโโ | 609/2599 [1:43:06<4:36:56, 8.35s/it]Merge for (923, 293) already exists in the vocabulary.
|
209 |
+
Training BPE: 23%|โโโโโโโโโโโโโโโโโโ | 610/2599 [1:43:15<4:37:18, 8.37s/it]Merge for (296, 331) already exists in the vocabulary.
|
210 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโ | 612/2599 [1:43:32<4:37:01, 8.37s/it]Merge for (300, 319) already exists in the vocabulary.
|
211 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโ | 613/2599 [1:43:40<4:35:10, 8.31s/it]Merge for (289, 2403) already exists in the vocabulary.
|
212 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโ | 614/2599 [1:43:48<4:34:45, 8.31s/it]Merge for (296, 326) already exists in the vocabulary.
|
213 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโ | 616/2599 [1:44:05<4:34:12, 8.30s/it]Merge for (310, 321) already exists in the vocabulary.
|
214 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโ | 619/2599 [1:44:30<4:36:34, 8.38s/it]Merge for (292, 327) already exists in the vocabulary.
|
215 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโโ | 626/2599 [1:45:28<4:33:09, 8.31s/it]Merge for (284, 333) already exists in the vocabulary.
|
216 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโโ | 633/2599 [1:46:26<4:31:57, 8.30s/it]Merge for (1003, 291) already exists in the vocabulary.
|
217 |
+
Training BPE: 25%|โโโโโโโโโโโโโโโโโโโ | 637/2599 [1:46:59<4:31:02, 8.29s/it]Merge for (295, 319) already exists in the vocabulary.
|
218 |
+
Training BPE: 25%|โโโโโโโโโโโโโโโโโโโ | 649/2599 [1:48:38<4:28:15, 8.25s/it]Merge for (278, 318) already exists in the vocabulary.
|
219 |
+
Training BPE: 25%|โโโโโโโโโโโโโโโโโโโโ | 660/2599 [1:50:09<4:27:54, 8.29s/it]Merge for (282, 328) already exists in the vocabulary.
|
220 |
+
Training BPE: 26%|โโโโโโโโโโโโโโโโโโโโ | 663/2599 [1:50:34<4:26:33, 8.26s/it]Merge for (313, 319) already exists in the vocabulary.
|
221 |
+
Training BPE: 26%|โโโโโโโโโโโโโโโโโโโโ | 671/2599 [1:51:40<4:24:56, 8.24s/it]Merge for (292, 321) already exists in the vocabulary.
|
222 |
+
Training BPE: 26%|โโโโโโโโโโโโโโโโโโโโ | 677/2599 [1:52:30<4:23:41, 8.23s/it]Merge for (292, 331) already exists in the vocabulary.
|
223 |
+
Training BPE: 27%|โโโโโโโโโโโโโโโโโโโโโ | 699/2599 [1:55:28<4:16:20, 8.09s/it]
|
224 |
+
Vocab size: 3100: , + เฐจเฐตเฐเฐฌเฐฐเฑ = , เฐจเฐตเฐเฐฌเฐฐเฑ
|
225 |
+
Training BPE: 27%|โโโโโโโโโโโโโโโโโโโโโ | 712/2599 [1:57:13<4:14:06, 8.08s/it]Merge for (306, 326) already exists in the vocabulary.
|
226 |
+
Training BPE: 28%|โโโโโโโโโโโโโโโโโโโโโ | 716/2599 [1:57:46<4:13:56, 8.09s/it]Merge for (296, 322) already exists in the vocabulary.
|
227 |
+
Training BPE: 28%|โโโโโโโโโโโโโโโโโโโโโ | 717/2599 [1:57:54<4:15:07, 8.13s/it]Merge for (277, 320) already exists in the vocabulary.
|
228 |
+
Training BPE: 29%|โโโโโโโโโโโโโโโโโโโโโโ | 749/2599 [2:02:11<4:06:19, 7.99s/it]Merge for (302, 333) already exists in the vocabulary.
|
229 |
+
Training BPE: 29%|โโโโโโโโโโโโโโโโโโโโโโ | 756/2599 [2:03:08<4:07:00, 8.04s/it]Merge for (287, 318) already exists in the vocabulary.
|
230 |
+
Training BPE: 29%|โโโโโโโโโโโโโโโโโโโโโโโ | 761/2599 [2:03:48<4:06:58, 8.06s/it]Merge for (299, 331) already exists in the vocabulary.
|
231 |
+
Training BPE: 29%|โโโโโโโโโโโโโโโโโโโโโโโ | 766/2599 [2:04:28<4:03:32, 7.97s/it]Merge for (292, 326) already exists in the vocabulary.
|
232 |
+
Training BPE: 30%|โโโโโโโโโโโโโโโโโโโโโโโ | 771/2599 [2:05:08<4:01:41, 7.93s/it]Merge for (803, 292) already exists in the vocabulary.
|
233 |
+
Training BPE: 30%|โโโโโโโโโโโโโโโโโโโโโโโ | 776/2599 [2:05:48<4:02:30, 7.98s/it]Merge for (306, 2457) already exists in the vocabulary.
|
234 |
+
Training BPE: 30%|โโโโโโโโโโโโโโโโโโโโโโโ | 783/2599 [2:06:44<4:04:33, 8.08s/it]Merge for (403, 292) already exists in the vocabulary.
|
235 |
+
Training BPE: 31%|โโโโโโโโโโโโโโโโโโโโโโโโ | 794/2599 [2:08:14<4:04:44, 8.14s/it]Merge for (309, 2403) already exists in the vocabulary.
|
236 |
+
Training BPE: 31%|โโโโโโโโโโโโโโโโโโโโโโโโ | 799/2599 [2:08:54<4:04:07, 8.14s/it]
|
237 |
+
Vocab size: 3200: เฑ + เฐ = เฑเฐ
|
238 |
+
Training BPE: 31%|โโโโโโโโโโโโโโโโโโโโโโโโ | 804/2599 [2:09:35<4:03:09, 8.13s/it]Merge for (291, 318) already exists in the vocabulary.
|
239 |
+
Training BPE: 31%|โโโโโโโโโโโโโโโโโโโโโโโโ | 807/2599 [2:09:59<4:02:25, 8.12s/it]Merge for (309, 321) already exists in the vocabulary.
|
240 |
+
Training BPE: 32%|โโโโโโโโโโโโโโโโโโโโโโโโ | 823/2599 [2:12:07<3:56:31, 7.99s/it]Merge for (923, 309) already exists in the vocabulary.
|
241 |
+
Training BPE: 32%|โโโโโโโโโโโโโโโโโโโโโโโโโ | 837/2599 [2:14:00<3:54:25, 7.98s/it]Merge for (309, 331) already exists in the vocabulary.
|
242 |
+
Training BPE: 32%|โโโโโโโโโโโโโโโโโโโโโโโโโ | 844/2599 [2:14:56<3:57:36, 8.12s/it]Merge for (289, 326) already exists in the vocabulary.
|
243 |
+
Training BPE: 33%|โโโโโโโโโโโโโโโโโโโโโโโโโ | 856/2599 [2:16:33<3:52:33, 8.01s/it]Merge for (313, 331) already exists in the vocabulary.
|
244 |
+
Training BPE: 33%|โโโโโโโโโโโโโโโโโโโโโโโโโโ | 868/2599 [2:18:09<3:51:08, 8.01s/it]Merge for (298, 2438) already exists in the vocabulary.
|
245 |
+
Training BPE: 34%|โโโโโโโโโโโโโโโโโโโโโโโโโโ | 871/2599 [2:18:33<3:51:14, 8.03s/it]Merge for (295, 333) already exists in the vocabulary.
|
246 |
+
Training BPE: 34%|โโโโโโโโโโโโโโโโโโโโโโโโโโ | 882/2599 [2:20:01<3:50:30, 8.05s/it]Merge for (298, 322) already exists in the vocabulary.
|
247 |
+
Training BPE: 34%|โโโโโโโโโโโโโโโโโโโโโโโโโโ | 889/2599 [2:20:58<3:48:58, 8.03s/it]Merge for (287, 331) already exists in the vocabulary.
|
248 |
+
Training BPE: 35%|โโโโโโโโโโโโโโโโโโโโโโโโโโโ | 899/2599 [2:22:18<3:48:23, 8.06s/it]
|
249 |
+
Vocab size: 3300: เฑเฐ + เฑ = เฑเฐเฑ
|
250 |
+
Training BPE: 35%|โโโโโโโโโโโโโโโโโโโโโโโโโโโ | 907/2599 [2:23:23<3:47:26, 8.07s/it]Merge for (299, 333) already exists in the vocabulary.
|
251 |
+
Training BPE: 35%|โโโโโโโโโโโโโโโโโโโโโโโโโโโ | 914/2599 [2:24:18<3:42:35, 7.93s/it]Merge for (1023, 309) already exists in the vocabulary.
|
252 |
+
Training BPE: 36%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 933/2599 [2:26:51<3:42:47, 8.02s/it]Merge for (300, 2403) already exists in the vocabulary.
|
253 |
+
Training BPE: 36%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 948/2599 [2:28:52<3:40:22, 8.01s/it]Merge for (279, 332) already exists in the vocabulary.
|
254 |
+
Training BPE: 37%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 974/2599 [2:32:20<3:38:58, 8.09s/it]Merge for (284, 328) already exists in the vocabulary.
|
255 |
+
Training BPE: 38%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 978/2599 [2:32:52<3:37:19, 8.04s/it]Merge for (279, 2414) already exists in the vocabulary.
|
256 |
+
Training BPE: 38%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 996/2599 [2:35:16<3:34:13, 8.02s/it]Merge for (282, 318) already exists in the vocabulary.
|
257 |
+
Training BPE: 38%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 999/2599 [2:35:40<3:33:57, 8.02s/it]
|
258 |
+
Vocab size: 3400: เฐต + เฑ = เฐตเฑ
|
259 |
+
Training BPE: 38%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1000/2599 [2:35:48<3:34:52, 8.06s/it]Merge for (284, 331) already exists in the vocabulary.
|
260 |
+
Training BPE: 39%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1019/2599 [2:38:19<3:27:43, 7.89s/it]Merge for (299, 326) already exists in the vocabulary.
|
261 |
+
Training BPE: 39%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1025/2599 [2:39:07<3:29:29, 7.99s/it]Merge for (307, 318) already exists in the vocabulary.
|
262 |
+
Training BPE: 40%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1034/2599 [2:40:19<3:25:44, 7.89s/it]Merge for (313, 320) already exists in the vocabulary.
|
263 |
+
Training BPE: 40%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1039/2599 [2:40:58<3:25:01, 7.89s/it]Merge for (983, 296) already exists in the vocabulary.
|
264 |
+
Training BPE: 40%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1048/2599 [2:42:10<3:26:27, 7.99s/it]Merge for (289, 327) already exists in the vocabulary.
|
265 |
+
Training BPE: 41%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1074/2599 [2:45:37<3:19:53, 7.86s/it]Merge for (287, 328) already exists in the vocabulary.
|
266 |
+
Training BPE: 42%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1097/2599 [2:48:39<3:18:38, 7.94s/it]Merge for (289, 328) already exists in the vocabulary.
|
267 |
+
Training BPE: 42%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1098/2599 [2:48:47<3:20:05, 8.00s/it]Merge for (302, 323) already exists in the vocabulary.
|
268 |
+
Training BPE: 42%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1099/2599 [2:48:55<3:19:12, 7.97s/it]
|
269 |
+
Vocab size: 3500: เฐฎ + เฑ = เฐฎเฑ
|
270 |
+
Training BPE: 43%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1114/2599 [2:50:55<3:18:19, 8.01s/it]Merge for (311, 327) already exists in the vocabulary.
|
271 |
+
Training BPE: 44%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1140/2599 [2:54:21<3:13:37, 7.96s/it]Merge for (294, 322) already exists in the vocabulary.
|
272 |
+
Training BPE: 45%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1158/2599 [2:56:44<3:11:07, 7.96s/it]Merge for (299, 320) already exists in the vocabulary.
|
273 |
+
Training BPE: 45%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1181/2599 [2:59:46<3:06:18, 7.88s/it]Merge for (983, 309) already exists in the vocabulary.
|
274 |
+
Training BPE: 46%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1199/2599 [3:02:09<3:04:53, 7.92s/it]
|
275 |
+
Vocab size: 3600: เฐธเฑเฐ + เฑ = เฐธเฑเฐเฑ
|
276 |
+
Training BPE: 47%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1210/2599 [3:03:37<3:04:37, 7.98s/it]Merge for (311, 318) already exists in the vocabulary.
|
277 |
+
Training BPE: 47%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1214/2599 [3:04:08<3:01:11, 7.85s/it]Merge for (300, 2412) already exists in the vocabulary.
|
278 |
+
Training BPE: 47%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1224/2599 [3:05:26<2:59:35, 7.84s/it]Merge for (282, 331) already exists in the vocabulary.
|
279 |
+
Training BPE: 47%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1226/2599 [3:05:42<3:00:33, 7.89s/it]Merge for (299, 328) already exists in the vocabulary.
|
280 |
+
Training BPE: 48%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1241/2599 [3:07:40<2:57:38, 7.85s/it]Merge for (307, 333) already exists in the vocabulary.
|
281 |
+
Training BPE: 48%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1254/2599 [3:09:23<2:57:35, 7.92s/it]Merge for (310, 327) already exists in the vocabulary.
|
282 |
+
Training BPE: 49%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1268/2599 [3:11:12<2:53:54, 7.84s/it]Merge for (923, 279) already exists in the vocabulary.
|
283 |
+
Training BPE: 49%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1274/2599 [3:11:59<2:50:48, 7.73s/it]Merge for (303, 321) already exists in the vocabulary.
|
284 |
+
Training BPE: 49%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1279/2599 [3:12:39<2:53:24, 7.88s/it]Merge for (284, 321) already exists in the vocabulary.
|
285 |
+
Training BPE: 49%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1280/2599 [3:12:47<2:53:37, 7.90s/it]Merge for (294, 331) already exists in the vocabulary.
|
286 |
+
Training BPE: 50%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1298/2599 [3:15:08<2:49:45, 7.83s/it]Merge for (923, 298) already exists in the vocabulary.
|
287 |
+
Training BPE: 50%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1299/2599 [3:15:16<2:49:19, 7.82s/it]
|
288 |
+
Vocab size: 3700: เฐฐเฑ + เฐช = เฐฐเฑเฐช
|
289 |
+
Training BPE: 50%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1306/2599 [3:16:11<2:48:35, 7.82s/it]Merge for (284, 322) already exists in the vocabulary.
|
290 |
+
Training BPE: 50%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1309/2599 [3:16:34<2:47:53, 7.81s/it]Merge for (310, 331) already exists in the vocabulary.
|
291 |
+
Training BPE: 51%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1314/2599 [3:17:13<2:48:02, 7.85s/it]Merge for (287, 327) already exists in the vocabulary.
|
292 |
+
Training BPE: 51%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1334/2599 [3:19:48<2:45:00, 7.83s/it]Merge for (282, 321) already exists in the vocabulary.
|
293 |
+
Training BPE: 52%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1341/2599 [3:20:43<2:42:21, 7.74s/it]Merge for (277, 332) already exists in the vocabulary.
|
294 |
+
Training BPE: 52%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1343/2599 [3:20:58<2:42:16, 7.75s/it]Merge for (277, 2412) already exists in the vocabulary.
|
295 |
+
Training BPE: 52%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1346/2599 [3:21:22<2:42:56, 7.80s/it]Merge for (282, 320) already exists in the vocabulary.
|
296 |
+
Training BPE: 52%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1349/2599 [3:21:45<2:41:49, 7.77s/it]Merge for (299, 2438) already exists in the vocabulary.
|
297 |
+
Training BPE: 54%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1392/2599 [3:27:18<2:36:01, 7.76s/it]Merge for (289, 2412) already exists in the vocabulary.
|
298 |
+
Training BPE: 54%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1399/2599 [3:28:12<2:33:45, 7.69s/it]
|
299 |
+
Vocab size: 3800: เฐซเฐฟเฐฐเฑเฐฏเฐพ + เฐฆเฑ = เฐซเฐฟเฐฐเฑเฐฏเฐพเฐฆเฑ
|
300 |
+
Training BPE: 55%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1421/2599 [3:31:01<2:29:57, 7.64s/it]Merge for (294, 330) already exists in the vocabulary.
|
301 |
+
Training BPE: 55%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1442/2599 [3:33:44<2:29:12, 7.74s/it]Merge for (313, 333) already exists in the vocabulary.
|
302 |
+
Training BPE: 57%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1478/2599 [3:38:20<2:23:52, 7.70s/it]Merge for (783, 312) already exists in the vocabulary.
|
303 |
+
Training BPE: 57%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1483/2599 [3:38:57<2:20:01, 7.53s/it]Merge for (1003, 288) already exists in the vocabulary.
|
304 |
+
Training BPE: 57%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1491/2599 [3:39:59<2:21:18, 7.65s/it]Merge for (703, 302) already exists in the vocabulary.
|
305 |
+
Training BPE: 58%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1499/2599 [3:41:01<2:20:24, 7.66s/it]
|
306 |
+
Vocab size: 3900: เฐฎเฐพเฐเฑเฐฒเฐพเฐก + เฐพเฐฐเฑ. = เฐฎเฐพเฐเฑเฐฒเฐพเฐกเฐพเฐฐเฑ.
|
307 |
+
Training BPE: 58%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1508/2599 [3:42:10<2:18:48, 7.63s/it]Merge for (312, 330) already exists in the vocabulary.
|
308 |
+
Training BPE: 60%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1565/2599 [3:49:23<2:10:01, 7.55s/it]Merge for (298, 2412) already exists in the vocabulary.
|
309 |
+
Training BPE: 61%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1585/2599 [3:51:55<2:07:34, 7.55s/it]Merge for (300, 328) already exists in the vocabulary.
|
310 |
+
Training BPE: 61%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1597/2599 [3:53:26<2:04:38, 7.46s/it]Merge for (296, 328) already exists in the vocabulary.
|
311 |
+
Training BPE: 62%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1599/2599 [3:53:41<2:05:58, 7.56s/it]
|
312 |
+
Vocab size: 4000: เฐคเฐฟ + เฐจเฐฟ = เฐคเฐฟเฐจเฐฟ
|
313 |
+
Training BPE: 62%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1611/2599 [3:55:11<2:01:21, 7.37s/it]Merge for (284, 2403) already exists in the vocabulary.
|
314 |
+
Training BPE: 62%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1613/2599 [3:55:25<2:00:58, 7.36s/it]Merge for (303, 331) already exists in the vocabulary.
|
315 |
+
Training BPE: 63%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1630/2599 [3:57:33<2:00:49, 7.48s/it]Merge for (543, 286) already exists in the vocabulary.
|
316 |
+
Training BPE: 63%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1639/2599 [3:58:40<1:59:28, 7.47s/it]Merge for (1023, 312) already exists in the vocabulary.
|
317 |
+
Training BPE: 64%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1653/2599 [4:00:24<1:57:20, 7.44s/it]Merge for (300, 327) already exists in the vocabulary.
|
318 |
+
Training BPE: 64%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1664/2599 [4:01:47<1:57:03, 7.51s/it]Merge for (295, 321) already exists in the vocabulary.
|
319 |
+
Training BPE: 65%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1698/2599 [4:05:59<1:51:53, 7.45s/it]Merge for (313, 321) already exists in the vocabulary.
|
320 |
+
Training BPE: 65%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1699/2599 [4:06:07<1:51:03, 7.40s/it]
|
321 |
+
Vocab size: 4100: เฐน + เฑ = เฐนเฑ
|
322 |
+
Training BPE: 67%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1750/2599 [4:12:21<1:42:27, 7.24s/it]Merge for (277, 2414) already exists in the vocabulary.
|
323 |
+
Training BPE: 69%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1799/2599 [4:18:21<1:36:27, 7.23s/it]
|
324 |
+
Vocab size: 4200: เฐชเฐพเฐฐเฑ + เฐเฑ = เฐชเฐพเฐฐเฑเฐเฑ
|
325 |
+
Training BPE: 70%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1822/2599 [4:21:08<1:34:43, 7.31s/it]Merge for (294, 326) already exists in the vocabulary.
|
326 |
+
Training BPE: 70%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1830/2599 [4:22:07<1:35:05, 7.42s/it]Merge for (300, 330) already exists in the vocabulary.
|
327 |
+
Training BPE: 72%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1862/2599 [4:26:01<1:29:42, 7.30s/it]Merge for (923, 311) already exists in the vocabulary.
|
328 |
+
Training BPE: 73%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1888/2599 [4:29:10<1:26:21, 7.29s/it]Merge for (299, 2403) already exists in the vocabulary.
|
329 |
+
Training BPE: 73%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1899/2599 [4:30:30<1:24:22, 7.23s/it]
|
330 |
+
Vocab size: 4300: เฐธเฐฎ + เฐฏเฐเฐฒเฑ = เฐธเฐฎเฐฏเฐเฐฒเฑ
|
331 |
+
Training BPE: 74%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1918/2599 [4:32:47<1:22:06, 7.23s/it]Merge for (310, 258) already exists in the vocabulary.
|
332 |
+
Training BPE: 74%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1929/2599 [4:34:06<1:20:24, 7.20s/it]Merge for (300, 332) already exists in the vocabulary.
|
333 |
+
Training BPE: 77%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1999/2599 [4:42:31<1:11:27, 7.15s/it]
|
334 |
+
Vocab size: 4400: เฐชเฑเฐฐ + เฐธเฐพ = เฐชเฑเฐฐเฐธเฐพ
|
335 |
+
Merge for (295, 320) already exists in the vocabulary.
|
336 |
+
Training BPE: 78%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2016/2599 [4:44:34<1:10:21, 7.24s/it]Merge for (923, 306) already exists in the vocabulary.
|
337 |
+
Training BPE: 78%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2019/2599 [4:44:55<1:08:38, 7.10s/it]Merge for (1064, 327) already exists in the vocabulary.
|
338 |
+
Training BPE: 79%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2043/2599 [4:47:47<1:06:17, 7.15s/it]Merge for (300, 322) already exists in the vocabulary.
|
339 |
+
Training BPE: 80%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2074/2599 [4:51:29<1:03:09, 7.22s/it]Merge for (943, 282) already exists in the vocabulary.
|
340 |
+
Training BPE: 81%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2099/2599 [4:54:26<58:57, 7.08s/it]
|
341 |
+
Vocab size: 4500: เฐ + เฐงเฑเฐฏเฐเฑเฐทเฑเฐกเฑ = เฐ เฐงเฑเฐฏเฐเฑเฐทเฑเฐกเฑ
|
342 |
+
Training BPE: 81%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2106/2599 [4:55:16<58:50, 7.16s/it]Merge for (299, 321) already exists in the vocabulary.
|
343 |
+
Training BPE: 82%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2140/2599 [4:59:19<55:56, 7.31s/it]Merge for (291, 2414) already exists in the vocabulary.
|
344 |
+
Training BPE: 83%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2158/2599 [5:01:26<52:14, 7.11s/it]Merge for (279, 327) already exists in the vocabulary.
|
345 |
+
Training BPE: 84%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2182/2599 [5:04:16<49:28, 7.12s/it]Merge for (311, 2414) already exists in the vocabulary.
|
346 |
+
Training BPE: 85%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2199/2599 [5:06:16<46:36, 6.99s/it]
|
347 |
+
Vocab size: 4600: เฐฆเฑ + เฐถเฐพ = เฐฆเฑเฐถเฐพ
|
348 |
+
Training BPE: 85%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2207/2599 [5:07:12<45:58, 7.04s/it]Merge for (312, 332) already exists in the vocabulary.
|
349 |
+
Training BPE: 86%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2234/2599 [5:10:23<43:07, 7.09s/it]Merge for (503, 283) already exists in the vocabulary.
|
350 |
+
Training BPE: 87%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2250/2599 [5:12:15<40:04, 6.89s/it]Merge for (310, 320) already exists in the vocabulary.
|
351 |
+
Training BPE: 87%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2258/2599 [5:13:11<40:00, 7.04s/it]Merge for (299, 318) already exists in the vocabulary.
|
352 |
+
Training BPE: 88%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2299/2599 [5:17:58<34:31, 6.91s/it]
|
353 |
+
Vocab size: 4700: เฐฐเฑ + เฐฒเฑ = เฐฐเฑเฐฒเฑ
|
354 |
+
Training BPE: 91%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2367/2599 [5:25:54<26:50, 6.94s/it]Merge for (843, 294) already exists in the vocabulary.
|
355 |
+
Training BPE: 92%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2399/2599 [5:29:38<23:13, 6.97s/it]
|
356 |
+
Vocab size: 4800: เฐธ + เฐฆ = เฐธเฐฆ
|
357 |
+
Training BPE: 93%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2414/2599 [5:31:22<21:35, 7.01s/it]Merge for (280, 318) already exists in the vocabulary.
|
358 |
+
Training BPE: 96%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2499/2599 [5:41:03<11:12, 6.73s/it]
|
359 |
+
Vocab size: 4900: เฐญเฐตเฐฟเฐท + เฑเฐฏ = เฐญเฐตเฐฟเฐทเฑเฐฏ
|
360 |
+
Training BPE: 96%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2505/2599 [5:41:43<10:27, 6.67s/it]Merge for (763, 309) already exists in the vocabulary.
|
361 |
+
Training BPE: 99%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 2574/2599 [5:49:29<02:50, 6.84s/it]Merge for (313, 332) already exists in the vocabulary.
|
362 |
+
Training BPE: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 2598/2599 [5:52:10<00:08, 8.13s/it]
|
363 |
+
|
364 |
+
Final statistics:
|
365 |
+
Final vocabulary size: 4,999
|
366 |
+
Number of merges: 2,599
|
367 |
+
Final compression ratio: 8.63x
|
368 |
+
Training time: 21135.62 seconds
|
369 |
+
|
370 |
+
Tokenizer mappings saved to telugu_tokenizer_vocab.json and telugu_tokenizer_merges.json
|
371 |
+
|
372 |
+
Test Results:
|
373 |
+
Original: เฐคเฑเฐฒเฑเฐเฑ เฐญเฐพเฐท
|
374 |
+
Encoded: [4149, 4717]
|
375 |
+
Decoded: เฐคเฑเฐฒเฑเฐเฑ เฐญเฐพเฐท
|
376 |
+
Matches original: True
|
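
The merge list recorded in telugu_tokenizer_merges.json (2,599 merges here) is what makes round-trips like the test above possible. As an illustration only: this is not the repository's actual API, and the on-disk JSON layout is assumed rather than known, but applying a learned merge list at encode time typically looks like the sketch below.

# Hedged sketch of BPE merge application at encode time.
# `ordered_merges` is assumed to be a list of ((left_id, right_id), new_id)
# pairs in the order they were learned; the real telugu_tokenizer_merges.json
# may store this information differently.
def apply_merges(ids, ordered_merges):
    for (left, right), new_id in ordered_merges:
        out, i = [], 0
        while i < len(ids):
            # merge every adjacent (left, right) pair into the new token ID
            if i + 1 < len(ids) and ids[i] == left and ids[i + 1] == right:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Example: with merges ((1, 2) -> 5) and then ((5, 3) -> 6),
# the sequence [1, 2, 3, 4] collapses to [6, 4].
print(apply_merges([1, 2, 3, 4], [((1, 2), 5), ((5, 3), 6)]))  # [6, 4]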