Chaitanya Sagar Gurujula committed on
Commit · 496ac89
1 Parent(s): bc28434

Add application file
Browse files
- Dockerfile +12 -0
- README.md +139 -6
- requirements.txt +9 -0
- src/__pycache__/bpe_tokenizer.cpython-312.pyc +0 -0
- src/app.py +123 -0
- src/bpe_tokenizer.py +660 -0
- src/templates/index.html +134 -0
- telugu_base_vocab.json +0 -0
- telugu_tokenizer_merges.json +0 -0
- telugu_tokenizer_vocab.json +0 -0
- training_logs.log +376 -0
Dockerfile
ADDED
@@ -0,0 +1,12 @@
+FROM python:3.9-slim
+
+WORKDIR /app
+
+COPY requirements.txt .
+RUN pip install -r requirements.txt
+
+COPY src/ .
+COPY telugu_tokenizer_vocab.json .
+COPY telugu_tokenizer_merges.json .
+
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md
CHANGED
@@ -1,11 +1,144 @@
 ---
-title: Telugu Tokenizer
-emoji:
-colorFrom:
-colorTo:
+title: Telugu Tokenizer App
+emoji: เฐ
+colorFrom: indigo
+colorTo: blue
 sdk: docker
+sdk_version: "1.0"
+app_file: app:app
 pinned: false
-
+description: A tokenizer app for Telugu text. It uses BPE (Byte Pair Encoding) to tokenize Telugu text with a 5K vocabulary.
+tags:
+- telugu
+- tokenizer
+- NLP
+- transformers
+license: apache-2.0
+model: telugu-tokenizer-model
+datasets:
+- telugu-dataset
+isPrivate: false
 ---
 
-
+# Telugu Tokenizer
+
+This repository provides a tokenizer implementation for processing Telugu text, designed to handle both Telugu Unicode characters and ASCII characters. It uses a Byte Pair Encoding (BPE) approach to efficiently tokenize text and build a vocabulary optimized for Telugu language processing.
+
+## Features
+
+- **Comprehensive Telugu Support**: Includes all Telugu Unicode characters (0C00-0C7F), common ligatures, and valid consonant combinations.
+- **Base Vocabulary Creation**: Generates a base vocabulary containing ASCII, Extended ASCII, and Telugu characters.
+- **Byte Pair Encoding (BPE)**: Trains the tokenizer to merge frequently occurring token pairs, creating an optimized vocabulary.
+- **Parallel Processing**: Uses multiprocessing for efficient tokenization of large text datasets.
+- **Persistence**: Supports saving and loading the vocabulary to/from JSON files.
+
+## Requirements
+
+The tokenizer requires the following dependencies:
+
+- Python 3.7+
+- tqdm
+- pandas
+- datasets
+
+Install the required packages using pip:
+```bash
+pip install tqdm pandas datasets
+```
+
+## Usage
+
+### 1. Base Vocabulary Creation
+
+The tokenizer first generates a base vocabulary containing ASCII, Extended ASCII, and Telugu characters.
+
+```python
+from telugu_tokenizer import create_base_vocab, save_base_vocab
+
+base_vocab = create_base_vocab()
+save_base_vocab(base_vocab, path='telugu_base_vocab.json')
+```
+
+### 2. Loading an Existing Vocabulary
+
+You can load an existing base vocabulary from a JSON file:
+
+```python
+from telugu_tokenizer import load_base_vocab
+
+vocab = load_base_vocab('telugu_base_vocab.json')
+```
+
+### 3. Training the Tokenizer
+
+The `BPETokenizer` class can be used to train a tokenizer on a given text input:
+
+```python
+from telugu_tokenizer import BPETokenizer
+
+text = "మీరు ఎలా ఉన్నారు?"  # Sample Telugu text
+tokenizer = BPETokenizer(vocab_size=5000)
+tokenizer.fit(text)
+```
+
+### 4. Saving and Loading the Tokenizer
+
+After training, save the tokenizer's vocabulary and merges:
+
+```python
+tokenizer.save('telugu_tokenizer')
+```
+
+Load the trained tokenizer:
+
+```python
+tokenizer.load('telugu_tokenizer')
+```
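Once loaded, the tokenizer round-trips text through `encode` and `decode`, both methods of `BPETokenizer`. A minimal sketch (import path as in the examples above), assuming the saved `telugu_tokenizer_*.json` files from the previous step are present; the token IDs you get back depend on the trained vocabulary:

```python
from telugu_tokenizer import BPETokenizer

tokenizer = BPETokenizer(vocab_size=5000)
tokenizer.load('telugu_tokenizer')

text = "తెలుగు భాష"
token_ids = tokenizer.encode(text)    # list of integer token IDs
decoded = tokenizer.decode(token_ids)
assert decoded == text
```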
+
+## Telugu Unicode Support
+
+The tokenizer covers the full range of Telugu Unicode characters, including vowels, consonants, vowel signs, digits, and fraction symbols. Additionally, it supports:
+
+- Common ligatures formed with Telugu consonants and vowel signs.
+- Valid consonant combinations (consonant + virama + consonant clusters), generated as sketched below.
+
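The consonant combinations referred to above are clusters of the form consonant + virama (్) + consonant. A small illustrative sketch of how such clusters can be enumerated; the consonant list here is a short sample, not the full set used by the base vocabulary:

```python
# Illustrative only: enumerate consonant + virama + consonant clusters.
consonants = ['క', 'త', 'ర', 'ల']   # sample consonants, not the full set
virama = '\u0c4d'                    # ్
clusters = [c1 + virama + c2 for c1 in consonants for c2 in consonants]
print(clusters[:4])                  # ['క్క', 'క్త', 'క్ర', 'క్ల']
```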
+## File Structure
+
+- **`bpe_tokenizer.py`**: Contains the implementation of the Telugu tokenizer.
+- **`telugu_base_vocab.json`**: JSON file storing the base vocabulary.
+- **`telugu_tokenizer_vocab.json`**: JSON file storing the trained vocabulary and merges (generated after training).
+
+## Results
+
+- **Final vocabulary size**: 4,999
+- **Final compression ratio**: 8.63x (UTF-8 bytes of the training text divided by the number of tokens after merging)
+
+## Logs
+- [View Training Logs](./training_logs.log)
+
+## Performance
+
+The tokenizer uses multiprocessing to handle large datasets efficiently. It processes text in chunks and merges token pairs iteratively to grow the vocabulary up to the target size. This is a simple implementation and can be improved for large-scale datasets.
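The chunk-and-pool pattern described above looks roughly like the following sketch (simplified from `fit()` in `src/bpe_tokenizer.py`; `tokenize_chunk` is a stand-in for the real per-chunk worker, which maps byte sequences to base-vocabulary token IDs):

```python
import os
from multiprocessing import Pool

def tokenize_chunk(chunk):
    # Stand-in worker: the real implementation looks up byte sequences
    # in the base vocabulary and returns their token IDs.
    return list(chunk)

def tokenize_parallel(data: bytes):
    num_cores = os.cpu_count() or 1
    # Larger chunks keep per-process overhead low.
    chunk_size = max(64 * 1024, len(data) // (num_cores * 4))
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Run under `if __name__ == "__main__":` on spawn-based platforms.
    with Pool(num_cores) as pool:
        results = pool.map(tokenize_chunk, chunks)
    return [tok for part in results for tok in part]
```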
+## Future Enhancements
+
+- Extend support for additional Telugu ligatures and symbols.
+- Optimize BPE training for large-scale datasets.
+- Provide pre-trained models for common Telugu NLP tasks.
+
+## License
+
+This project is licensed under the MIT License. See the LICENSE file for more details.
+
+## Contributing
+
+Contributions are welcome! Feel free to submit a pull request or open an issue if you encounter bugs or have suggestions for improvement.
+
+## Acknowledgments
+
+- Unicode Consortium for Telugu Unicode character information.
+- Community contributions to Telugu NLP development.
+
+---
+
+Feel free to explore the tokenizer and adapt it for your Telugu language processing needs. Happy coding!
+
requirements.txt
ADDED
@@ -0,0 +1,9 @@
+fastapi==0.68.0
+uvicorn==0.15.0
+jinja2==3.0.1
+python-multipart==0.0.5
+datasets==2.12.0
+tqdm==4.65.0
+aiofiles==0.8.0
+python-multipart==0.0.5
+pandas==2.2.3
src/__pycache__/bpe_tokenizer.cpython-312.pyc
ADDED
Binary file (42.6 kB).
src/app.py
ADDED
@@ -0,0 +1,123 @@
+from fastapi import FastAPI, Request
+from fastapi.responses import HTMLResponse
+from fastapi.templating import Jinja2Templates
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+from bpe_tokenizer import BPETokenizer, create_base_vocab
+import os
+import json
+
+# Get the absolute path to the templates directory
+TEMPLATES_DIR = os.path.join(os.path.dirname(__file__), "templates")
+
+app = FastAPI(title="Telugu BPE Tokenizer")
+
+# Add CORS middleware
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+# Templates with absolute path
+templates = Jinja2Templates(directory=TEMPLATES_DIR)
+
+# Initialize tokenizer
+tokenizer = BPETokenizer(vocab_size=5000)
+
+# Load the vocabulary file directly
+print("Loading vocabulary...")
+vocab_file = 'telugu_tokenizer_vocab.json'
+with open(vocab_file, 'r', encoding='utf-8') as f:
+    vocab_data = json.load(f)
+
+class TokenizeRequest(BaseModel):
+    text: str
+
+@app.get("/", response_class=HTMLResponse)
+async def home(request: Request):
+    return templates.TemplateResponse(
+        "index.html",
+        {"request": request, "title": "Telugu BPE Tokenizer"}
+    )
+
+@app.post("/tokenize")
+async def tokenize(request: TokenizeRequest):
+    text = request.text
+    try:
+        tokens = tokenizer.encode(text)
+        decoded = tokenizer.decode(tokens)
+
+        # Get token details from vocabulary for display
+        token_details = []
+        current_position = 0
+        current_byte_position = 0
+        text_bytes = text.encode('utf-8')
+
+        while current_position < len(tokens):
+            # Skip leading spaces in original text
+            while current_byte_position < len(text_bytes) and text_bytes[current_byte_position] == 32:
+                current_byte_position += 1
+
+            # Get next word from original text
+            word_start = current_byte_position
+            word_end = word_start
+            while word_end < len(text_bytes) and text_bytes[word_end] != 32:
+                word_end += 1
+
+            word_bytes = text_bytes[word_start:word_end]
+            word = word_bytes.decode('utf-8')
+
+            # Collect tokens for this word
+            word_tokens = []
+            decoded_bytes = b''
+
+            while current_position < len(tokens):
+                token = tokens[current_position]
+                token_bytes = tokenizer.vocab[token]
+
+                # If we've collected enough bytes for the word (plus possible space)
+                if len(decoded_bytes) >= len(word_bytes):
+                    break
+
+                word_tokens.append(token)
+                decoded_bytes += token_bytes
+                current_position += 1
+
+            # Update byte position for next word
+            current_byte_position = word_end
+
+            # Add word and its tokens to details
+            token_details.append({
+                "word": word,
+                "type": "subword_tokens",
+                "tokens": [{
+                    "id": t,
+                    "text": vocab_data.get(str(t), {}).get('text', '[UNKNOWN]')
+                } for t in word_tokens]
+            })
+
+        return {
+            "original": text,
+            "tokens": tokens,
+            "token_details": token_details,
+            "decoded": decoded,
+            "matches": text == decoded
+        }
+    except Exception as e:
+        print(f"Error: {str(e)}")
+        return {"error": str(e)}
+
+@app.get("/vocab")
+async def get_vocab():
+    return {
+        "vocab_size": len(vocab_data),
+        "base_vocab_size": sum(1 for info in vocab_data.values() if info.get('is_base', False)),
+        "num_merges": len(getattr(tokenizer, 'merges', {}))
+    }
+
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="127.0.0.1", port=8001)
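With the container from the Dockerfile above running (uvicorn serves the app on port 7860), the endpoints defined in this file can be exercised from the standard library alone. A minimal sketch; the host and port are assumptions about your local setup:

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # assumed local address of the running Space

payload = json.dumps({"text": "తెలుగు భాష"}).encode("utf-8")
req = urllib.request.Request(
    f"{BASE}/tokenize",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
print(result["tokens"], result["matches"])

with urllib.request.urlopen(f"{BASE}/vocab") as resp:
    print(json.load(resp))  # {"vocab_size": ..., "base_vocab_size": ..., "num_merges": ...}
```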
src/bpe_tokenizer.py
ADDED
@@ -0,0 +1,660 @@
+from tqdm import tqdm
+from collections import Counter
+import json
+from datasets import load_dataset
+import time
+import os
+import re
+import pandas as pd
+from multiprocessing import Pool
+import array
+
+def get_telugu_char_info():
+    """
+    Returns a dictionary of Telugu Unicode ranges with their descriptions.
+    Based on Unicode 13.0 Telugu block (0C00-0C7F).
+    """
+    return {
+        (0x0C00, 0x0C03): "Various forms of Telugu anusvara and visarga",
+        (0x0C05, 0x0C14): "Telugu vowels (అ to ఔ)",
+        (0x0C15, 0x0C39): "Telugu consonants (క to హ)",
+        (0x0C3D, 0x0C44): "Telugu vowel signs (ఽ to ౄ)",
+        (0x0C46, 0x0C48): "Telugu vowel signs (ె to ై)",
+        (0x0C4A, 0x0C4D): "Telugu vowel signs and virama (ొ to ్)",
+        (0x0C55, 0x0C56): "Telugu length marks",
+        (0x0C58, 0x0C5A): "Additional Telugu consonants",
+        (0x0C60, 0x0C63): "Telugu vocalic letters",
+        (0x0C66, 0x0C6F): "Telugu digits (౦ to ౯)",
+        (0x0C78, 0x0C7F): "Telugu fraction symbols"
+    }
+
+def create_base_vocab():
+    """Create a base vocabulary with ASCII, Telugu characters, and common ligatures."""
+    vocab = {}
+    token_id = 0
+    existing_tokens = set()  # Set to track existing tokens
+
+    # Add ASCII characters (0-127)
+    print("Adding ASCII characters...")
+    for i in range(128):
+        char_bytes = bytes([i])
+        try:
+            char = char_bytes.decode('utf-8', errors='strict')
+            vocab[token_id] = {
+                'text': char,
+                'bytes': list(char_bytes),
+                'type': 'ASCII',
+                'description': f"ASCII character: {repr(char)}"
+            }
+            token_id += 1
+        except UnicodeDecodeError:
+            continue
+
+    # Add Extended ASCII characters (128-255)
+    print("Adding Extended ASCII characters...")
+    for i in range(128, 256):
+        char_bytes = bytes([i])
+        try:
+            # Try to decode as UTF-8 first
+            char = char_bytes.decode('utf-8', errors='strict')
+            vocab[token_id] = {
+                'text': char if char.isprintable() else f"<{hex(i)[2:].upper()}>",
+                'bytes': list(char_bytes),
+                'type': 'Extended ASCII',
+                'description': f"Extended ASCII character: {char} ({hex(i)})"
+            }
+        except UnicodeDecodeError:
+            # If not valid UTF-8, store as bytes representation
+            vocab[token_id] = {
+                'text': f"[Bytes: {list(char_bytes)}]",
+                'bytes': list(char_bytes),
+                'type': 'Extended ASCII',
+                'description': f"Extended ASCII byte: {hex(i)}"
+            }
+        token_id += 1
+
+    # Add Telugu Unicode characters (0C00-0C7F)
+    print("Adding Telugu characters...")
+    telugu_info = get_telugu_char_info()
+
+    for i in range(0x0C00, 0x0C7F + 1):
+        try:
+            char = chr(i)
+            char_bytes = char.encode('utf-8')
+            # Only add if it's a valid character
+            char.encode('utf-8').decode('utf-8')
+
+            # Find the character's category
+            char_type = "Other Telugu Character"
+            char_description = "Telugu character"
+            for (start, end), desc in telugu_info.items():
+                if start <= i <= end:
+                    char_type = desc
+                    char_description = f"Telugu character: {char} ({hex(i)})"
+                    break
+
+            vocab[token_id] = {
+                'text': char,
+                'bytes': list(char_bytes),
+                'type': char_type,
+                'description': char_description
+            }
+            token_id += 1
+        except UnicodeEncodeError:
+            continue
+
+    # Define Telugu consonants and vowel signs
+    consonants = [
+        'క', 'ఖ', 'గ', 'ఘ', 'ఙ', 'చ', 'ఛ', 'జ', 'ఝ', 'ఞ',
+        'ట', 'ఠ', 'డ', 'ఢ', 'ణ', 'త', 'థ', 'ద', 'ధ', 'న',
+        'ప', 'ఫ', 'బ', 'భ', 'మ', 'య', 'ర', 'ల', 'వ', 'శ',
+        'ష', 'స', 'హ', 'ళ', 'క్ష', 'ఱ'
+    ]
+
+    vowel_signs = [
+        '', 'ా', 'ి', 'ీ', 'ు', 'ూ', 'ృ', 'ౄ', 'ౢ', 'ౣ', 'ె', 'ే', 'ై', 'ొ', 'ో', 'ౌ', 'ం', 'ః', 'ఁ', '్'
+    ]
+
+
+    # Add common Telugu ligatures with existing vowel signs
+    print("Adding common Telugu ligatures with existing vowel signs...")
+    for consonant in consonants:
+        for vowel_sign in vowel_signs:
+            ligature = consonant + vowel_sign
+            if ligature not in existing_tokens:  # Check for duplicates
+                char_bytes = ligature.encode('utf-8')
+                vocab[token_id] = {
+                    'text': ligature,
+                    'bytes': list(char_bytes),
+                    'type': 'Ligature',
+                    'description': f"Telugu ligature: {ligature}"
+                }
+                existing_tokens.add(ligature)  # Add to the set
+                token_id += 1
+
+    # Add valid consonant combinations
+    print("Adding valid consonant combinations...")
+    # Consonant + virama (్) + consonant clusters. The original file spelled out
+    # the full cross product as a literal list; it is generated here instead.
+    valid_consonant_combinations = [
+        c1 + '్' + c2 for c1 in consonants for c2 in consonants
+        # Add more valid combinations as needed
+    ]
+
+    for combination in valid_consonant_combinations:
+        if combination not in existing_tokens:  # Check for duplicates
+            char_bytes = combination.encode('utf-8')
+            vocab[token_id] = {
+                'text': combination,
+                'bytes': list(char_bytes),
+                'type': 'Ligature',
+                'description': f"Telugu ligature: {combination}"
+            }
+            existing_tokens.add(combination)  # Add to the set
+            token_id += 1
+
+    print(f"Created base vocabulary with {len(vocab)} tokens")
+    return vocab
+
+def save_base_vocab(vocab, path='telugu_base_vocab.json'):
+    """Save the base vocabulary with character information."""
+    # Sort by character type for better readability
+    sorted_vocab = {}
+    for k, v in sorted(vocab.items(), key=lambda x: (x[1]['type'], x[0])):
+        sorted_vocab[str(k)] = v
+
+    with open(path, 'w', encoding='utf-8') as f:
+        json.dump(sorted_vocab, f, ensure_ascii=False, indent=2)
+    print(f"Base vocabulary saved to {path}")
+
+def load_base_vocab(path='telugu_base_vocab.json'):
+    """Load the base vocabulary."""
+    with open(path, 'r', encoding='utf-8') as f:
+        vocab = json.load(f)
+    return {int(k): bytes(v['bytes']) for k, v in vocab.items()}
+
+class BPETokenizer:
+    def __init__(self, vocab_size=5000, sample_size=None):
+        self.vocab_size = vocab_size
+        self.sample_size = sample_size
+
+        # First try to load trained vocabulary
+        trained_vocab_path = 'telugu_tokenizer_vocab.json'
+        if os.path.exists(trained_vocab_path):
+            print("Loading trained vocabulary...")
+            self.load('telugu_tokenizer')  # This loads both vocab and merges
+            return
+
+        # If no trained vocab exists, fall back to base vocabulary
+        base_vocab_path = 'telugu_base_vocab.json'
+        if os.path.exists(base_vocab_path):
+            print("Loading existing base vocabulary...")
+            self.vocab = load_base_vocab(base_vocab_path)
+        else:
+            print("Creating new base vocabulary...")
+            base_vocab = create_base_vocab()
+            save_base_vocab(base_vocab)
+            self.vocab = load_base_vocab(base_vocab_path)
+
+        self.base_vocab_size = len(self.vocab)
+        self.merges = {}
+
+    def get_stats(self, ids):
+        """Count token pair frequencies."""
+        counts = {}
+        for pair in zip(ids, ids[1:]):
+            counts[pair] = counts.get(pair, 0) + 1
+        return counts
+
+    def merge(self, ids, pair, idx):
+        """Merge all occurrences of a token pair."""
+        # Create the merged token
+        merged_token = self.vocab[pair[0]] + self.vocab[pair[1]]
+
+        # Check if the merged token already exists in the vocabulary
+        for existing_id, existing_token in self.vocab.items():
+            if existing_token == merged_token:
+                # Instead of skipping, use the existing token ID for merging
+                print(f"Merge for {pair} already exists in the vocabulary.")
+                newids = []
+                i = 0
+                while i < len(ids):
+                    if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
+                        newids.append(existing_id)
+                        i += 2
+                    else:
+                        newids.append(ids[i])
+                        i += 1
+                return newids
+
+        # If we get here, the merged token doesn't exist yet
+        newids = []
+        i = 0
+        while i < len(ids):
+            if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
+                newids.append(idx)
+                i += 2
+            else:
+                newids.append(ids[i])
+                i += 1
+        return newids
+
+    def _process_chunk(self, args):
+        """Process a chunk of text for parallel processing."""
+        chunk, byte_to_token = args
+        ids = array.array('I')  # Unsigned int array
+        j = 0
+        while j < len(chunk):
+            if chunk[j] == 32:  # Space
+                ids.append(32)
+                j += 1
+                continue
+
+            found = False
+            for length in [3, 2, 1]:
+                if j + length <= len(chunk):
+                    char_bytes = bytes(chunk[j:j+length])
+                    if char_bytes in byte_to_token:
+                        ids.append(byte_to_token[char_bytes])
+                        j += length
+                        found = True
+                        break
+            if not found:
+                j += 1
+        return ids
+
+    def fit(self, text):
+        """Train the BPE tokenizer."""
+        print("Converting text to token IDs using base vocabulary...")
+
+        original_bytes = text.encode('utf-8')
+        original_length = len(original_bytes)
+        print(f"\nBefore training: text bytes length: {original_length:,}")
+
+        # Pre-compute byte sequences for faster lookup
+        byte_to_token = {token_bytes: token_id for token_id, token_bytes in self.vocab.items()}
+
+        # Parallel processing of chunks
+        num_cores = os.cpu_count() or 1
+        chunk_size = max(1024 * 64, len(original_bytes) // (num_cores * 4))  # Larger chunks
+        chunks = [original_bytes[i:i + chunk_size] for i in range(0, len(original_bytes), chunk_size)]
+
+        print(f"Processing {len(chunks)} chunks using {num_cores} cores...")
+
+        # Process chunks in parallel
+        with Pool(num_cores) as pool:
+            chunk_results = list(tqdm(
+                pool.imap(self._process_chunk, [(chunk, byte_to_token) for chunk in chunks]),
+                total=len(chunks),
+                desc="Initial tokenization"
+            ))
+
+        # Combine results
+        ids = array.array('I')
+        for result in chunk_results:
+            ids.extend(result)
+
+        print(f"\nBase vocabulary size: {self.base_vocab_size}")
+        print(f"Initial sequence length: {len(ids)}")
+
+        # Keep training until we reach the target vocab size
+        target_vocab_size = self.vocab_size
+        pbar = tqdm(total=target_vocab_size - self.base_vocab_size, desc="Training BPE")
+        last_vocab_size = len(self.vocab)
+
+        while len(self.vocab) < target_vocab_size:
+            stats = self.get_stats(ids)
+            if not stats:
+                print("No more pairs to merge.")
+                break
+
+            pair = max(stats, key=stats.get)
+            idx = len(self.vocab)
+            ids = self.merge(ids, pair, idx)
+
+            # Only update progress when vocabulary actually grows
+            if len(self.vocab) > last_vocab_size:
+                pbar.update(len(self.vocab) - last_vocab_size)
+                last_vocab_size = len(self.vocab)
+
+            # Add the merged token to the vocabulary
+            if pair not in self.merges:  # Ensure we don't overwrite existing merges
+                self.merges[pair] = idx
+                self.vocab[idx] = self.vocab[pair[0]] + self.vocab[pair[1]]
+
+            # Print progress periodically
+            if len(self.vocab) % 100 == 0:
+                try:
+                    text0 = self.vocab[pair[0]].decode('utf-8')
+                    text1 = self.vocab[pair[1]].decode('utf-8')
+                    merged = self.vocab[idx].decode('utf-8')
+                    print(f"\nVocab size: {len(self.vocab)}: {text0} + {text1} = {merged}")
+                except UnicodeDecodeError:
+                    continue
+
+        pbar.close()
+        print("\nFinal statistics:")
+        print(f"Final vocabulary size: {len(self.vocab):,}")
+        print(f"Number of merges: {len(self.merges):,}")
+        print(f"Final compression ratio: {original_length / len(ids):.2f}x")
+
+    def encode(self, text):
+        """Encode text to token IDs."""
+        final_tokens = []
+        i = 0
+        text_bytes = text.encode('utf-8')
+
+        while i < len(text_bytes):
+            # If we're at a leading space, encode it separately
+            if text_bytes[i] == 32:  # ASCII space
+                final_tokens.append(32)  # Space token
+                i += 1
+                continue
+
+            # Try to find the longest matching sequence (including potential trailing spaces)
+            longest_match = None
+            longest_length = 0
+            matched_token = None
+
+            # Sort vocab items by length (longest first)
+            for token_id, token_bytes in sorted(self.vocab.items(),
+                                                key=lambda x: len(x[1]),
+                                                reverse=True):
+                if (i + len(token_bytes) <= len(text_bytes) and
+                    text_bytes[i:i+len(token_bytes)] == token_bytes):
+                    longest_length = len(token_bytes)
+                    longest_match = token_bytes
+                    matched_token = token_id
+                    break
+
+            if longest_match:
+                final_tokens.append(matched_token)
+                i += longest_length
+            else:
+                # If no match found, fall back to single byte
+                for token_id, token_bytes in self.vocab.items():
+                    if token_bytes == bytes([text_bytes[i]]):
+                        final_tokens.append(token_id)
+                        break
+                i += 1
+
+        return final_tokens
+
+    def decode(self, tokens):
+        """Decode token IDs back to text."""
+        bytes_tokens = b''.join(self.vocab[idx] for idx in tokens)
+        return bytes_tokens.decode('utf-8')
+
+    def save(self, path):
+        """Save the tokenizer mappings to files."""
+        base_path = path.rsplit('.', 1)[0]
+
+        # Save vocabulary with human-readable form
+        vocab_mapping = {}
+        for token_id, byte_seq in self.vocab.items():
+            try:
+                text = byte_seq.decode('utf-8')
+                vocab_mapping[token_id] = {
+                    'text': text,
+                    'bytes': list(byte_seq),
+                    'is_base': token_id < self.base_vocab_size
+                }
+            except UnicodeDecodeError:
+                vocab_mapping[token_id] = {
+                    'text': f"[Bytes: {list(byte_seq)}]",
+                    'bytes': list(byte_seq),
+                    'is_base': token_id < self.base_vocab_size
+                }
+
+        # Save merge patterns with human-readable form
+        merge_patterns = {}
+        for (p0, p1), idx in self.merges.items():
+            try:
+                text0 = self.vocab[p0].decode('utf-8')
+                text1 = self.vocab[p1].decode('utf-8')
+                merged = self.vocab[idx].decode('utf-8')
+                merge_patterns[idx] = {
+                    'parts': [text0, text1],
+                    'result': merged,
+                    'token_ids': [p0, p1]
+                }
+            except UnicodeDecodeError:
+                merge_patterns[idx] = {
+                    'parts': [f"Token_{p0}", f"Token_{p1}"],
+                    'result': f"Token_{idx}",
+                    'token_ids': [p0, p1]
+                }
+
+        with open(f"{base_path}_vocab.json", 'w', encoding='utf-8') as f:
+            json.dump(vocab_mapping, f, ensure_ascii=False, indent=2)
+
+        with open(f"{base_path}_merges.json", 'w', encoding='utf-8') as f:
+            json.dump(merge_patterns, f, ensure_ascii=False, indent=2)
+
+        print(f"\nTokenizer mappings saved to {base_path}_vocab.json and {base_path}_merges.json")
+
+    def load(self, path):
+        """Load the tokenizer from mapping files."""
+        base_path = path.rsplit('.', 1)[0]
+
+        with open(f"{base_path}_vocab.json", 'r', encoding='utf-8') as f:
+            vocab_mapping = json.load(f)
+            self.vocab = {
+                int(k): bytes(v['bytes'])
+                for k, v in vocab_mapping.items()
+            }
+            # Find base vocabulary size
+            self.base_vocab_size = sum(1 for k, v in vocab_mapping.items() if v['is_base'])
+
+        with open(f"{base_path}_merges.json", 'r', encoding='utf-8') as f:
+            merge_patterns = json.load(f)
+            self.merges = {
+                tuple(v['token_ids']): int(k)
+                for k, v in merge_patterns.items()
+            }
+
+        self.vocab_size = len(self.vocab)
+        print(f"Loaded tokenizer from {base_path}_*.json files")
+
+    def train_on_dataset(self):
+        """Train tokenizer on the Telugu news dataset."""
+        print("Loading dataset...")
+        try:
+            # Load the local parquet file
+            dataset = pd.read_parquet('telugu_news_dataset.parquet')
+
+            print("Preparing training text...")
+            training_text = []
+
+            for _, row in tqdm(dataset.iterrows(), desc="Loading documents", total=len(dataset)):
+                if not pd.isna(row["headline"]): training_text.append(row["headline"])
+                if not pd.isna(row["article"]): training_text.append(row["article"])
+
+                if self.sample_size and len(training_text) >= self.sample_size:
+                    print(f"Using first {self.sample_size} documents for training")
+                    break
+
+            full_text = "\n".join(training_text)
+            print(f"\nTraining on {len(training_text)} documents...")
+            print(f"Total characters in training data: {len(full_text):,}")
+
+            start_time = time.time()
+            self.fit(full_text)
+            print(f"Training time: {time.time() - start_time:.2f} seconds")
+
+        except Exception as e:
+            print(f"Error loading dataset: {str(e)}")
+            print("Falling back to sample text...")
+            sample_text = """
+            తెలుగు భాష దక్షిణ భారతదేశంలోని ద్రావిడ భాషలలో ఒకటి.
+            ఆంధ్ర ప్రదేశ్ మరియు తెలంగాణ రాష్ట్రాల అధికార భాష.
+            """
+            self.fit(sample_text)
+
+
+if __name__ == "__main__":
+    # For quick testing, use a small sample
+    tokenizer = BPETokenizer(vocab_size=4999, sample_size=None)
+
+    vocab_file = 'telugu_tokenizer_vocab.json'
+    merges_file = 'telugu_tokenizer_merges.json'
+
+    if os.path.exists(vocab_file) and os.path.exists(merges_file):
+        print("Loading pre-trained tokenizer...")
+        tokenizer.load('telugu_tokenizer')
+    else:
+        print("Training new tokenizer...")
+        tokenizer.train_on_dataset()
+        tokenizer.save('telugu_tokenizer')
+
+    # Test the tokenizer
+    test_text = "తెలుగు భాష"
+    encoded = tokenizer.encode(test_text)
+    decoded = tokenizer.decode(encoded)
+
+    print("\nTest Results:")
+    print(f"Original: {test_text}")
+    print(f"Encoded: {encoded}")
+    print(f"Decoded: {decoded}")
+    print(f"Matches original: {test_text == decoded}")
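To make the training loop above concrete, here is a toy walkthrough of one merge step using the same pair-counting and pair-replacement logic as `get_stats()` / `merge()`; the token IDs are made up for illustration, and the duplicate-token check from the real `merge()` is omitted:

```python
def get_stats(ids):
    # Count frequencies of adjacent token pairs, as in BPETokenizer.get_stats.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # Replace every occurrence of `pair` with the new token ID `idx`.
    newids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

ids = [10, 11, 10, 11, 12, 10, 11]   # hypothetical token IDs
stats = get_stats(ids)                # {(10, 11): 3, (11, 10): 1, (11, 12): 1, (12, 10): 1}
best = max(stats, key=stats.get)      # (10, 11), the most frequent adjacent pair
print(merge(ids, best, 99))           # [99, 99, 12, 99]
```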
src/templates/index.html
ADDED
@@ -0,0 +1,134 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
<!DOCTYPE html>
|
2 |
+
<html>
|
3 |
+
<head>
|
4 |
+
<title>{{ title }}</title>
|
5 |
+
<script src="https://cdn.tailwindcss.com"></script>
|
6 |
+
</head>
|
7 |
+
<body class="bg-gray-100">
|
8 |
+
<div class="container mx-auto px-4 py-8">
|
9 |
+
<h1 class="text-3xl font-bold mb-8">Telugu BPE Tokenizer</h1>
|
10 |
+
|
11 |
+
<div class="bg-white rounded-lg shadow p-6">
|
12 |
+
<textarea
|
13 |
+
id="input-text"
|
14 |
+
class="w-full p-2 border rounded mb-4"
|
15 |
+
rows="4"
|
16 |
+
placeholder="Enter Telugu text here..."></textarea>
|
17 |
+
|
18 |
+
<button
|
19 |
+
onclick="tokenize()"
|
20 |
+
class="bg-blue-500 text-white px-4 py-2 rounded hover:bg-blue-600">
|
21 |
+
Tokenize
|
22 |
+
</button>
|
23 |
+
|
24 |
+
<div id="result" class="mt-6 hidden">
|
25 |
+
<h2 class="text-xl font-semibold mb-2">Results:</h2>
|
26 |
+
<div class="space-y-4">
|
27 |
+
<div>
|
28 |
+
<span class="font-medium">Tokens:</span>
|
29 |
+
<pre id="tokens" class="bg-gray-100 p-2 rounded mt-1"></pre>
|
30 |
+
</div>
|
31 |
+
<div>
|
32 |
+
<span class="font-medium">Decoded:</span>
|
33 |
+
<pre id="decoded" class="bg-gray-100 p-2 rounded mt-1"></pre>
|
34 |
+
</div>
|
35 |
+
<div>
|
36 |
+
<span class="font-medium">Token Details:</span>
|
37 |
+
<div id="token-details" class="bg-gray-100 p-2 rounded mt-1 overflow-x-auto">
|
38 |
+
<table class="min-w-full bg-white border rounded-lg overflow-hidden table-fixed">
|
39 |
+
<thead class="bg-gray-100">
|
40 |
+
<tr>
|
41 |
+
<th class="px-4 py-2 text-left w-1/4">Word</th>
|
42 |
+
<th class="px-4 py-2 text-left w-1/4">Type</th>
|
43 |
+
<th class="px-4 py-2 text-left w-2/4">Token Details</th>
|
44 |
+
</tr>
|
45 |
+
</thead>
|
46 |
+
<tbody id="token-details-body">
|
47 |
+
<!-- Token details will be inserted here -->
|
48 |
+
</tbody>
|
49 |
+
</table>
|
50 |
+
</div>
|
51 |
+
</div>
|
52 |
+
<div id="match-result"></div>
|
53 |
+
</div>
|
54 |
+
</div>
|
55 |
+
</div>
|
56 |
+
</div>
|
57 |
+
|
58 |
+
<script>
|
59 |
+
async function tokenize() {
|
60 |
+
const text = document.getElementById('input-text').value;
|
61 |
+
try {
|
62 |
+
const response = await fetch('/tokenize', {
|
63 |
+
method: 'POST',
|
64 |
+
headers: {
|
65 |
+
'Content-Type': 'application/json',
|
66 |
+
},
|
67 |
+
body: JSON.stringify({ text }),
|
68 |
+
});
|
69 |
+
|
70 |
+
const data = await response.json();
|
71 |
+
|
72 |
+
document.getElementById('result').classList.remove('hidden');
|
73 |
+
document.getElementById('tokens').textContent = JSON.stringify(data.tokens, null, 2);
|
74 |
+
document.getElementById('decoded').textContent = data.decoded;
|
75 |
+
|
76 |
+
// Display token details
|
77 |
+
const detailsBody = document.getElementById('token-details-body');
|
78 |
+
detailsBody.innerHTML = '';
|
79 |
+
|
80 |
+
data.token_details.forEach(detail => {
|
81 |
+
const row = document.createElement('tr');
|
82 |
+
row.className = 'border-b hover:bg-gray-50';
|
83 |
+
|
84 |
+
// Create table cells
|
85 |
+
const wordCell = document.createElement('td');
|
86 |
+
const typeCell = document.createElement('td');
|
87 |
+
const tokenCell = document.createElement('td');
|
88 |
+
|
89 |
+
// Set cell classes for vertical alignment and wrapping
|
90 |
+
wordCell.className = 'px-4 py-2 align-top font-mono border-r';
|
91 |
+
typeCell.className = 'px-4 py-2 align-top border-r';
|
92 |
+
tokenCell.className = 'px-4 py-2 align-top font-mono';
|
93 |
+
|
94 |
+
// Set content
|
95 |
+
wordCell.textContent = detail.word;
|
96 |
+
typeCell.textContent = detail.type;
|
97 |
+
|
98 |
+
// Create a container for token details to ensure proper spacing
|
99 |
+
const tokenList = document.createElement('div');
|
100 |
+
tokenList.className = 'space-y-1';
|
101 |
+
|
102 |
+
if (detail.type === 'complete_word') {
|
103 |
+
const tokenDiv = document.createElement('div');
|
104 |
+
tokenDiv.textContent = `ID ${detail.token_id}: "${detail.text}"`;
|
105 |
+
tokenList.appendChild(tokenDiv);
|
106 |
+
} else if (detail.type === 'subword_tokens') {
|
107 |
+
detail.tokens.forEach(t => {
|
108 |
+
const tokenDiv = document.createElement('div');
|
109 |
+
tokenDiv.textContent = `ID ${t.id}: "${t.text}"`;
|
110 |
+
tokenList.appendChild(tokenDiv);
|
111 |
+
});
|
112 |
+
}
|
113 |
+
|
114 |
+
tokenCell.appendChild(tokenList);
|
115 |
+
|
116 |
+
// Add cells to row
|
117 |
+
row.appendChild(wordCell);
|
118 |
+
row.appendChild(typeCell);
|
119 |
+
row.appendChild(tokenCell);
|
120 |
+
|
121 |
+
detailsBody.appendChild(row);
|
122 |
+
});
|
123 |
+
|
            const matchEl = document.getElementById('match-result');
            matchEl.textContent = data.matches ? '✅ Perfect match!' : '❌ Mismatch';
            matchEl.className = data.matches ? 'text-green-600' : 'text-red-600';
        } catch (error) {
            console.error('Error:', error);
            alert('Error tokenizing text: ' + error.message);
        }
    }
</script>
</body>
</html>
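
As a reference for the request/response contract this script relies on, here is a minimal sketch of calling the /tokenize endpoint outside the browser. It assumes the Space's container is running locally on port 7860 and that the `requests` package is available; the field names (`tokens`, `decoded`, `token_details`, `matches`) are taken from what the JavaScript above reads, so the real response from src/app.py may differ in detail.

# Hedged sketch: exercise the /tokenize endpoint the page script calls.
# Assumes the app is reachable at http://localhost:7860 and that the
# `requests` package is installed; adjust the host/port as needed.
import requests

resp = requests.post(
    "http://localhost:7860/tokenize",
    json={"text": "..."},  # put Telugu input text here
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

print(data["tokens"])    # token IDs produced by the BPE tokenizer
print(data["decoded"])   # text reconstructed from those IDs
print(data["matches"])   # True when the decoded text equals the input
for detail in data["token_details"]:
    # each entry describes one word: either a single complete-word token
    # or a list of subword tokens, mirroring the table rendered above
    print(detail["word"], detail["type"])
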
telugu_base_vocab.json
ADDED
The diff for this file is too large to render. See raw diff

telugu_tokenizer_merges.json
ADDED
The diff for this file is too large to render. See raw diff

telugu_tokenizer_vocab.json
ADDED
The diff for this file is too large to render. See raw diff

training_logs.log
ADDED
@@ -0,0 +1,376 @@
1 |
+
(session10) (base) Chaitanyas-MacBook-Pro:telugu-tokenizer chaitanyasagargurujula$ python src/bpe_tokenizer.py
|
2 |
+
Loading existing base vocabulary...
|
3 |
+
Training new tokenizer...
|
4 |
+
Loading dataset...
|
5 |
+
Preparing training text...
|
6 |
+
Loading documents: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 83866/83866 [00:00<00:00, 88094.70it/s]
|
7 |
+
|
8 |
+
Training on 167732 documents...
|
9 |
+
Total characters in training data: 105,279,512
|
10 |
+
Converting text to token IDs using base vocabulary...
|
11 |
+
|
12 |
+
Before training: text bytes length: 283,496,279
|
13 |
+
Processing 45 chunks using 11 cores...
|
14 |
+
Initial tokenization: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 45/45 [00:04<00:00, 9.95it/s]
|
15 |
+
|
16 |
+
Base vocabulary size: 2400
|
17 |
+
Initial sequence length: 105836015
|
18 |
+
Training BPE: 0%| | 1/2599 [00:37<26:47:26, 37.12s/it]Merge for (304, 333) already exists in the vocabulary.
|
19 |
+
Training BPE: 0%| | 4/2599 [01:26<13:45:51, 19.10s/it]Merge for (296, 333) already exists in the vocabulary.
|
20 |
+
Training BPE: 0%|โ | 6/2599 [01:57<12:20:12, 17.13s/it]Merge for (312, 333) already exists in the vocabulary.
|
21 |
+
Training BPE: 1%|โ | 16/2599 [04:29<10:44:21, 14.97s/it]Merge for (783, 296) already exists in the vocabulary.
|
22 |
+
Training BPE: 1%|โ | 19/2599 [05:13<10:29:00, 14.63s/it]Merge for (296, 319) already exists in the vocabulary.
|
23 |
+
Training BPE: 1%|โ | 23/2599 [06:10<10:13:44, 14.30s/it]Merge for (277, 333) already exists in the vocabulary.
|
24 |
+
Training BPE: 1%|โ | 27/2599 [07:06<10:01:51, 14.04s/it]Merge for (309, 319) already exists in the vocabulary.
|
25 |
+
Training BPE: 1%|โ | 29/2599 [07:33<9:54:13, 13.87s/it]Merge for (282, 327) already exists in the vocabulary.
|
26 |
+
Training BPE: 1%|โ | 34/2599 [08:41<9:39:29, 13.56s/it]Merge for (302, 318) already exists in the vocabulary.
|
27 |
+
Training BPE: 1%|โโ | 38/2599 [09:35<9:36:41, 13.51s/it]Merge for (304, 318) already exists in the vocabulary.
|
28 |
+
Training BPE: 2%|โโ | 39/2599 [09:48<9:34:36, 13.47s/it]Merge for (298, 2403) already exists in the vocabulary.
|
29 |
+
Training BPE: 2%|โโ | 41/2599 [10:15<9:31:03, 13.39s/it]Merge for (1023, 292) already exists in the vocabulary.
|
30 |
+
Training BPE: 2%|โโ | 43/2599 [10:41<9:25:50, 13.28s/it]Merge for (292, 333) already exists in the vocabulary.
|
31 |
+
Training BPE: 2%|โโ | 48/2599 [11:46<9:13:28, 13.02s/it]Merge for (277, 321) already exists in the vocabulary.
|
32 |
+
Training BPE: 2%|โโ | 50/2599 [12:12<9:08:03, 12.90s/it]Merge for (304, 319) already exists in the vocabulary.
|
33 |
+
Training BPE: 2%|โโ | 55/2599 [13:16<9:04:11, 12.83s/it]Merge for (309, 318) already exists in the vocabulary.
|
34 |
+
Training BPE: 2%|โโ | 58/2599 [13:54<8:59:41, 12.74s/it]Merge for (294, 333) already exists in the vocabulary.
|
35 |
+
Training BPE: 2%|โโ | 61/2599 [14:32<8:56:47, 12.69s/it]Merge for (306, 2412) already exists in the vocabulary.
|
36 |
+
Training BPE: 3%|โโ | 66/2599 [15:34<8:46:39, 12.47s/it]Merge for (292, 319) already exists in the vocabulary.
|
37 |
+
Training BPE: 3%|โโ | 68/2599 [15:59<8:43:38, 12.41s/it]Merge for (287, 2412) already exists in the vocabulary.
|
38 |
+
Training BPE: 3%|โโ | 69/2599 [16:12<8:43:06, 12.41s/it]Merge for (304, 321) already exists in the vocabulary.
|
39 |
+
Training BPE: 3%|โโโ | 70/2599 [16:24<8:41:57, 12.38s/it]Merge for (287, 2438) already exists in the vocabulary.
|
40 |
+
Training BPE: 3%|โโโ | 72/2599 [16:48<8:38:32, 12.31s/it]Merge for (403, 311) already exists in the vocabulary.
|
41 |
+
Training BPE: 3%|โโโ | 75/2599 [17:25<8:35:33, 12.26s/it]Merge for (296, 321) already exists in the vocabulary.
|
42 |
+
Training BPE: 3%|โโโ | 76/2599 [17:37<8:34:11, 12.23s/it]Merge for (289, 319) already exists in the vocabulary.
|
43 |
+
Training BPE: 3%|โโโ | 77/2599 [17:49<8:30:31, 12.15s/it]Merge for (309, 327) already exists in the vocabulary.
|
44 |
+
Training BPE: 3%|โโโ | 78/2599 [18:01<8:30:18, 12.15s/it]Merge for (298, 2457) already exists in the vocabulary.
|
45 |
+
Training BPE: 3%|โโโ | 80/2599 [18:26<8:28:40, 12.12s/it]Merge for (277, 318) already exists in the vocabulary.
|
46 |
+
Training BPE: 3%|โโโ | 83/2599 [19:02<8:32:24, 12.22s/it]Merge for (282, 333) already exists in the vocabulary.
|
47 |
+
Training BPE: 3%|โโโ | 84/2599 [19:15<8:33:27, 12.25s/it]Merge for (277, 331) already exists in the vocabulary.
|
48 |
+
Training BPE: 3%|โโโ | 86/2599 [19:39<8:31:13, 12.21s/it]Merge for (289, 333) already exists in the vocabulary.
|
49 |
+
Training BPE: 3%|โโโ | 90/2599 [20:27<8:25:58, 12.10s/it]Merge for (277, 330) already exists in the vocabulary.
|
50 |
+
Training BPE: 4%|โโโ | 91/2599 [20:39<8:25:16, 12.09s/it]Merge for (300, 318) already exists in the vocabulary.
|
51 |
+
Training BPE: 4%|โโโ | 94/2599 [21:15<8:23:51, 12.07s/it]Merge for (298, 328) already exists in the vocabulary.
|
52 |
+
Training BPE: 4%|โโโ | 96/2599 [21:39<8:21:06, 12.01s/it]Merge for (1023, 287) already exists in the vocabulary.
|
53 |
+
Training BPE: 4%|โโโ | 99/2599 [22:15<8:13:43, 11.85s/it]
|
54 |
+
Vocab size: 2500: เฐ + เฐฌ = เฐเฐฌ
|
55 |
+
Merge for (298, 318) already exists in the vocabulary.
|
56 |
+
Training BPE: 4%|โโโ | 100/2599 [22:27<8:15:33, 11.90s/it]Merge for (306, 331) already exists in the vocabulary.
|
57 |
+
Training BPE: 4%|โโโ | 104/2599 [23:14<8:14:28, 11.89s/it]Merge for (298, 331) already exists in the vocabulary.
|
58 |
+
Training BPE: 4%|โโโโ | 106/2599 [23:38<8:11:43, 11.83s/it]Merge for (307, 2412) already exists in the vocabulary.
|
59 |
+
Training BPE: 4%|โโโโ | 110/2599 [24:24<8:05:58, 11.71s/it]Merge for (1023, 293) already exists in the vocabulary.
|
60 |
+
Training BPE: 4%|โโโโ | 111/2599 [24:36<8:04:27, 11.68s/it]Merge for (503, 282) already exists in the vocabulary.
|
61 |
+
Training BPE: 4%|โโโโ | 112/2599 [24:47<7:59:29, 11.57s/it]Merge for (311, 2438) already exists in the vocabulary.
|
62 |
+
Training BPE: 4%|โโโโ | 113/2599 [24:59<7:58:49, 11.56s/it]Merge for (279, 321) already exists in the vocabulary.
|
63 |
+
Training BPE: 4%|โโโโ | 115/2599 [25:22<7:56:27, 11.51s/it]Merge for (303, 318) already exists in the vocabulary.
|
64 |
+
Training BPE: 4%|โโโโ | 116/2599 [25:33<7:56:28, 11.51s/it]Merge for (312, 320) already exists in the vocabulary.
|
65 |
+
Training BPE: 5%|โโโโ | 117/2599 [25:45<7:55:38, 11.50s/it]Merge for (306, 327) already exists in the vocabulary.
|
66 |
+
Training BPE: 5%|โโโโ | 118/2599 [25:56<7:54:21, 11.47s/it]Merge for (296, 327) already exists in the vocabulary.
|
67 |
+
Training BPE: 5%|โโโโ | 121/2599 [26:32<8:06:55, 11.79s/it]Merge for (282, 326) already exists in the vocabulary.
|
68 |
+
Training BPE: 5%|โโโโ | 122/2599 [26:44<8:02:47, 11.69s/it]Merge for (298, 326) already exists in the vocabulary.
|
69 |
+
Training BPE: 5%|โโโโ | 124/2599 [27:06<7:55:34, 11.53s/it]Merge for (287, 320) already exists in the vocabulary.
|
70 |
+
Training BPE: 5%|โโโโ | 126/2599 [27:29<7:50:27, 11.41s/it]Merge for (304, 326) already exists in the vocabulary.
|
71 |
+
Training BPE: 5%|โโโโ | 127/2599 [27:40<7:47:41, 11.35s/it]Merge for (294, 327) already exists in the vocabulary.
|
72 |
+
Training BPE: 5%|โโโโ | 129/2599 [28:03<7:45:52, 11.32s/it]Merge for (312, 319) already exists in the vocabulary.
|
73 |
+
Training BPE: 5%|โโโโ | 133/2599 [28:49<7:50:25, 11.45s/it]Merge for (304, 331) already exists in the vocabulary.
|
74 |
+
Training BPE: 5%|โโโโ | 134/2599 [29:00<7:45:20, 11.33s/it]Merge for (703, 292) already exists in the vocabulary.
|
75 |
+
Training BPE: 5%|โโโโ | 137/2599 [29:33<7:42:08, 11.26s/it]Merge for (277, 327) already exists in the vocabulary.
|
76 |
+
Training BPE: 5%|โโโโโ | 142/2599 [30:29<7:37:31, 11.17s/it]Merge for (306, 333) already exists in the vocabulary.
|
77 |
+
Training BPE: 6%|โโโโโ | 144/2599 [30:51<7:34:32, 11.11s/it]Merge for (302, 319) already exists in the vocabulary.
|
78 |
+
Training BPE: 6%|โโโโโ | 145/2599 [31:03<7:34:21, 11.11s/it]Merge for (310, 318) already exists in the vocabulary.
|
79 |
+
Training BPE: 6%|โโโโโ | 148/2599 [31:36<7:29:56, 11.01s/it]Merge for (277, 2403) already exists in the vocabulary.
|
80 |
+
Training BPE: 6%|โโโโโ | 149/2599 [31:47<7:29:45, 11.01s/it]Merge for (304, 322) already exists in the vocabulary.
|
81 |
+
Training BPE: 6%|โโโโโ | 150/2599 [31:58<7:29:13, 11.01s/it]Merge for (302, 321) already exists in the vocabulary.
|
82 |
+
Training BPE: 6%|โโโโโ | 152/2599 [32:20<7:29:14, 11.02s/it]Merge for (743, 294) already exists in the vocabulary.
|
83 |
+
Training BPE: 6%|โโโโโ | 156/2599 [33:03<7:21:49, 10.85s/it]Merge for (294, 2414) already exists in the vocabulary.
|
84 |
+
Training BPE: 6%|โโโโโ | 157/2599 [33:14<7:23:35, 10.90s/it]Merge for (403, 277) already exists in the vocabulary.
|
85 |
+
Training BPE: 6%|โโโโโ | 158/2599 [33:25<7:23:33, 10.90s/it]Merge for (643, 289) already exists in the vocabulary.
|
86 |
+
Training BPE: 6%|โโโโโ | 159/2599 [33:35<7:20:09, 10.82s/it]Merge for (306, 319) already exists in the vocabulary.
|
87 |
+
Training BPE: 6%|โโโโโ | 162/2599 [34:08<7:19:58, 10.83s/it]Merge for (277, 322) already exists in the vocabulary.
|
88 |
+
Training BPE: 6%|โโโโโ | 164/2599 [34:30<7:17:59, 10.79s/it]Merge for (703, 309) already exists in the vocabulary.
|
89 |
+
Training BPE: 6%|โโโโโ | 166/2599 [34:51<7:16:49, 10.77s/it]Merge for (292, 2403) already exists in the vocabulary.
|
90 |
+
Training BPE: 6%|โโโโโ | 168/2599 [35:13<7:15:41, 10.75s/it]Merge for (304, 327) already exists in the vocabulary.
|
91 |
+
Training BPE: 7%|โโโโโ | 170/2599 [35:34<7:16:07, 10.77s/it]Merge for (403, 287) already exists in the vocabulary.
|
92 |
+
Training BPE: 7%|โโโโโโ | 174/2599 [36:17<7:13:15, 10.72s/it]Merge for (309, 326) already exists in the vocabulary.
|
93 |
+
Training BPE: 7%|โโโโโโ | 175/2599 [36:28<7:13:08, 10.72s/it]Merge for (301, 321) already exists in the vocabulary.
|
94 |
+
Training BPE: 7%|โโโโโโ | 179/2599 [37:10<7:10:36, 10.68s/it]Merge for (294, 319) already exists in the vocabulary.
|
95 |
+
Training BPE: 7%|โโโโโโ | 181/2599 [37:32<7:11:27, 10.71s/it]Merge for (284, 320) already exists in the vocabulary.
|
96 |
+
Training BPE: 7%|โโโโโโ | 189/2599 [38:58<7:19:21, 10.94s/it]Merge for (296, 318) already exists in the vocabulary.
|
97 |
+
Training BPE: 7%|โโโโโโ | 191/2599 [39:20<7:14:50, 10.83s/it]Merge for (302, 2537) already exists in the vocabulary.
|
98 |
+
Training BPE: 7%|โโโโโโ | 192/2599 [39:30<7:09:39, 10.71s/it]Merge for (302, 326) already exists in the vocabulary.
|
99 |
+
Training BPE: 7%|โโโโโโ | 193/2599 [39:41<7:07:00, 10.65s/it]Merge for (306, 321) already exists in the vocabulary.
|
100 |
+
Training BPE: 7%|โโโโโโ | 194/2599 [39:51<7:04:38, 10.59s/it]Merge for (279, 318) already exists in the vocabulary.
|
101 |
+
Training BPE: 8%|โโโโโโ | 195/2599 [40:02<7:03:07, 10.56s/it]Merge for (279, 2403) already exists in the vocabulary.
|
102 |
+
Training BPE: 8%|โโโโโโ | 196/2599 [40:12<7:03:23, 10.57s/it]Merge for (294, 318) already exists in the vocabulary.
|
103 |
+
Training BPE: 8%|โโโโโโ | 197/2599 [40:23<7:02:25, 10.55s/it]Merge for (284, 2414) already exists in the vocabulary.
|
104 |
+
Training BPE: 8%|โโโโโโ | 199/2599 [40:43<6:56:46, 10.42s/it]
|
105 |
+
Vocab size: 2600: เฐทเฑเฐ + เฑเฐฐ = เฐทเฑเฐเฑเฐฐ
|
106 |
+
Training BPE: 8%|โโโโโโ | 200/2599 [40:54<6:56:33, 10.42s/it]Merge for (294, 321) already exists in the vocabulary.
|
107 |
+
Training BPE: 8%|โโโโโโ | 202/2599 [41:15<6:55:18, 10.40s/it]Merge for (312, 326) already exists in the vocabulary.
|
108 |
+
Training BPE: 8%|โโโโโโ | 204/2599 [41:35<6:54:53, 10.39s/it]Merge for (313, 328) already exists in the vocabulary.
|
109 |
+
Training BPE: 8%|โโโโโโโ | 205/2599 [41:46<6:54:28, 10.39s/it]Merge for (289, 318) already exists in the vocabulary.
|
110 |
+
Training BPE: 8%|โโโโโโโ | 208/2599 [42:17<6:55:27, 10.43s/it]Merge for (292, 320) already exists in the vocabulary.
|
111 |
+
Training BPE: 8%|โโโโโโโ | 214/2599 [43:19<6:49:01, 10.29s/it]Merge for (296, 320) already exists in the vocabulary.
|
112 |
+
Training BPE: 8%|โโโโโโโ | 215/2599 [43:29<6:49:19, 10.30s/it]Merge for (294, 320) already exists in the vocabulary.
|
113 |
+
Training BPE: 8%|โโโโโโโ | 216/2599 [43:40<6:49:14, 10.30s/it]Merge for (287, 319) already exists in the vocabulary.
|
114 |
+
Training BPE: 8%|โโโโโโโ | 220/2599 [44:21<6:43:30, 10.18s/it]Merge for (309, 320) already exists in the vocabulary.
|
115 |
+
Training BPE: 9%|โโโโโโโ | 222/2599 [44:41<6:43:40, 10.19s/it]Merge for (295, 2414) already exists in the vocabulary.
|
116 |
+
Training BPE: 9%|โโโโโโโ | 230/2599 [46:03<6:43:24, 10.22s/it]Merge for (300, 320) already exists in the vocabulary.
|
117 |
+
Training BPE: 9%|โโโโโโโ | 231/2599 [46:13<6:43:10, 10.22s/it]Merge for (310, 2403) already exists in the vocabulary.
|
118 |
+
Training BPE: 9%|โโโโโโโ | 234/2599 [46:43<6:42:05, 10.20s/it]Merge for (783, 303) already exists in the vocabulary.
|
119 |
+
Training BPE: 9%|โโโโโโโโ | 240/2599 [47:45<6:38:23, 10.13s/it]Merge for (298, 327) already exists in the vocabulary.
|
120 |
+
Training BPE: 9%|โโโโโโโโ | 243/2599 [48:15<6:35:26, 10.07s/it]Merge for (310, 333) already exists in the vocabulary.
|
121 |
+
Training BPE: 9%|โโโโโโโโ | 246/2599 [48:45<6:33:10, 10.03s/it]Merge for (312, 318) already exists in the vocabulary.
|
122 |
+
Training BPE: 10%|โโโโโโโโ | 250/2599 [49:25<6:30:14, 9.97s/it]Merge for (306, 318) already exists in the vocabulary.
|
123 |
+
Training BPE: 10%|โโโโโโโโ | 251/2599 [49:35<6:31:08, 10.00s/it]Merge for (302, 328) already exists in the vocabulary.
|
124 |
+
Training BPE: 10%|โโโโโโโโ | 252/2599 [49:45<6:31:56, 10.02s/it]Merge for (309, 2414) already exists in the vocabulary.
|
125 |
+
Training BPE: 10%|โโโโโโโโ | 257/2599 [50:35<6:30:24, 10.00s/it]Merge for (298, 320) already exists in the vocabulary.
|
126 |
+
Training BPE: 10%|โโโโโโโโ | 258/2599 [50:45<6:30:24, 10.01s/it]Merge for (289, 321) already exists in the vocabulary.
|
127 |
+
Training BPE: 10%|โโโโโโโโ | 260/2599 [51:04<6:28:06, 9.96s/it]Merge for (300, 333) already exists in the vocabulary.
|
128 |
+
Training BPE: 10%|โโโโโโโโ | 263/2599 [51:34<6:26:37, 9.93s/it]Merge for (312, 321) already exists in the vocabulary.
|
129 |
+
Training BPE: 10%|โโโโโโโโ | 266/2599 [52:04<6:25:29, 9.91s/it]Merge for (311, 333) already exists in the vocabulary.
|
130 |
+
Training BPE: 10%|โโโโโโโโ | 268/2599 [52:24<6:24:38, 9.90s/it]Merge for (298, 321) already exists in the vocabulary.
|
131 |
+
Training BPE: 10%|โโโโโโโโ | 269/2599 [52:34<6:24:45, 9.91s/it]Merge for (312, 258) already exists in the vocabulary.
|
132 |
+
Training BPE: 10%|โโโโโโโโโ | 271/2599 [52:54<6:31:20, 10.09s/it]Merge for (284, 318) already exists in the vocabulary.
|
133 |
+
Training BPE: 11%|โโโโโโโโโ | 275/2599 [53:35<6:30:29, 10.08s/it]Merge for (302, 331) already exists in the vocabulary.
|
134 |
+
Training BPE: 11%|โโโโโโโโโ | 278/2599 [54:04<6:19:09, 9.80s/it]Merge for (923, 310) already exists in the vocabulary.
|
135 |
+
Training BPE: 11%|โโโโโโโโโ | 283/2599 [54:53<6:17:56, 9.79s/it]Merge for (743, 295) already exists in the vocabulary.
|
136 |
+
Training BPE: 11%|โโโโโโโโโ | 286/2599 [55:22<6:16:20, 9.76s/it]Merge for (304, 320) already exists in the vocabulary.
|
137 |
+
Training BPE: 11%|โโโโโโโโโ | 289/2599 [55:52<6:13:59, 9.71s/it]Merge for (309, 328) already exists in the vocabulary.
|
138 |
+
Training BPE: 11%|โโโโโโโโโ | 291/2599 [56:11<6:13:59, 9.72s/it]Merge for (282, 319) already exists in the vocabulary.
|
139 |
+
Training BPE: 11%|โโโโโโโโโ | 293/2599 [56:30<6:13:50, 9.73s/it]Merge for (279, 333) already exists in the vocabulary.
|
140 |
+
Training BPE: 11%|โโโโโโโโโ | 297/2599 [57:09<6:09:01, 9.62s/it]Merge for (292, 2414) already exists in the vocabulary.
|
141 |
+
Training BPE: 12%|โโโโโโโโโ | 299/2599 [57:28<6:09:03, 9.63s/it]
|
142 |
+
Vocab size: 2700: เฐ + เฐพเฐฐ = เฐเฐพเฐฐ
|
143 |
+
Training BPE: 12%|โโโโโโโโโ | 302/2599 [57:57<6:07:00, 9.59s/it]Merge for (302, 320) already exists in the vocabulary.
|
144 |
+
Training BPE: 12%|โโโโโโโโโโ | 308/2599 [58:54<6:05:50, 9.58s/it]Merge for (302, 327) already exists in the vocabulary.
|
145 |
+
Training BPE: 12%|โโโโโโโโโโ | 310/2599 [59:13<6:03:35, 9.53s/it]Merge for (304, 328) already exists in the vocabulary.
|
146 |
+
Training BPE: 12%|โโโโโโโโโโ | 321/2599 [1:00:58<5:58:43, 9.45s/it]Merge for (303, 2414) already exists in the vocabulary.
|
147 |
+
Training BPE: 12%|โโโโโโโโโโ | 322/2599 [1:01:07<5:58:57, 9.46s/it]Merge for (292, 318) already exists in the vocabulary.
|
148 |
+
Training BPE: 13%|โโโโโโโโโโ | 326/2599 [1:01:45<5:55:14, 9.38s/it]Merge for (289, 320) already exists in the vocabulary.
|
149 |
+
Training BPE: 13%|โโโโโโโโโโ | 331/2599 [1:02:32<5:54:40, 9.38s/it]Merge for (287, 333) already exists in the vocabulary.
|
150 |
+
Training BPE: 13%|โโโโโโโโโโ | 332/2599 [1:02:41<5:54:26, 9.38s/it]Merge for (287, 321) already exists in the vocabulary.
|
151 |
+
Training BPE: 13%|โโโโโโโโโโ | 341/2599 [1:04:05<5:50:36, 9.32s/it]Merge for (284, 327) already exists in the vocabulary.
|
152 |
+
Training BPE: 14%|โโโโโโโโโโโ | 351/2599 [1:05:38<5:50:00, 9.34s/it]Merge for (277, 326) already exists in the vocabulary.
|
153 |
+
Training BPE: 14%|โโโโโโโโโโโ | 353/2599 [1:05:57<5:48:10, 9.30s/it]Merge for (1023, 298) already exists in the vocabulary.
|
154 |
+
Training BPE: 14%|โโโโโโโโโโโ | 361/2599 [1:07:11<5:45:45, 9.27s/it]Merge for (302, 322) already exists in the vocabulary.
|
155 |
+
Training BPE: 14%|โโโโโโโโโโโ | 363/2599 [1:07:30<5:44:06, 9.23s/it]Merge for (287, 2403) already exists in the vocabulary.
|
156 |
+
Training BPE: 14%|โโโโโโโโโโโ | 372/2599 [1:08:52<5:41:05, 9.19s/it]Merge for (295, 318) already exists in the vocabulary.
|
157 |
+
Training BPE: 15%|โโโโโโโโโโโ | 377/2599 [1:09:38<5:38:03, 9.13s/it]Merge for (279, 330) already exists in the vocabulary.
|
158 |
+
Training BPE: 15%|โโโโโโโโโโโ | 380/2599 [1:10:06<5:37:53, 9.14s/it]Merge for (298, 333) already exists in the vocabulary.
|
159 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 385/2599 [1:10:51<5:34:22, 9.06s/it]Merge for (309, 333) already exists in the vocabulary.
|
160 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 388/2599 [1:11:18<5:34:17, 9.07s/it]Merge for (302, 330) already exists in the vocabulary.
|
161 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 391/2599 [1:11:45<5:32:07, 9.03s/it]Merge for (278, 2414) already exists in the vocabulary.
|
162 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 392/2599 [1:11:54<5:30:20, 8.98s/it]Merge for (301, 318) already exists in the vocabulary.
|
163 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 399/2599 [1:12:57<5:30:04, 9.00s/it]
|
164 |
+
Vocab size: 2800: เฐฒ + เฑเฐธ = เฐฒเฑเฐธ
|
165 |
+
Training BPE: 15%|โโโโโโโโโโโโ | 400/2599 [1:13:06<5:30:41, 9.02s/it]Merge for (843, 300) already exists in the vocabulary.
|
166 |
+
Training BPE: 16%|โโโโโโโโโโโโ | 405/2599 [1:13:52<5:30:04, 9.03s/it]Merge for (298, 330) already exists in the vocabulary.
|
167 |
+
Training BPE: 16%|โโโโโโโโโโโโโ | 420/2599 [1:16:05<5:24:36, 8.94s/it]Merge for (282, 322) already exists in the vocabulary.
|
168 |
+
Training BPE: 16%|โโโโโโโโโโโโโ | 422/2599 [1:16:23<5:23:56, 8.93s/it]Merge for (923, 292) already exists in the vocabulary.
|
169 |
+
Training BPE: 16%|โโโโโโโโโโโโโ | 423/2599 [1:16:32<5:23:55, 8.93s/it]Merge for (301, 2414) already exists in the vocabulary.
|
170 |
+
Training BPE: 16%|โโโโโโโโโโโโโ | 425/2599 [1:16:50<5:22:38, 8.90s/it]Merge for (279, 331) already exists in the vocabulary.
|
171 |
+
Training BPE: 16%|โโโโโโโโโโโโโ | 428/2599 [1:17:17<5:24:58, 8.98s/it]Merge for (303, 319) already exists in the vocabulary.
|
172 |
+
Training BPE: 17%|โโโโโโโโโโโโโ | 446/2599 [1:19:56<5:15:42, 8.80s/it]Merge for (313, 318) already exists in the vocabulary.
|
173 |
+
Training BPE: 17%|โโโโโโโโโโโโโโ | 449/2599 [1:20:23<5:17:26, 8.86s/it]Merge for (301, 319) already exists in the vocabulary.
|
174 |
+
Training BPE: 17%|โโโโโโโโโโโโโโ | 450/2599 [1:20:31<5:16:23, 8.83s/it]Merge for (277, 319) already exists in the vocabulary.
|
175 |
+
Training BPE: 17%|โโโโโโโโโโโโโโ | 452/2599 [1:20:49<5:15:23, 8.81s/it]Merge for (312, 331) already exists in the vocabulary.
|
176 |
+
Training BPE: 17%|โโโโโโโโโโโโโโ | 453/2599 [1:20:58<5:15:21, 8.82s/it]Merge for (284, 319) already exists in the vocabulary.
|
177 |
+
Training BPE: 17%|โโโโโโโโโโโโโโ | 454/2599 [1:21:07<5:14:49, 8.81s/it]Merge for (312, 327) already exists in the vocabulary.
|
178 |
+
Training BPE: 18%|โโโโโโโโโโโโโโ | 460/2599 [1:21:59<5:12:18, 8.76s/it]Merge for (287, 326) already exists in the vocabulary.
|
179 |
+
Training BPE: 18%|โโโโโโโโโโโโโโ | 462/2599 [1:22:17<5:13:12, 8.79s/it]Merge for (313, 326) already exists in the vocabulary.
|
180 |
+
Training BPE: 18%|โโโโโโโโโโโโโโ | 465/2599 [1:22:43<5:13:17, 8.81s/it]Merge for (284, 326) already exists in the vocabulary.
|
181 |
+
Training BPE: 18%|โโโโโโโโโโโโโโ | 471/2599 [1:23:36<5:09:57, 8.74s/it]Merge for (277, 323) already exists in the vocabulary.
|
182 |
+
Training BPE: 18%|โโโโโโโโโโโโโโ | 475/2599 [1:24:11<5:05:57, 8.64s/it]Merge for (298, 319) already exists in the vocabulary.
|
183 |
+
Training BPE: 19%|โโโโโโโโโโโโโโโ | 485/2599 [1:25:37<5:03:47, 8.62s/it]Merge for (310, 319) already exists in the vocabulary.
|
184 |
+
Training BPE: 19%|โโโโโโโโโโโโโโโ | 489/2599 [1:26:11<5:03:59, 8.64s/it]Merge for (312, 322) already exists in the vocabulary.
|
185 |
+
Training BPE: 19%|โโโโโโโโโโโโโโโ | 494/2599 [1:26:55<5:02:17, 8.62s/it]Merge for (301, 322) already exists in the vocabulary.
|
186 |
+
Training BPE: 19%|โโโโโโโโโโโโโโโ | 499/2599 [1:27:38<5:01:00, 8.60s/it]
|
187 |
+
Vocab size: 2900: เฐจ + เฑเฐจเฑเฐจ = เฐจเฑเฐจเฑเฐจ
|
188 |
+
Training BPE: 19%|โโโโโโโโโโโโโโโ | 503/2599 [1:28:12<4:57:10, 8.51s/it]Merge for (279, 319) already exists in the vocabulary.
|
189 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโ | 515/2599 [1:29:54<4:55:09, 8.50s/it]Merge for (300, 321) already exists in the vocabulary.
|
190 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโโ | 520/2599 [1:30:37<4:55:10, 8.52s/it]Merge for (312, 328) already exists in the vocabulary.
|
191 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโโ | 524/2599 [1:31:11<4:57:02, 8.59s/it]Merge for (303, 322) already exists in the vocabulary.
|
192 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโโ | 525/2599 [1:31:20<4:57:28, 8.61s/it]Merge for (963, 309) already exists in the vocabulary.
|
193 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโโ | 526/2599 [1:31:28<4:56:30, 8.58s/it]Merge for (299, 319) already exists in the vocabulary.
|
194 |
+
Training BPE: 20%|โโโโโโโโโโโโโโโโ | 527/2599 [1:31:37<4:56:00, 8.57s/it]Merge for (300, 326) already exists in the vocabulary.
|
195 |
+
Training BPE: 21%|โโโโโโโโโโโโโโโโ | 535/2599 [1:32:44<4:51:41, 8.48s/it]Merge for (443, 279) already exists in the vocabulary.
|
196 |
+
Training BPE: 21%|โโโโโโโโโโโโโโโโ | 536/2599 [1:32:53<4:51:43, 8.48s/it]Merge for (300, 331) already exists in the vocabulary.
|
197 |
+
Training BPE: 21%|โโโโโโโโโโโโโโโโ | 537/2599 [1:33:01<4:52:00, 8.50s/it]Merge for (306, 320) already exists in the vocabulary.
|
198 |
+
Training BPE: 21%|โโโโโโโโโโโโโโโโ | 539/2599 [1:33:19<4:53:04, 8.54s/it]Merge for (703, 312) already exists in the vocabulary.
|
199 |
+
Training BPE: 22%|โโโโโโโโโโโโโโโโโ | 563/2599 [1:36:40<4:45:12, 8.41s/it]Merge for (1023, 303) already exists in the vocabulary.
|
200 |
+
Training BPE: 22%|โโโโโโโโโโโโโโโโโ | 566/2599 [1:37:06<4:44:06, 8.38s/it]Merge for (292, 330) already exists in the vocabulary.
|
201 |
+
Training BPE: 22%|โโโโโโโโโโโโโโโโโ | 568/2599 [1:37:22<4:44:37, 8.41s/it]Merge for (294, 2403) already exists in the vocabulary.
|
202 |
+
Training BPE: 22%|โโโโโโโโโโโโโโโโโ | 579/2599 [1:38:55<4:43:24, 8.42s/it]Merge for (306, 328) already exists in the vocabulary.
|
203 |
+
Training BPE: 22%|โโโโโโโโโโโโโโโโโ | 581/2599 [1:39:12<4:43:46, 8.44s/it]Merge for (923, 282) already exists in the vocabulary.
|
204 |
+
Training BPE: 23%|โโโโโโโโโโโโโโโโโโ | 597/2599 [1:41:26<4:38:58, 8.36s/it]Merge for (309, 323) already exists in the vocabulary.
|
205 |
+
Training BPE: 23%|โโโโโโโโโโโโโโโโโโ | 599/2599 [1:41:43<4:39:47, 8.39s/it]
|
206 |
+
Vocab size: 3000: (เฐเฐเฐงเฑเฐฐเฐเฑเฐฏเฑเฐคเฐฟ) + : = (เฐเฐเฐงเฑเฐฐเฐเฑเฐฏเฑเฐคเฐฟ):
|
207 |
+
Training BPE: 23%|โโโโโโโโโโโโโโโโโโ | 601/2599 [1:42:00<4:38:34, 8.37s/it]Merge for (923, 302) already exists in the vocabulary.
|
208 |
+
Training BPE: 23%|โโโโโโโโโโโโโโโโโโ | 609/2599 [1:43:06<4:36:56, 8.35s/it]Merge for (923, 293) already exists in the vocabulary.
|
209 |
+
Training BPE: 23%|โโโโโโโโโโโโโโโโโโ | 610/2599 [1:43:15<4:37:18, 8.37s/it]Merge for (296, 331) already exists in the vocabulary.
|
210 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโ | 612/2599 [1:43:32<4:37:01, 8.37s/it]Merge for (300, 319) already exists in the vocabulary.
|
211 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโ | 613/2599 [1:43:40<4:35:10, 8.31s/it]Merge for (289, 2403) already exists in the vocabulary.
|
212 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโ | 614/2599 [1:43:48<4:34:45, 8.31s/it]Merge for (296, 326) already exists in the vocabulary.
|
213 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโ | 616/2599 [1:44:05<4:34:12, 8.30s/it]Merge for (310, 321) already exists in the vocabulary.
|
214 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโ | 619/2599 [1:44:30<4:36:34, 8.38s/it]Merge for (292, 327) already exists in the vocabulary.
|
215 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโโ | 626/2599 [1:45:28<4:33:09, 8.31s/it]Merge for (284, 333) already exists in the vocabulary.
|
216 |
+
Training BPE: 24%|โโโโโโโโโโโโโโโโโโโ | 633/2599 [1:46:26<4:31:57, 8.30s/it]Merge for (1003, 291) already exists in the vocabulary.
|
217 |
+
Training BPE: 25%|โโโโโโโโโโโโโโโโโโโ | 637/2599 [1:46:59<4:31:02, 8.29s/it]Merge for (295, 319) already exists in the vocabulary.
|
218 |
+
Training BPE: 25%|โโโโโโโโโโโโโโโโโโโ | 649/2599 [1:48:38<4:28:15, 8.25s/it]Merge for (278, 318) already exists in the vocabulary.
|
219 |
+
Training BPE: 25%|โโโโโโโโโโโโโโโโโโโโ | 660/2599 [1:50:09<4:27:54, 8.29s/it]Merge for (282, 328) already exists in the vocabulary.
|
220 |
+
Training BPE: 26%|โโโโโโโโโโโโโโโโโโโโ | 663/2599 [1:50:34<4:26:33, 8.26s/it]Merge for (313, 319) already exists in the vocabulary.
|
221 |
+
Training BPE: 26%|โโโโโโโโโโโโโโโโโโโโ | 671/2599 [1:51:40<4:24:56, 8.24s/it]Merge for (292, 321) already exists in the vocabulary.
|
222 |
+
Training BPE: 26%|โโโโโโโโโโโโโโโโโโโโ | 677/2599 [1:52:30<4:23:41, 8.23s/it]Merge for (292, 331) already exists in the vocabulary.
|
223 |
+
Training BPE: 27%|โโโโโโโโโโโโโโโโโโโโโ | 699/2599 [1:55:28<4:16:20, 8.09s/it]
|
224 |
+
Vocab size: 3100: , + เฐจเฐตเฐเฐฌเฐฐเฑ = , เฐจเฐตเฐเฐฌเฐฐเฑ
|
225 |
+
Training BPE: 27%|โโโโโโโโโโโโโโโโโโโโโ | 712/2599 [1:57:13<4:14:06, 8.08s/it]Merge for (306, 326) already exists in the vocabulary.
|
226 |
+
Training BPE: 28%|โโโโโโโโโโโโโโโโโโโโโ | 716/2599 [1:57:46<4:13:56, 8.09s/it]Merge for (296, 322) already exists in the vocabulary.
|
227 |
+
Training BPE: 28%|โโโโโโโโโโโโโโโโโโโโโ | 717/2599 [1:57:54<4:15:07, 8.13s/it]Merge for (277, 320) already exists in the vocabulary.
|
228 |
+
Training BPE: 29%|โโโโโโโโโโโโโโโโโโโโโโ | 749/2599 [2:02:11<4:06:19, 7.99s/it]Merge for (302, 333) already exists in the vocabulary.
|
229 |
+
Training BPE: 29%|โโโโโโโโโโโโโโโโโโโโโโ | 756/2599 [2:03:08<4:07:00, 8.04s/it]Merge for (287, 318) already exists in the vocabulary.
|
230 |
+
Training BPE: 29%|โโโโโโโโโโโโโโโโโโโโโโโ | 761/2599 [2:03:48<4:06:58, 8.06s/it]Merge for (299, 331) already exists in the vocabulary.
|
231 |
+
Training BPE: 29%|โโโโโโโโโโโโโโโโโโโโโโโ | 766/2599 [2:04:28<4:03:32, 7.97s/it]Merge for (292, 326) already exists in the vocabulary.
|
232 |
+
Training BPE: 30%|โโโโโโโโโโโโโโโโโโโโโโโ | 771/2599 [2:05:08<4:01:41, 7.93s/it]Merge for (803, 292) already exists in the vocabulary.
|
233 |
+
Training BPE: 30%|โโโโโโโโโโโโโโโโโโโโโโโ | 776/2599 [2:05:48<4:02:30, 7.98s/it]Merge for (306, 2457) already exists in the vocabulary.
|
234 |
+
Training BPE: 30%|โโโโโโโโโโโโโโโโโโโโโโโ | 783/2599 [2:06:44<4:04:33, 8.08s/it]Merge for (403, 292) already exists in the vocabulary.
|
235 |
+
Training BPE: 31%|โโโโโโโโโโโโโโโโโโโโโโโโ | 794/2599 [2:08:14<4:04:44, 8.14s/it]Merge for (309, 2403) already exists in the vocabulary.
|
236 |
+
Training BPE: 31%|โโโโโโโโโโโโโโโโโโโโโโโโ | 799/2599 [2:08:54<4:04:07, 8.14s/it]
|
237 |
+
Vocab size: 3200: เฑ + เฐ = เฑเฐ
|
238 |
+
Training BPE: 31%|โโโโโโโโโโโโโโโโโโโโโโโโ | 804/2599 [2:09:35<4:03:09, 8.13s/it]Merge for (291, 318) already exists in the vocabulary.
|
239 |
+
Training BPE: 31%|โโโโโโโโโโโโโโโโโโโโโโโโ | 807/2599 [2:09:59<4:02:25, 8.12s/it]Merge for (309, 321) already exists in the vocabulary.
|
240 |
+
Training BPE: 32%|โโโโโโโโโโโโโโโโโโโโโโโโ | 823/2599 [2:12:07<3:56:31, 7.99s/it]Merge for (923, 309) already exists in the vocabulary.
|
241 |
+
Training BPE: 32%|โโโโโโโโโโโโโโโโโโโโโโโโโ | 837/2599 [2:14:00<3:54:25, 7.98s/it]Merge for (309, 331) already exists in the vocabulary.
|
242 |
+
Training BPE: 32%|โโโโโโโโโโโโโโโโโโโโโโโโโ | 844/2599 [2:14:56<3:57:36, 8.12s/it]Merge for (289, 326) already exists in the vocabulary.
|
243 |
+
Training BPE: 33%|โโโโโโโโโโโโโโโโโโโโโโโโโ | 856/2599 [2:16:33<3:52:33, 8.01s/it]Merge for (313, 331) already exists in the vocabulary.
|
244 |
+
Training BPE: 33%|โโโโโโโโโโโโโโโโโโโโโโโโโโ | 868/2599 [2:18:09<3:51:08, 8.01s/it]Merge for (298, 2438) already exists in the vocabulary.
|
245 |
+
Training BPE: 34%|โโโโโโโโโโโโโโโโโโโโโโโโโโ | 871/2599 [2:18:33<3:51:14, 8.03s/it]Merge for (295, 333) already exists in the vocabulary.
|
246 |
+
Training BPE: 34%|โโโโโโโโโโโโโโโโโโโโโโโโโโ | 882/2599 [2:20:01<3:50:30, 8.05s/it]Merge for (298, 322) already exists in the vocabulary.
|
247 |
+
Training BPE: 34%|โโโโโโโโโโโโโโโโโโโโโโโโโโ | 889/2599 [2:20:58<3:48:58, 8.03s/it]Merge for (287, 331) already exists in the vocabulary.
|
248 |
+
Training BPE: 35%|โโโโโโโโโโโโโโโโโโโโโโโโโโโ | 899/2599 [2:22:18<3:48:23, 8.06s/it]
|
249 |
+
Vocab size: 3300: เฑเฐ + เฑ = เฑเฐเฑ
|
250 |
+
Training BPE: 35%|โโโโโโโโโโโโโโโโโโโโโโโโโโโ | 907/2599 [2:23:23<3:47:26, 8.07s/it]Merge for (299, 333) already exists in the vocabulary.
|
251 |
+
Training BPE: 35%|โโโโโโโโโโโโโโโโโโโโโโโโโโโ | 914/2599 [2:24:18<3:42:35, 7.93s/it]Merge for (1023, 309) already exists in the vocabulary.
|
252 |
+
Training BPE: 36%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 933/2599 [2:26:51<3:42:47, 8.02s/it]Merge for (300, 2403) already exists in the vocabulary.
|
253 |
+
Training BPE: 36%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 948/2599 [2:28:52<3:40:22, 8.01s/it]Merge for (279, 332) already exists in the vocabulary.
|
254 |
+
Training BPE: 37%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 974/2599 [2:32:20<3:38:58, 8.09s/it]Merge for (284, 328) already exists in the vocabulary.
|
255 |
+
Training BPE: 38%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 978/2599 [2:32:52<3:37:19, 8.04s/it]Merge for (279, 2414) already exists in the vocabulary.
|
256 |
+
Training BPE: 38%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 996/2599 [2:35:16<3:34:13, 8.02s/it]Merge for (282, 318) already exists in the vocabulary.
|
257 |
+
Training BPE: 38%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 999/2599 [2:35:40<3:33:57, 8.02s/it]
|
258 |
+
Vocab size: 3400: เฐต + เฑ = เฐตเฑ
|
259 |
+
Training BPE: 38%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1000/2599 [2:35:48<3:34:52, 8.06s/it]Merge for (284, 331) already exists in the vocabulary.
|
260 |
+
Training BPE: 39%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1019/2599 [2:38:19<3:27:43, 7.89s/it]Merge for (299, 326) already exists in the vocabulary.
|
261 |
+
Training BPE: 39%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1025/2599 [2:39:07<3:29:29, 7.99s/it]Merge for (307, 318) already exists in the vocabulary.
|
262 |
+
Training BPE: 40%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1034/2599 [2:40:19<3:25:44, 7.89s/it]Merge for (313, 320) already exists in the vocabulary.
|
263 |
+
Training BPE: 40%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1039/2599 [2:40:58<3:25:01, 7.89s/it]Merge for (983, 296) already exists in the vocabulary.
|
264 |
+
Training BPE: 40%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1048/2599 [2:42:10<3:26:27, 7.99s/it]Merge for (289, 327) already exists in the vocabulary.
|
265 |
+
Training BPE: 41%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1074/2599 [2:45:37<3:19:53, 7.86s/it]Merge for (287, 328) already exists in the vocabulary.
|
266 |
+
Training BPE: 42%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1097/2599 [2:48:39<3:18:38, 7.94s/it]Merge for (289, 328) already exists in the vocabulary.
|
267 |
+
Training BPE: 42%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1098/2599 [2:48:47<3:20:05, 8.00s/it]Merge for (302, 323) already exists in the vocabulary.
|
268 |
+
Training BPE: 42%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1099/2599 [2:48:55<3:19:12, 7.97s/it]
|
269 |
+
Vocab size: 3500: เฐฎ + เฑ = เฐฎเฑ
|
270 |
+
Training BPE: 43%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1114/2599 [2:50:55<3:18:19, 8.01s/it]Merge for (311, 327) already exists in the vocabulary.
|
271 |
+
Training BPE: 44%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1140/2599 [2:54:21<3:13:37, 7.96s/it]Merge for (294, 322) already exists in the vocabulary.
|
272 |
+
Training BPE: 45%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1158/2599 [2:56:44<3:11:07, 7.96s/it]Merge for (299, 320) already exists in the vocabulary.
|
273 |
+
Training BPE: 45%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1181/2599 [2:59:46<3:06:18, 7.88s/it]Merge for (983, 309) already exists in the vocabulary.
|
274 |
+
Training BPE: 46%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1199/2599 [3:02:09<3:04:53, 7.92s/it]
|
275 |
+
Vocab size: 3600: เฐธเฑเฐ + เฑ = เฐธเฑเฐเฑ
|
276 |
+
Training BPE: 47%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1210/2599 [3:03:37<3:04:37, 7.98s/it]Merge for (311, 318) already exists in the vocabulary.
|
277 |
+
Training BPE: 47%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1214/2599 [3:04:08<3:01:11, 7.85s/it]Merge for (300, 2412) already exists in the vocabulary.
|
278 |
+
Training BPE: 47%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1224/2599 [3:05:26<2:59:35, 7.84s/it]Merge for (282, 331) already exists in the vocabulary.
|
279 |
+
Training BPE: 47%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1226/2599 [3:05:42<3:00:33, 7.89s/it]Merge for (299, 328) already exists in the vocabulary.
|
280 |
+
Training BPE: 48%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1241/2599 [3:07:40<2:57:38, 7.85s/it]Merge for (307, 333) already exists in the vocabulary.
|
281 |
+
Training BPE: 48%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1254/2599 [3:09:23<2:57:35, 7.92s/it]Merge for (310, 327) already exists in the vocabulary.
|
282 |
+
Training BPE: 49%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1268/2599 [3:11:12<2:53:54, 7.84s/it]Merge for (923, 279) already exists in the vocabulary.
|
283 |
+
Training BPE: 49%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1274/2599 [3:11:59<2:50:48, 7.73s/it]Merge for (303, 321) already exists in the vocabulary.
|
284 |
+
Training BPE: 49%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1279/2599 [3:12:39<2:53:24, 7.88s/it]Merge for (284, 321) already exists in the vocabulary.
|
285 |
+
Training BPE: 49%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1280/2599 [3:12:47<2:53:37, 7.90s/it]Merge for (294, 331) already exists in the vocabulary.
|
286 |
+
Training BPE: 50%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1298/2599 [3:15:08<2:49:45, 7.83s/it]Merge for (923, 298) already exists in the vocabulary.
|
287 |
+
Training BPE: 50%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1299/2599 [3:15:16<2:49:19, 7.82s/it]
|
288 |
+
Vocab size: 3700: เฐฐเฑ + เฐช = เฐฐเฑเฐช
|
289 |
+
Training BPE: 50%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1306/2599 [3:16:11<2:48:35, 7.82s/it]Merge for (284, 322) already exists in the vocabulary.
|
290 |
+
Training BPE: 50%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1309/2599 [3:16:34<2:47:53, 7.81s/it]Merge for (310, 331) already exists in the vocabulary.
|
291 |
+
Training BPE: 51%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1314/2599 [3:17:13<2:48:02, 7.85s/it]Merge for (287, 327) already exists in the vocabulary.
|
292 |
+
Training BPE: 51%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1334/2599 [3:19:48<2:45:00, 7.83s/it]Merge for (282, 321) already exists in the vocabulary.
|
293 |
+
Training BPE: 52%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1341/2599 [3:20:43<2:42:21, 7.74s/it]Merge for (277, 332) already exists in the vocabulary.
|
294 |
+
Training BPE: 52%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1343/2599 [3:20:58<2:42:16, 7.75s/it]Merge for (277, 2412) already exists in the vocabulary.
|
295 |
+
Training BPE: 52%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1346/2599 [3:21:22<2:42:56, 7.80s/it]Merge for (282, 320) already exists in the vocabulary.
|
296 |
+
Training BPE: 52%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1349/2599 [3:21:45<2:41:49, 7.77s/it]Merge for (299, 2438) already exists in the vocabulary.
|
297 |
+
Training BPE: 54%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1392/2599 [3:27:18<2:36:01, 7.76s/it]Merge for (289, 2412) already exists in the vocabulary.
|
298 |
+
Training BPE: 54%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1399/2599 [3:28:12<2:33:45, 7.69s/it]
|
299 |
+
Vocab size: 3800: เฐซเฐฟเฐฐเฑเฐฏเฐพ + เฐฆเฑ = เฐซเฐฟเฐฐเฑเฐฏเฐพเฐฆเฑ
|
300 |
+
Training BPE: 55%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1421/2599 [3:31:01<2:29:57, 7.64s/it]Merge for (294, 330) already exists in the vocabulary.
|
301 |
+
Training BPE: 55%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1442/2599 [3:33:44<2:29:12, 7.74s/it]Merge for (313, 333) already exists in the vocabulary.
|
302 |
+
Training BPE: 57%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1478/2599 [3:38:20<2:23:52, 7.70s/it]Merge for (783, 312) already exists in the vocabulary.
|
303 |
+
Training BPE: 57%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1483/2599 [3:38:57<2:20:01, 7.53s/it]Merge for (1003, 288) already exists in the vocabulary.
|
304 |
+
Training BPE: 57%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1491/2599 [3:39:59<2:21:18, 7.65s/it]Merge for (703, 302) already exists in the vocabulary.
|
305 |
+
Training BPE: 58%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1499/2599 [3:41:01<2:20:24, 7.66s/it]
|
306 |
+
Vocab size: 3900: เฐฎเฐพเฐเฑเฐฒเฐพเฐก + เฐพเฐฐเฑ. = เฐฎเฐพเฐเฑเฐฒเฐพเฐกเฐพเฐฐเฑ.
|
307 |
+
Training BPE: 58%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1508/2599 [3:42:10<2:18:48, 7.63s/it]Merge for (312, 330) already exists in the vocabulary.
|
308 |
+
Training BPE: 60%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1565/2599 [3:49:23<2:10:01, 7.55s/it]Merge for (298, 2412) already exists in the vocabulary.
|
309 |
+
Training BPE: 61%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1585/2599 [3:51:55<2:07:34, 7.55s/it]Merge for (300, 328) already exists in the vocabulary.
|
310 |
+
Training BPE: 61%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1597/2599 [3:53:26<2:04:38, 7.46s/it]Merge for (296, 328) already exists in the vocabulary.
|
311 |
+
Training BPE: 62%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1599/2599 [3:53:41<2:05:58, 7.56s/it]
|
312 |
+
Vocab size: 4000: เฐคเฐฟ + เฐจเฐฟ = เฐคเฐฟเฐจเฐฟ
|
313 |
+
Training BPE: 62%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1611/2599 [3:55:11<2:01:21, 7.37s/it]Merge for (284, 2403) already exists in the vocabulary.
|
314 |
+
Training BPE: 62%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1613/2599 [3:55:25<2:00:58, 7.36s/it]Merge for (303, 331) already exists in the vocabulary.
|
315 |
+
Training BPE: 63%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1630/2599 [3:57:33<2:00:49, 7.48s/it]Merge for (543, 286) already exists in the vocabulary.
|
316 |
+
Training BPE: 63%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1639/2599 [3:58:40<1:59:28, 7.47s/it]Merge for (1023, 312) already exists in the vocabulary.
|
317 |
+
Training BPE: 64%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1653/2599 [4:00:24<1:57:20, 7.44s/it]Merge for (300, 327) already exists in the vocabulary.
|
318 |
+
Training BPE: 64%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1664/2599 [4:01:47<1:57:03, 7.51s/it]Merge for (295, 321) already exists in the vocabulary.
|
319 |
+
Training BPE: 65%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1698/2599 [4:05:59<1:51:53, 7.45s/it]Merge for (313, 321) already exists in the vocabulary.
|
320 |
+
Training BPE: 65%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1699/2599 [4:06:07<1:51:03, 7.40s/it]
|
321 |
+
Vocab size: 4100: เฐน + เฑ = เฐนเฑ
|
322 |
+
Training BPE: 67%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1750/2599 [4:12:21<1:42:27, 7.24s/it]Merge for (277, 2414) already exists in the vocabulary.
|
323 |
+
Training BPE: 69%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1799/2599 [4:18:21<1:36:27, 7.23s/it]
|
324 |
+
Vocab size: 4200: เฐชเฐพเฐฐเฑ + เฐเฑ = เฐชเฐพเฐฐเฑเฐเฑ
|
325 |
+
Training BPE: 70%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1822/2599 [4:21:08<1:34:43, 7.31s/it]Merge for (294, 326) already exists in the vocabulary.
|
326 |
+
Training BPE: 70%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1830/2599 [4:22:07<1:35:05, 7.42s/it]Merge for (300, 330) already exists in the vocabulary.
|
327 |
+
Training BPE: 72%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1862/2599 [4:26:01<1:29:42, 7.30s/it]Merge for (923, 311) already exists in the vocabulary.
|
328 |
+
Training BPE: 73%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1888/2599 [4:29:10<1:26:21, 7.29s/it]Merge for (299, 2403) already exists in the vocabulary.
|
329 |
+
Training BPE: 73%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1899/2599 [4:30:30<1:24:22, 7.23s/it]
|
330 |
+
Vocab size: 4300: เฐธเฐฎ + เฐฏเฐเฐฒเฑ = เฐธเฐฎเฐฏเฐเฐฒเฑ
|
331 |
+
Training BPE: 74%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1918/2599 [4:32:47<1:22:06, 7.23s/it]Merge for (310, 258) already exists in the vocabulary.
|
332 |
+
Training BPE: 74%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1929/2599 [4:34:06<1:20:24, 7.20s/it]Merge for (300, 332) already exists in the vocabulary.
|
333 |
+
Training BPE: 77%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 1999/2599 [4:42:31<1:11:27, 7.15s/it]
|
334 |
+
Vocab size: 4400: เฐชเฑเฐฐ + เฐธเฐพ = เฐชเฑเฐฐเฐธเฐพ
|
335 |
+
Merge for (295, 320) already exists in the vocabulary.
|
336 |
+
Training BPE: 78%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2016/2599 [4:44:34<1:10:21, 7.24s/it]Merge for (923, 306) already exists in the vocabulary.
|
337 |
+
Training BPE: 78%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2019/2599 [4:44:55<1:08:38, 7.10s/it]Merge for (1064, 327) already exists in the vocabulary.
|
338 |
+
Training BPE: 79%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2043/2599 [4:47:47<1:06:17, 7.15s/it]Merge for (300, 322) already exists in the vocabulary.
|
339 |
+
Training BPE: 80%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2074/2599 [4:51:29<1:03:09, 7.22s/it]Merge for (943, 282) already exists in the vocabulary.
|
340 |
+
Training BPE: 81%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2099/2599 [4:54:26<58:57, 7.08s/it]
|
341 |
+
Vocab size: 4500: เฐ + เฐงเฑเฐฏเฐเฑเฐทเฑเฐกเฑ = เฐ เฐงเฑเฐฏเฐเฑเฐทเฑเฐกเฑ
|
342 |
+
Training BPE: 81%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2106/2599 [4:55:16<58:50, 7.16s/it]Merge for (299, 321) already exists in the vocabulary.
|
343 |
+
Training BPE: 82%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2140/2599 [4:59:19<55:56, 7.31s/it]Merge for (291, 2414) already exists in the vocabulary.
|
344 |
+
Training BPE: 83%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2158/2599 [5:01:26<52:14, 7.11s/it]Merge for (279, 327) already exists in the vocabulary.
|
345 |
+
Training BPE: 84%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2182/2599 [5:04:16<49:28, 7.12s/it]Merge for (311, 2414) already exists in the vocabulary.
|
346 |
+
Training BPE: 85%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2199/2599 [5:06:16<46:36, 6.99s/it]
|
347 |
+
Vocab size: 4600: เฐฆเฑ + เฐถเฐพ = เฐฆเฑเฐถเฐพ
|
348 |
+
Training BPE: 85%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2207/2599 [5:07:12<45:58, 7.04s/it]Merge for (312, 332) already exists in the vocabulary.
|
349 |
+
Training BPE: 86%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2234/2599 [5:10:23<43:07, 7.09s/it]Merge for (503, 283) already exists in the vocabulary.
|
350 |
+
Training BPE: 87%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2250/2599 [5:12:15<40:04, 6.89s/it]Merge for (310, 320) already exists in the vocabulary.
|
351 |
+
Training BPE: 87%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2258/2599 [5:13:11<40:00, 7.04s/it]Merge for (299, 318) already exists in the vocabulary.
|
352 |
+
Training BPE: 88%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2299/2599 [5:17:58<34:31, 6.91s/it]
|
353 |
+
Vocab size: 4700: เฐฐเฑ + เฐฒเฑ = เฐฐเฑเฐฒเฑ
|
354 |
+
Training BPE: 91%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2367/2599 [5:25:54<26:50, 6.94s/it]Merge for (843, 294) already exists in the vocabulary.
|
355 |
+
Training BPE: 92%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2399/2599 [5:29:38<23:13, 6.97s/it]
|
356 |
+
Vocab size: 4800: เฐธ + เฐฆ = เฐธเฐฆ
|
357 |
+
Training BPE: 93%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2414/2599 [5:31:22<21:35, 7.01s/it]Merge for (280, 318) already exists in the vocabulary.
|
358 |
+
Training BPE: 96%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2499/2599 [5:41:03<11:12, 6.73s/it]
|
359 |
+
Vocab size: 4900: เฐญเฐตเฐฟเฐท + เฑเฐฏ = เฐญเฐตเฐฟเฐทเฑเฐฏ
|
360 |
+
Training BPE: 96%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 2505/2599 [5:41:43<10:27, 6.67s/it]Merge for (763, 309) already exists in the vocabulary.
|
361 |
+
Training BPE: 99%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 2574/2599 [5:49:29<02:50, 6.84s/it]Merge for (313, 332) already exists in the vocabulary.
|
362 |
+
Training BPE: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 2598/2599 [5:52:10<00:08, 8.13s/it]
|
363 |
+
|
364 |
+
Final statistics:
|
365 |
+
Final vocabulary size: 4,999
|
366 |
+
Number of merges: 2,599
|
367 |
+
Final compression ratio: 8.63x
|
368 |
+
Training time: 21135.62 seconds
|
369 |
+
|
370 |
+
Tokenizer mappings saved to telugu_tokenizer_vocab.json and telugu_tokenizer_merges.json
|
371 |
+
|
372 |
+
Test Results:
|
373 |
+
Original: เฐคเฑเฐฒเฑเฐเฑ เฐญเฐพเฐท
|
374 |
+
Encoded: [4149, 4717]
|
375 |
+
Decoded: เฐคเฑเฐฒเฑเฐเฑ เฐญเฐพเฐท
|
376 |
+
Matches original: True
|
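
The merge list recorded in telugu_tokenizer_merges.json (2,599 merges here) is what makes round-trips like the test above possible. As an illustration only: this is not the repository's actual API, and the on-disk JSON layout is assumed rather than known, but applying a learned merge list at encode time typically looks like the sketch below.

# Hedged sketch of BPE merge application at encode time.
# `ordered_merges` is assumed to be a list of ((left_id, right_id), new_id)
# pairs in the order they were learned; the real telugu_tokenizer_merges.json
# may store this information differently.
def apply_merges(ids, ordered_merges):
    for (left, right), new_id in ordered_merges:
        out, i = [], 0
        while i < len(ids):
            # merge every adjacent (left, right) pair into the new token ID
            if i + 1 < len(ids) and ids[i] == left and ids[i + 1] == right:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Example: with merges ((1, 2) -> 5) and then ((5, 3) -> 6),
# the sequence [1, 2, 3, 4] collapses to [6, 4].
print(apply_merges([1, 2, 3, 4], [((1, 2), 5), ((5, 3), 6)]))  # [6, 4]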