Chaitanya Sagar Gurujula committed on
Commit 496ac89 · 1 Parent(s): bc28434

Add application file

Dockerfile ADDED
@@ -0,0 +1,12 @@
+ FROM python:3.9-slim
+
+ WORKDIR /app
+
+ COPY requirements.txt .
+ RUN pip install -r requirements.txt
+
+ COPY src/ .
+ COPY telugu_tokenizer_vocab.json .
+ COPY telugu_tokenizer_merges.json .
+
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,11 +1,144 @@
  ---
- title: Telugu Tokenizer
- emoji: 👍
- colorFrom: purple
- colorTo: yellow
  sdk: docker
  pinned: false
- short_description: Telugu tokenizer with Vocab Size 5k
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Telugu Tokenizer App
+ emoji: అ
+ colorFrom: indigo
+ colorTo: blue
  sdk: docker
+ sdk_version: "1.0"
+ app_file: app:app
  pinned: false
+ description: A BPE (Byte Pair Encoding) tokenizer for Telugu text with a 5,000-token vocabulary.
+ tags:
+ - telugu
+ - tokenizer
+ - NLP
+ - transformers
+ license: apache-2.0
+ model: telugu-tokenizer-model
+ datasets:
+ - telugu-dataset
+ isPrivate: false
  ---

+ # Telugu Tokenizer
+
+ This repository provides a tokenizer implementation for processing Telugu text, designed to handle both Telugu Unicode characters and ASCII characters. It uses a Byte Pair Encoding (BPE) approach to efficiently tokenize text and create a vocabulary optimized for Telugu language processing.
+
+ ## Features
+
+ - **Comprehensive Telugu Support**: Includes all Telugu Unicode characters (0C00-0C7F), common ligatures, and valid consonant combinations.
+ - **Base Vocabulary Creation**: Generates a base vocabulary containing ASCII, Extended ASCII, and Telugu characters.
+ - **Byte Pair Encoding (BPE)**: Trains the tokenizer to merge frequently occurring token pairs, creating an optimized vocabulary (a minimal sketch of the merge step follows this list).
+ - **Parallel Processing**: Uses multiprocessing for efficient tokenization of large text datasets.
+ - **Persistence**: Supports saving and loading the vocabulary to/from JSON files.
+
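+ The merge loop is the heart of BPE. Below is a minimal, self-contained sketch of one training step in the spirit of `get_stats`/`merge` from `src/bpe_tokenizer.py` (a toy example on integer token IDs, not the full trainer):
+
+ ```python
+ from collections import Counter
+
+ def get_stats(ids):
+     """Count frequencies of adjacent token pairs."""
+     return Counter(zip(ids, ids[1:]))
+
+ def merge(ids, pair, new_id):
+     """Replace every occurrence of `pair` with `new_id`."""
+     out, i = [], 0
+     while i < len(ids):
+         if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
+             out.append(new_id)
+             i += 2
+         else:
+             out.append(ids[i])
+             i += 1
+     return out
+
+ ids = [1, 2, 1, 2, 3]
+ stats = get_stats(ids)
+ best = max(stats, key=stats.get)  # (1, 2), the most frequent pair
+ print(merge(ids, best, 99))       # [99, 99, 3]
+ ```
+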
+ ## Requirements
+
+ The tokenizer requires the following dependencies:
+
+ - Python 3.7+
+ - tqdm
+ - pandas
+ - datasets
+
+ Install the required packages using pip:
+ ```bash
+ pip install tqdm pandas datasets
+ ```
+
+ ## Usage
+
+ ### 1. Base Vocabulary Creation
+
+ The tokenizer first generates a base vocabulary containing ASCII, Extended ASCII, and Telugu characters.
+
+ ```python
+ from bpe_tokenizer import create_base_vocab, save_base_vocab
+
+ base_vocab = create_base_vocab()
+ save_base_vocab(base_vocab, path='telugu_base_vocab.json')
+ ```
+
+ ### 2. Loading an Existing Vocabulary
+
+ You can load an existing base vocabulary from a JSON file:
+
+ ```python
+ from bpe_tokenizer import load_base_vocab
+
+ vocab = load_base_vocab('telugu_base_vocab.json')
+ ```
+
+ ### 3. Training the Tokenizer
+
+ The `BPETokenizer` class can be used to train a tokenizer on a given text input:
+
+ ```python
+ from bpe_tokenizer import BPETokenizer
+
+ text = "మీరు ఎలా ఉన్నారు?"  # Sample Telugu text
+ tokenizer = BPETokenizer(vocab_size=5000)
+ tokenizer.fit(text)
+ ```
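+
+ The trained tokenizer also exposes `encode` and `decode` (used by the web app in `src/app.py`); a quick round-trip check:
+
+ ```python
+ tokens = tokenizer.encode("తెలుగు భాష")
+ print(tokens)                    # token IDs
+ print(tokenizer.decode(tokens))  # should reproduce the input text
+ ```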
+
+ ### 4. Saving and Loading the Tokenizer
+
+ After training, save the tokenizer's vocabulary and merges:
+
+ ```python
+ tokenizer.save('telugu_tokenizer')
+ ```
+
+ Load the trained tokenizer:
+
+ ```python
+ tokenizer.load('telugu_tokenizer')
+ ```
+
+ ## Telugu Unicode Support
+
+ The tokenizer covers the full range of Telugu Unicode characters, including vowels, consonants, vowel signs, digits, and fraction symbols. Additionally, it supports:
+
+ - Common ligatures formed with Telugu consonants and vowel signs.
+ - Valid consonant combinations like `క్క`, `క్జ`, etc. (see the range-lookup sketch below).
+
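+ As a quick illustration, a small hypothetical helper built on `get_telugu_char_info()` from `bpe_tokenizer.py` can report which range a character falls in:
+
+ ```python
+ from bpe_tokenizer import get_telugu_char_info
+
+ def telugu_range(ch):
+     """Return the description of the Telugu Unicode range containing `ch`."""
+     cp = ord(ch)
+     for (start, end), desc in get_telugu_char_info().items():
+         if start <= cp <= end:
+             return desc
+     return "not in the Telugu block (0C00-0C7F)"
+
+ print(telugu_range('అ'))  # "Telugu vowels (అ to ఔ)"
+ ```
+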
+ ## File Structure
+
+ - **`bpe_tokenizer.py`**: Contains the implementation of the Telugu tokenizer.
+ - **`telugu_base_vocab.json`**: JSON file storing the base vocabulary.
+ - **`telugu_tokenizer_vocab.json`** and **`telugu_tokenizer_merges.json`**: JSON files storing the trained vocabulary and merge patterns (generated after training).
+
+ ## Results
+
+ - **Final vocabulary size**: 4,999
+ - **Final compression ratio**: 8.63x
+
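+ The compression ratio reported by `fit()` is the UTF-8 byte length of the training text divided by the final token-sequence length. Using the corpus size from the training logs (283,496,279 bytes), 8.63x corresponds to roughly 283,496,279 / 8.63 ≈ 32.9 million tokens.
+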
+ ## Logs
+
+ - [View Training Logs](./training_logs.log)
+
+ ## Performance
+
+ The tokenizer uses multiprocessing to handle large datasets efficiently: it tokenizes the text in parallel chunks and then merges token pairs iteratively to build up the vocabulary (a condensed sketch of this pipeline follows). This is a simple implementation and can be improved for large-scale datasets.
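+
+ A condensed sketch of that chunked pipeline, following `fit()` and `_process_chunk()` in `src/bpe_tokenizer.py` (function names here are illustrative; note that fixed-size byte splits can cut a multi-byte character at a chunk boundary, a limitation the original shares):
+
+ ```python
+ import os
+ from multiprocessing import Pool
+
+ def tokenize_chunk(args):
+     """Greedy longest-match tokenization of one byte chunk."""
+     chunk, byte_to_token = args
+     ids, j = [], 0
+     while j < len(chunk):
+         matched = False
+         for length in (3, 2, 1):  # Telugu UTF-8 chars are 3 bytes; ASCII is 1
+             piece = bytes(chunk[j:j + length])
+             if j + length <= len(chunk) and piece in byte_to_token:
+                 ids.append(byte_to_token[piece])
+                 j += length
+                 matched = True
+                 break
+         if not matched:
+             j += 1  # unknown byte: skip it, as the original does
+     return ids
+
+ def parallel_tokenize(data: bytes, byte_to_token: dict):
+     """Split the corpus into chunks and tokenize them across all cores."""
+     cores = os.cpu_count() or 1
+     size = max(64 * 1024, len(data) // (cores * 4))
+     chunks = [data[i:i + size] for i in range(0, len(data), size)]
+     with Pool(cores) as pool:
+         parts = pool.map(tokenize_chunk, [(c, byte_to_token) for c in chunks])
+     return [t for part in parts for t in part]
+ ```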
+
+ ## Future Enhancements
+
+ - Extend support for additional Telugu ligatures and symbols.
+ - Optimize BPE training for large-scale datasets.
+ - Provide pre-trained models for common Telugu NLP tasks.
+
+ ## License
+
+ This project is licensed under the MIT License. See the LICENSE file for more details.
+
+ ## Contributing
+
+ Contributions are welcome! Feel free to submit a pull request or open an issue if you encounter bugs or have suggestions for improvement.
+
+ ## Acknowledgments
+
+ - Unicode Consortium for Telugu Unicode character information.
+ - Community contributions to Telugu NLP development.
+
+ ---
+
+ Feel free to explore the tokenizer and adapt it for your Telugu language processing needs. Happy coding!
+
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ fastapi==0.68.0
+ uvicorn==0.15.0
+ jinja2==3.0.1
+ python-multipart==0.0.5
+ datasets==2.12.0
+ tqdm==4.65.0
+ aiofiles==0.8.0
+ pandas==2.2.3
src/__pycache__/bpe_tokenizer.cpython-312.pyc ADDED
Binary file (42.6 kB).
 
src/app.py ADDED
@@ -0,0 +1,123 @@
+ from fastapi import FastAPI, Request
+ from fastapi.responses import HTMLResponse
+ from fastapi.templating import Jinja2Templates
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel
+ from bpe_tokenizer import BPETokenizer, create_base_vocab
+ import os
+ import json
+
+ # Get the absolute path to the templates directory
+ TEMPLATES_DIR = os.path.join(os.path.dirname(__file__), "templates")
+
+ app = FastAPI(title="Telugu BPE Tokenizer")
+
+ # Add CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Templates with absolute path
+ templates = Jinja2Templates(directory=TEMPLATES_DIR)
+
+ # Initialize tokenizer
+ tokenizer = BPETokenizer(vocab_size=5000)
+
+ # Load the vocabulary file directly (id -> {'text', 'bytes', 'is_base'})
+ print("Loading vocabulary...")
+ vocab_file = 'telugu_tokenizer_vocab.json'
+ with open(vocab_file, 'r', encoding='utf-8') as f:
+     vocab_data = json.load(f)
+
+ class TokenizeRequest(BaseModel):
+     text: str
+
+ @app.get("/", response_class=HTMLResponse)
+ async def home(request: Request):
+     return templates.TemplateResponse(
+         "index.html",
+         {"request": request, "title": "Telugu BPE Tokenizer"}
+     )
+
+ @app.post("/tokenize")
+ async def tokenize(request: TokenizeRequest):
+     text = request.text
+     try:
+         tokens = tokenizer.encode(text)
+         decoded = tokenizer.decode(tokens)
+
+         # Get token details from vocabulary for display.
+         # Walk the original UTF-8 bytes word by word and align each word
+         # with the run of tokens whose decoded bytes cover it.
+         token_details = []
+         current_position = 0
+         current_byte_position = 0
+         text_bytes = text.encode('utf-8')
+
+         while current_position < len(tokens):
+             # Skip leading spaces in original text
+             while current_byte_position < len(text_bytes) and text_bytes[current_byte_position] == 32:
+                 current_byte_position += 1
+
+             # Get next word from original text
+             word_start = current_byte_position
+             word_end = word_start
+             while word_end < len(text_bytes) and text_bytes[word_end] != 32:
+                 word_end += 1
+
+             word_bytes = text_bytes[word_start:word_end]
+             word = word_bytes.decode('utf-8')
+
+             # Collect tokens for this word
+             word_tokens = []
+             decoded_bytes = b''
+
+             while current_position < len(tokens):
+                 token = tokens[current_position]
+                 token_bytes = tokenizer.vocab[token]
+
+                 # If we've collected enough bytes for the word (plus possible space)
+                 if len(decoded_bytes) >= len(word_bytes):
+                     break
+
+                 word_tokens.append(token)
+                 decoded_bytes += token_bytes
+                 current_position += 1
+
+             # Update byte position for next word
+             current_byte_position = word_end
+
+             # Add word and its tokens to details
+             token_details.append({
+                 "word": word,
+                 "type": "subword_tokens",
+                 "tokens": [{
+                     "id": t,
+                     "text": vocab_data.get(str(t), {}).get('text', '[UNKNOWN]')
+                 } for t in word_tokens]
+             })
+
+         return {
+             "original": text,
+             "tokens": tokens,
+             "token_details": token_details,
+             "decoded": decoded,
+             "matches": text == decoded
+         }
+     except Exception as e:
+         print(f"Error: {str(e)}")
+         return {"error": str(e)}
+
+ @app.get("/vocab")
+ async def get_vocab():
+     return {
+         "vocab_size": len(vocab_data),
+         "base_vocab_size": sum(1 for info in vocab_data.values() if info.get('is_base', False)),
+         "num_merges": len(getattr(tokenizer, 'merges', {}))
+     }
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="127.0.0.1", port=8001)
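
The `/tokenize` endpoint accepts JSON of the shape `{"text": ...}`; for example, against the Dockerized app on port 7860 (use port 8001 when running this file directly):

```bash
curl -s -X POST http://localhost:7860/tokenize \
  -H 'Content-Type: application/json' \
  -d '{"text": "తెలుగు భాష"}'
```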
src/bpe_tokenizer.py ADDED
@@ -0,0 +1,660 @@
+ from tqdm import tqdm
+ from collections import Counter
+ import json
+ from datasets import load_dataset
+ import time
+ import os
+ import re
+ import pandas as pd
+ from multiprocessing import Pool
+ import array
+
+ def get_telugu_char_info():
+     """
+     Returns a dictionary of Telugu Unicode ranges with their descriptions.
+     Based on Unicode 13.0 Telugu block (0C00-0C7F).
+     """
+     return {
+         (0x0C00, 0x0C03): "Various forms of Telugu anusvara and visarga",
+         (0x0C05, 0x0C14): "Telugu vowels (అ to ఔ)",
+         (0x0C15, 0x0C39): "Telugu consonants (క to హ)",
+         (0x0C3D, 0x0C44): "Telugu vowel signs (ఽ to ౄ)",
+         (0x0C46, 0x0C48): "Telugu vowel signs (ె to ై)",
+         (0x0C4A, 0x0C4D): "Telugu vowel signs and virama (ొ to ్)",
+         (0x0C55, 0x0C56): "Telugu length marks",
+         (0x0C58, 0x0C5A): "Additional Telugu consonants",
+         (0x0C60, 0x0C63): "Telugu vocalic letters",
+         (0x0C66, 0x0C6F): "Telugu digits (౦ to ౯)",
+         (0x0C78, 0x0C7F): "Telugu fraction symbols"
+     }
+
+ def create_base_vocab():
+     """Create a base vocabulary with ASCII, Telugu characters, and common ligatures."""
+     vocab = {}
+     token_id = 0
+     existing_tokens = set()  # Set to track existing tokens
+
+     # Add ASCII characters (0-127)
+     print("Adding ASCII characters...")
+     for i in range(128):
+         char_bytes = bytes([i])
+         try:
+             char = char_bytes.decode('utf-8', errors='strict')
+             vocab[token_id] = {
+                 'text': char,
+                 'bytes': list(char_bytes),
+                 'type': 'ASCII',
+                 'description': f"ASCII character: {repr(char)}"
+             }
+             token_id += 1
+         except UnicodeDecodeError:
+             continue
+
+     # Add Extended ASCII characters (128-255). Note: a lone byte in this
+     # range is never valid UTF-8, so these always take the except branch
+     # and are stored as byte representations.
+     print("Adding Extended ASCII characters...")
+     for i in range(128, 256):
+         char_bytes = bytes([i])
+         try:
+             # Try to decode as UTF-8 first
+             char = char_bytes.decode('utf-8', errors='strict')
+             vocab[token_id] = {
+                 'text': char if char.isprintable() else f"<{hex(i)[2:].upper()}>",
+                 'bytes': list(char_bytes),
+                 'type': 'Extended ASCII',
+                 'description': f"Extended ASCII character: {char} ({hex(i)})"
+             }
+         except UnicodeDecodeError:
+             # If not valid UTF-8, store as bytes representation
+             vocab[token_id] = {
+                 'text': f"[Bytes: {list(char_bytes)}]",
+                 'bytes': list(char_bytes),
+                 'type': 'Extended ASCII',
+                 'description': f"Extended ASCII byte: {hex(i)}"
+             }
+         token_id += 1
+
+     # Add Telugu Unicode characters (0C00-0C7F)
+     print("Adding Telugu characters...")
+     telugu_info = get_telugu_char_info()
+
+     for i in range(0x0C00, 0x0C7F + 1):
+         try:
+             char = chr(i)
+             char_bytes = char.encode('utf-8')
+             # Only add if it's a valid character
+             char.encode('utf-8').decode('utf-8')
+
+             # Find the character's category
+             char_type = "Other Telugu Character"
+             char_description = "Telugu character"
+             for (start, end), desc in telugu_info.items():
+                 if start <= i <= end:
+                     char_type = desc
+                     char_description = f"Telugu character: {char} ({hex(i)})"
+                     break
+
+             vocab[token_id] = {
+                 'text': char,
+                 'bytes': list(char_bytes),
+                 'type': char_type,
+                 'description': char_description
+             }
+             token_id += 1
+         except UnicodeEncodeError:
+             continue
+
+     # Define Telugu consonants and vowel signs
+     consonants = [
+         'క', 'ఖ', 'గ', 'ఘ', 'ఙ', 'చ', 'ఛ', 'జ', 'ఝ', 'ఞ',
+         'ట', 'ఠ', 'డ', 'ఢ', 'ణ', 'త', 'థ', 'ద', 'ధ', 'న',
+         'ప', 'ఫ', 'బ', 'భ', 'మ', 'య', 'ర', 'ల', 'వ', 'శ',
+         'ష', 'స', 'హ', 'ళ', 'క్ష', 'ఱ'
+     ]
+
+     vowel_signs = [
+         '', 'ా', 'ి', 'ీ', 'ు', 'ూ', 'ృ', 'ౄ', 'ౢ', 'ౣ', 'ె', 'ే', 'ై', 'ొ', 'ో', 'ౌ', 'ం', 'ః', 'ఁ', '్'
+     ]
+
+     # Add common Telugu ligatures with existing vowel signs
+     print("Adding common Telugu ligatures with existing vowel signs...")
+     for consonant in consonants:
+         for vowel_sign in vowel_signs:
+             ligature = consonant + vowel_sign
+             if ligature not in existing_tokens:  # Check for duplicates
+                 char_bytes = ligature.encode('utf-8')
+                 vocab[token_id] = {
+                     'text': ligature,
+                     'bytes': list(char_bytes),
+                     'type': 'Ligature',
+                     'description': f"Telugu ligature: {ligature}"
+                 }
+                 existing_tokens.add(ligature)  # Add to the set
+                 token_id += 1
+
+     # Add valid consonant combinations: every consonant joined to every
+     # consonant with a virama (్), e.g. క + ్ + క = క్క
+     print("Adding valid consonant combinations...")
+     valid_consonant_combinations = [
+         c1 + '్' + c2 for c1 in consonants for c2 in consonants
+         # Add more valid combinations as needed
+     ]
+
+     for combination in valid_consonant_combinations:
+         if combination not in existing_tokens:  # Check for duplicates
+             char_bytes = combination.encode('utf-8')
+             vocab[token_id] = {
+                 'text': combination,
+                 'bytes': list(char_bytes),
+                 'type': 'Ligature',
+                 'description': f"Telugu ligature: {combination}"
+             }
+             existing_tokens.add(combination)  # Add to the set
+             token_id += 1
+
+     print(f"Created base vocabulary with {len(vocab)} tokens")
+     return vocab
+
+ def save_base_vocab(vocab, path='telugu_base_vocab.json'):
+     """Save the base vocabulary with character information."""
+     # Sort by character type for better readability
+     sorted_vocab = {}
+     for k, v in sorted(vocab.items(), key=lambda x: (x[1]['type'], x[0])):
+         sorted_vocab[str(k)] = v
+
+     with open(path, 'w', encoding='utf-8') as f:
+         json.dump(sorted_vocab, f, ensure_ascii=False, indent=2)
+     print(f"Base vocabulary saved to {path}")
+
+ def load_base_vocab(path='telugu_base_vocab.json'):
+     """Load the base vocabulary."""
+     with open(path, 'r', encoding='utf-8') as f:
+         vocab = json.load(f)
+     return {int(k): bytes(v['bytes']) for k, v in vocab.items()}
+
+ class BPETokenizer:
+     def __init__(self, vocab_size=5000, sample_size=None):
+         self.vocab_size = vocab_size
+         self.sample_size = sample_size
+
+         # First try to load trained vocabulary
+         trained_vocab_path = 'telugu_tokenizer_vocab.json'
+         if os.path.exists(trained_vocab_path):
+             print("Loading trained vocabulary...")
+             self.load('telugu_tokenizer')  # This loads both vocab and merges
+             return
+
+         # If no trained vocab exists, fall back to base vocabulary
+         base_vocab_path = 'telugu_base_vocab.json'
+         if os.path.exists(base_vocab_path):
+             print("Loading existing base vocabulary...")
+             self.vocab = load_base_vocab(base_vocab_path)
+         else:
+             print("Creating new base vocabulary...")
+             base_vocab = create_base_vocab()
+             save_base_vocab(base_vocab)
+             self.vocab = load_base_vocab(base_vocab_path)
+
+         self.base_vocab_size = len(self.vocab)
+         self.merges = {}
+
+     def get_stats(self, ids):
+         """Count token pair frequencies."""
+         counts = {}
+         for pair in zip(ids, ids[1:]):
+             counts[pair] = counts.get(pair, 0) + 1
+         return counts
+
+     def merge(self, ids, pair, idx):
+         """Merge all occurrences of a token pair."""
+         # Create the merged token
+         merged_token = self.vocab[pair[0]] + self.vocab[pair[1]]
+
+         # Check if the merged token already exists in the vocabulary
+         for existing_id, existing_token in self.vocab.items():
+             if existing_token == merged_token:
+                 # Instead of skipping, use the existing token ID for merging
+                 print(f"Merge for {pair} already exists in the vocabulary.")
+                 newids = []
+                 i = 0
+                 while i < len(ids):
+                     if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
+                         newids.append(existing_id)
+                         i += 2
+                     else:
+                         newids.append(ids[i])
+                         i += 1
+                 return newids
+
+         # If we get here, the merged token doesn't exist yet
+         newids = []
+         i = 0
+         while i < len(ids):
+             if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
+                 newids.append(idx)
+                 i += 2
+             else:
+                 newids.append(ids[i])
+                 i += 1
+         return newids
+
+     def _process_chunk(self, args):
+         """Process a chunk of text for parallel processing."""
+         chunk, byte_to_token = args
+         ids = array.array('I')  # Unsigned int array
+         j = 0
+         while j < len(chunk):
+             if chunk[j] == 32:  # Space
+                 ids.append(32)
+                 j += 1
+                 continue
+
+             # Greedy longest match: Telugu characters are 3 UTF-8 bytes,
+             # so try 3-, 2-, then 1-byte sequences.
+             found = False
+             for length in [3, 2, 1]:
+                 if j + length <= len(chunk):
+                     char_bytes = bytes(chunk[j:j+length])
+                     if char_bytes in byte_to_token:
+                         ids.append(byte_to_token[char_bytes])
+                         j += length
+                         found = True
+                         break
+             if not found:
+                 j += 1
+         return ids
+
+     def fit(self, text):
+         """Train the BPE tokenizer."""
+         print("Converting text to token IDs using base vocabulary...")
+
+         original_bytes = text.encode('utf-8')
+         original_length = len(original_bytes)
+         print(f"\nBefore training: text bytes length: {original_length:,}")
+
+         # Pre-compute byte sequences for faster lookup
+         byte_to_token = {token_bytes: token_id for token_id, token_bytes in self.vocab.items()}
+
+         # Parallel processing of chunks
+         num_cores = os.cpu_count() or 1
+         chunk_size = max(1024 * 64, len(original_bytes) // (num_cores * 4))  # Larger chunks
+         chunks = [original_bytes[i:i + chunk_size] for i in range(0, len(original_bytes), chunk_size)]
+
+         print(f"Processing {len(chunks)} chunks using {num_cores} cores...")
+
+         # Process chunks in parallel
+         with Pool(num_cores) as pool:
+             chunk_results = list(tqdm(
+                 pool.imap(self._process_chunk, [(chunk, byte_to_token) for chunk in chunks]),
+                 total=len(chunks),
+                 desc="Initial tokenization"
+             ))
+
+         # Combine results
+         ids = array.array('I')
+         for result in chunk_results:
+             ids.extend(result)
+
+         print(f"\nBase vocabulary size: {self.base_vocab_size}")
+         print(f"Initial sequence length: {len(ids)}")
+
+         # Keep training until we reach the target vocab size
+         target_vocab_size = self.vocab_size
+         pbar = tqdm(total=target_vocab_size - self.base_vocab_size, desc="Training BPE")
+         last_vocab_size = len(self.vocab)
+
+         while len(self.vocab) < target_vocab_size:
+             stats = self.get_stats(ids)
+             if not stats:
+                 print("No more pairs to merge.")
+                 break
+
+             pair = max(stats, key=stats.get)
+             idx = len(self.vocab)
+             ids = self.merge(ids, pair, idx)
+
+             # Only update progress when vocabulary actually grows
+             if len(self.vocab) > last_vocab_size:
+                 pbar.update(len(self.vocab) - last_vocab_size)
+                 last_vocab_size = len(self.vocab)
+
+             # Add the merged token to the vocabulary
+             if pair not in self.merges:  # Ensure we don't overwrite existing merges
+                 self.merges[pair] = idx
+                 self.vocab[idx] = self.vocab[pair[0]] + self.vocab[pair[1]]
+
+             # Print progress periodically
+             if len(self.vocab) % 100 == 0:
+                 try:
+                     text0 = self.vocab[pair[0]].decode('utf-8')
+                     text1 = self.vocab[pair[1]].decode('utf-8')
+                     merged = self.vocab[idx].decode('utf-8')
+                     print(f"\nVocab size: {len(self.vocab)}: {text0} + {text1} = {merged}")
+                 except UnicodeDecodeError:
+                     continue
+
+         pbar.close()
+         print("\nFinal statistics:")
+         print(f"Final vocabulary size: {len(self.vocab):,}")
+         print(f"Number of merges: {len(self.merges):,}")
+         print(f"Final compression ratio: {original_length / len(ids):.2f}x")
+
+     def encode(self, text):
+         """Encode text to token IDs."""
+         final_tokens = []
+         i = 0
+         text_bytes = text.encode('utf-8')
+
+         while i < len(text_bytes):
+             # If we're at a leading space, encode it separately
+             if text_bytes[i] == 32:  # ASCII space
+                 final_tokens.append(32)  # Space token
+                 i += 1
+                 continue
+
+             # Try to find the longest matching sequence (including potential trailing spaces)
+             longest_match = None
+             longest_length = 0
+             matched_token = None
+
+             # Sort vocab items by length (longest first)
+             for token_id, token_bytes in sorted(self.vocab.items(),
+                                                 key=lambda x: len(x[1]),
+                                                 reverse=True):
+                 if (i + len(token_bytes) <= len(text_bytes) and
+                         text_bytes[i:i+len(token_bytes)] == token_bytes):
+                     longest_length = len(token_bytes)
+                     longest_match = token_bytes
+                     matched_token = token_id
+                     break
+
+             if longest_match:
+                 final_tokens.append(matched_token)
+                 i += longest_length
+             else:
+                 # If no match found, fall back to single byte
+                 for token_id, token_bytes in self.vocab.items():
+                     if token_bytes == bytes([text_bytes[i]]):
+                         final_tokens.append(token_id)
+                         break
+                 i += 1
+
+         return final_tokens
+
+     def decode(self, tokens):
+         """Decode token IDs back to text."""
+         bytes_tokens = b''.join(self.vocab[idx] for idx in tokens)
+         return bytes_tokens.decode('utf-8')
+
+     def save(self, path):
+         """Save the tokenizer mappings to files."""
+         base_path = path.rsplit('.', 1)[0]
+
+         # Save vocabulary with human-readable form
+         vocab_mapping = {}
+         for token_id, byte_seq in self.vocab.items():
+             try:
+                 text = byte_seq.decode('utf-8')
+                 vocab_mapping[token_id] = {
+                     'text': text,
+                     'bytes': list(byte_seq),
+                     'is_base': token_id < self.base_vocab_size
+                 }
+             except UnicodeDecodeError:
+                 vocab_mapping[token_id] = {
+                     'text': f"[Bytes: {list(byte_seq)}]",
+                     'bytes': list(byte_seq),
+                     'is_base': token_id < self.base_vocab_size
+                 }
+
+         # Save merge patterns with human-readable form
+         merge_patterns = {}
+         for (p0, p1), idx in self.merges.items():
+             try:
+                 text0 = self.vocab[p0].decode('utf-8')
+                 text1 = self.vocab[p1].decode('utf-8')
+                 merged = self.vocab[idx].decode('utf-8')
+                 merge_patterns[idx] = {
+                     'parts': [text0, text1],
+                     'result': merged,
+                     'token_ids': [p0, p1]
+                 }
+             except UnicodeDecodeError:
+                 merge_patterns[idx] = {
+                     'parts': [f"Token_{p0}", f"Token_{p1}"],
+                     'result': f"Token_{idx}",
+                     'token_ids': [p0, p1]
+                 }
+
+         with open(f"{base_path}_vocab.json", 'w', encoding='utf-8') as f:
+             json.dump(vocab_mapping, f, ensure_ascii=False, indent=2)
+
+         with open(f"{base_path}_merges.json", 'w', encoding='utf-8') as f:
+             json.dump(merge_patterns, f, ensure_ascii=False, indent=2)
+
+         print(f"\nTokenizer mappings saved to {base_path}_vocab.json and {base_path}_merges.json")
+
+     def load(self, path):
+         """Load the tokenizer from mapping files."""
+         base_path = path.rsplit('.', 1)[0]
+
+         with open(f"{base_path}_vocab.json", 'r', encoding='utf-8') as f:
+             vocab_mapping = json.load(f)
+             self.vocab = {
+                 int(k): bytes(v['bytes'])
+                 for k, v in vocab_mapping.items()
+             }
+             # Find base vocabulary size
+             self.base_vocab_size = sum(1 for k, v in vocab_mapping.items() if v['is_base'])
+
+         with open(f"{base_path}_merges.json", 'r', encoding='utf-8') as f:
+             merge_patterns = json.load(f)
+             self.merges = {
+                 tuple(v['token_ids']): int(k)
+                 for k, v in merge_patterns.items()
+             }
+
+         self.vocab_size = len(self.vocab)
+         print(f"Loaded tokenizer from {base_path}_*.json files")
+
+     def train_on_dataset(self):
+         """Train tokenizer on the Telugu news dataset."""
+         print("Loading dataset...")
+         try:
+             # Load the local parquet file
+             dataset = pd.read_parquet('telugu_news_dataset.parquet')
+
+             print("Preparing training text...")
+             training_text = []
+
+             for _, row in tqdm(dataset.iterrows(), desc="Loading documents", total=len(dataset)):
+                 if not pd.isna(row["headline"]): training_text.append(row["headline"])
+                 if not pd.isna(row["article"]): training_text.append(row["article"])
+
+                 if self.sample_size and len(training_text) >= self.sample_size:
+                     print(f"Using first {self.sample_size} documents for training")
+                     break
+
+             full_text = "\n".join(training_text)
+             print(f"\nTraining on {len(training_text)} documents...")
+             print(f"Total characters in training data: {len(full_text):,}")
+
+             start_time = time.time()
+             self.fit(full_text)
+             print(f"Training time: {time.time() - start_time:.2f} seconds")
+
+         except Exception as e:
+             print(f"Error loading dataset: {str(e)}")
+             print("Falling back to sample text...")
+             sample_text = """
+             తెలుగు భాష దక్షిణ భారతదేశంలోని ద్రావిడ భాషల్లో ఒకటి.
+             ఆంధ్ర ప్రదేశ్ మరియు తెలంగాణ రాష్ట్రాల అధికార భాష.
+             """
+             self.fit(sample_text)
+
+
+ if __name__ == "__main__":
+     # For quick testing, use a small sample
+     tokenizer = BPETokenizer(vocab_size=4999, sample_size=None)
+
+     vocab_file = 'telugu_tokenizer_vocab.json'
+     merges_file = 'telugu_tokenizer_merges.json'
+
+     if os.path.exists(vocab_file) and os.path.exists(merges_file):
+         print("Loading pre-trained tokenizer...")
+         tokenizer.load('telugu_tokenizer')
+     else:
+         print("Training new tokenizer...")
+         tokenizer.train_on_dataset()
+         tokenizer.save('telugu_tokenizer')
+
+     # Test the tokenizer
+     test_text = "తెలుగు భాష"
+     encoded = tokenizer.encode(test_text)
+     decoded = tokenizer.decode(encoded)
+
+     print("\nTest Results:")
+     print(f"Original: {test_text}")
+     print(f"Encoded: {encoded}")
+     print(f"Decoded: {decoded}")
+     print(f"Matches original: {test_text == decoded}")
src/templates/index.html ADDED
@@ -0,0 +1,134 @@
+ <!DOCTYPE html>
+ <html>
+ <head>
+     <title>{{ title }}</title>
+     <script src="https://cdn.tailwindcss.com"></script>
+ </head>
+ <body class="bg-gray-100">
+     <div class="container mx-auto px-4 py-8">
+         <h1 class="text-3xl font-bold mb-8">Telugu BPE Tokenizer</h1>
+
+         <div class="bg-white rounded-lg shadow p-6">
+             <textarea
+                 id="input-text"
+                 class="w-full p-2 border rounded mb-4"
+                 rows="4"
+                 placeholder="Enter Telugu text here..."></textarea>
+
+             <button
+                 onclick="tokenize()"
+                 class="bg-blue-500 text-white px-4 py-2 rounded hover:bg-blue-600">
+                 Tokenize
+             </button>
+
+             <div id="result" class="mt-6 hidden">
+                 <h2 class="text-xl font-semibold mb-2">Results:</h2>
+                 <div class="space-y-4">
+                     <div>
+                         <span class="font-medium">Tokens:</span>
+                         <pre id="tokens" class="bg-gray-100 p-2 rounded mt-1"></pre>
+                     </div>
+                     <div>
+                         <span class="font-medium">Decoded:</span>
+                         <pre id="decoded" class="bg-gray-100 p-2 rounded mt-1"></pre>
+                     </div>
+                     <div>
+                         <span class="font-medium">Token Details:</span>
+                         <div id="token-details" class="bg-gray-100 p-2 rounded mt-1 overflow-x-auto">
+                             <table class="min-w-full bg-white border rounded-lg overflow-hidden table-fixed">
+                                 <thead class="bg-gray-100">
+                                     <tr>
+                                         <th class="px-4 py-2 text-left w-1/4">Word</th>
+                                         <th class="px-4 py-2 text-left w-1/4">Type</th>
+                                         <th class="px-4 py-2 text-left w-2/4">Token Details</th>
+                                     </tr>
+                                 </thead>
+                                 <tbody id="token-details-body">
+                                     <!-- Token details will be inserted here -->
+                                 </tbody>
+                             </table>
+                         </div>
+                     </div>
+                     <div id="match-result"></div>
+                 </div>
+             </div>
+         </div>
+     </div>
+
+     <script>
+         async function tokenize() {
+             const text = document.getElementById('input-text').value;
+             try {
+                 const response = await fetch('/tokenize', {
+                     method: 'POST',
+                     headers: {
+                         'Content-Type': 'application/json',
+                     },
+                     body: JSON.stringify({ text }),
+                 });
+
+                 const data = await response.json();
+
+                 document.getElementById('result').classList.remove('hidden');
+                 document.getElementById('tokens').textContent = JSON.stringify(data.tokens, null, 2);
+                 document.getElementById('decoded').textContent = data.decoded;
+
+                 // Display token details
+                 const detailsBody = document.getElementById('token-details-body');
+                 detailsBody.innerHTML = '';
+
+                 data.token_details.forEach(detail => {
+                     const row = document.createElement('tr');
+                     row.className = 'border-b hover:bg-gray-50';
+
+                     // Create table cells
+                     const wordCell = document.createElement('td');
+                     const typeCell = document.createElement('td');
+                     const tokenCell = document.createElement('td');
+
+                     // Set cell classes for vertical alignment and wrapping
+                     wordCell.className = 'px-4 py-2 align-top font-mono border-r';
+                     typeCell.className = 'px-4 py-2 align-top border-r';
+                     tokenCell.className = 'px-4 py-2 align-top font-mono';
+
+                     // Set content
+                     wordCell.textContent = detail.word;
+                     typeCell.textContent = detail.type;
+
+                     // Create a container for token details to ensure proper spacing
+                     const tokenList = document.createElement('div');
+                     tokenList.className = 'space-y-1';
+
+                     // Note: the server currently only emits 'subword_tokens';
+                     // the 'complete_word' branch is kept for future use.
+                     if (detail.type === 'complete_word') {
+                         const tokenDiv = document.createElement('div');
+                         tokenDiv.textContent = `ID ${detail.token_id}: "${detail.text}"`;
+                         tokenList.appendChild(tokenDiv);
+                     } else if (detail.type === 'subword_tokens') {
+                         detail.tokens.forEach(t => {
+                             const tokenDiv = document.createElement('div');
+                             tokenDiv.textContent = `ID ${t.id}: "${t.text}"`;
+                             tokenList.appendChild(tokenDiv);
+                         });
+                     }
+
+                     tokenCell.appendChild(tokenList);
+
+                     // Add cells to row
+                     row.appendChild(wordCell);
+                     row.appendChild(typeCell);
+                     row.appendChild(tokenCell);
+
+                     detailsBody.appendChild(row);
+                 });
+
+                 const matchEl = document.getElementById('match-result');
+                 matchEl.textContent = data.matches ? '✅ Perfect match!' : '❌ Mismatch';
+                 matchEl.className = data.matches ? 'text-green-600' : 'text-red-600';
+             } catch (error) {
+                 console.error('Error:', error);
+                 alert('Error tokenizing text: ' + error.message);
+             }
+         }
+     </script>
+ </body>
+ </html>
telugu_base_vocab.json ADDED
The diff for this file is too large to render.
 
telugu_tokenizer_merges.json ADDED
The diff for this file is too large to render.
 
telugu_tokenizer_vocab.json ADDED
The diff for this file is too large to render.
 
training_logs.log ADDED
@@ -0,0 +1,376 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ (session10) (base) Chaitanyas-MacBook-Pro:telugu-tokenizer chaitanyasagargurujula$ python src/bpe_tokenizer.py
2
+ Loading existing base vocabulary...
3
+ Training new tokenizer...
4
+ Loading dataset...
5
+ Preparing training text...
6
+ Loading documents: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 83866/83866 [00:00<00:00, 88094.70it/s]
7
+
8
+ Training on 167732 documents...
9
+ Total characters in training data: 105,279,512
10
+ Converting text to token IDs using base vocabulary...
11
+
12
+ Before training: text bytes length: 283,496,279
13
+ Processing 45 chunks using 11 cores...
14
+ Initial tokenization: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 45/45 [00:04<00:00, 9.95it/s]
15
+
16
+ Base vocabulary size: 2400
17
+ Initial sequence length: 105836015
+ Training BPE: 0%| 1/2599 [00:37<26:47:26, 37.12s/it] Merge for (304, 333) already exists in the vocabulary.
+ Training BPE: 0%| 4/2599 [01:26<13:45:51, 19.10s/it] Merge for (296, 333) already exists in the vocabulary.
+ Training BPE: 0%| 6/2599 [01:57<12:20:12, 17.13s/it] Merge for (312, 333) already exists in the vocabulary.
+ Training BPE: 1%| 16/2599 [04:29<10:44:21, 14.97s/it] Merge for (783, 296) already exists in the vocabulary.
+ Training BPE: 1%| 19/2599 [05:13<10:29:00, 14.63s/it] Merge for (296, 319) already exists in the vocabulary.
+ Training BPE: 1%| 23/2599 [06:10<10:13:44, 14.30s/it] Merge for (277, 333) already exists in the vocabulary.
+ Training BPE: 1%| 27/2599 [07:06<10:01:51, 14.04s/it] Merge for (309, 319) already exists in the vocabulary.
+ Training BPE: 1%| 29/2599 [07:33<9:54:13, 13.87s/it] Merge for (282, 327) already exists in the vocabulary.
+ Training BPE: 1%| 34/2599 [08:41<9:39:29, 13.56s/it] Merge for (302, 318) already exists in the vocabulary.
+ Training BPE: 1%| 38/2599 [09:35<9:36:41, 13.51s/it] Merge for (304, 318) already exists in the vocabulary.
+ Training BPE: 2%| 39/2599 [09:48<9:34:36, 13.47s/it] Merge for (298, 2403) already exists in the vocabulary.
+ Training BPE: 2%| 41/2599 [10:15<9:31:03, 13.39s/it] Merge for (1023, 292) already exists in the vocabulary.
+ Training BPE: 2%| 43/2599 [10:41<9:25:50, 13.28s/it] Merge for (292, 333) already exists in the vocabulary.
+ Training BPE: 2%| 48/2599 [11:46<9:13:28, 13.02s/it] Merge for (277, 321) already exists in the vocabulary.
+ Training BPE: 2%| 50/2599 [12:12<9:08:03, 12.90s/it] Merge for (304, 319) already exists in the vocabulary.
+ Training BPE: 2%| 55/2599 [13:16<9:04:11, 12.83s/it] Merge for (309, 318) already exists in the vocabulary.
+ Training BPE: 2%| 58/2599 [13:54<8:59:41, 12.74s/it] Merge for (294, 333) already exists in the vocabulary.
+ Training BPE: 2%| 61/2599 [14:32<8:56:47, 12.69s/it] Merge for (306, 2412) already exists in the vocabulary.
+ Training BPE: 3%| 66/2599 [15:34<8:46:39, 12.47s/it] Merge for (292, 319) already exists in the vocabulary.
+ Training BPE: 3%| 68/2599 [15:59<8:43:38, 12.41s/it] Merge for (287, 2412) already exists in the vocabulary.
+ Training BPE: 3%| 69/2599 [16:12<8:43:06, 12.41s/it] Merge for (304, 321) already exists in the vocabulary.
+ Training BPE: 3%| 70/2599 [16:24<8:41:57, 12.38s/it] Merge for (287, 2438) already exists in the vocabulary.
+ Training BPE: 3%| 72/2599 [16:48<8:38:32, 12.31s/it] Merge for (403, 311) already exists in the vocabulary.
+ Training BPE: 3%| 75/2599 [17:25<8:35:33, 12.26s/it] Merge for (296, 321) already exists in the vocabulary.
+ Training BPE: 3%| 76/2599 [17:37<8:34:11, 12.23s/it] Merge for (289, 319) already exists in the vocabulary.
+ Training BPE: 3%| 77/2599 [17:49<8:30:31, 12.15s/it] Merge for (309, 327) already exists in the vocabulary.
+ Training BPE: 3%| 78/2599 [18:01<8:30:18, 12.15s/it] Merge for (298, 2457) already exists in the vocabulary.
+ Training BPE: 3%| 80/2599 [18:26<8:28:40, 12.12s/it] Merge for (277, 318) already exists in the vocabulary.
+ Training BPE: 3%| 83/2599 [19:02<8:32:24, 12.22s/it] Merge for (282, 333) already exists in the vocabulary.
+ Training BPE: 3%| 84/2599 [19:15<8:33:27, 12.25s/it] Merge for (277, 331) already exists in the vocabulary.
+ Training BPE: 3%| 86/2599 [19:39<8:31:13, 12.21s/it] Merge for (289, 333) already exists in the vocabulary.
+ Training BPE: 3%| 90/2599 [20:27<8:25:58, 12.10s/it] Merge for (277, 330) already exists in the vocabulary.
+ Training BPE: 4%| 91/2599 [20:39<8:25:16, 12.09s/it] Merge for (300, 318) already exists in the vocabulary.
+ Training BPE: 4%| 94/2599 [21:15<8:23:51, 12.07s/it] Merge for (298, 328) already exists in the vocabulary.
+ Training BPE: 4%| 96/2599 [21:39<8:21:06, 12.01s/it] Merge for (1023, 287) already exists in the vocabulary.
+ Training BPE: 4%| 99/2599 [22:15<8:13:43, 11.85s/it]
+ Vocab size: 2500: ం + బ = ంబ
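The repeated "Merge for (a, b) already exists in the vocabulary" messages suggest that when the most frequent pair already has a token (for example, one seeded as a ligature in the base vocabulary), the trainer reuses that ID rather than minting a new one. A sketch of one merge step with such a guard; `get_stats`, `merges`, and `vocab` are illustrative names, not necessarily the repository's:

```python
from collections import Counter

def get_stats(ids):
    # Frequency of each adjacent token-ID pair in the sequence.
    return Counter(zip(ids, ids[1:]))

def apply_merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_step(ids, merges, vocab):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)  # most frequent adjacent pair
    if pair in merges:
        # Pair already has a token (e.g. pre-seeded), so reuse its ID.
        print(f"Merge for {pair} already exists in the vocabulary.")
        new_id = merges[pair]
    else:
        new_id = len(vocab)
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
    return apply_merge(ids, pair, new_id)
```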
+ Merge for (298, 318) already exists in the vocabulary.
+ Training BPE: 4%| 100/2599 [22:27<8:15:33, 11.90s/it] Merge for (306, 331) already exists in the vocabulary.
+ Training BPE: 4%| 104/2599 [23:14<8:14:28, 11.89s/it] Merge for (298, 331) already exists in the vocabulary.
+ Training BPE: 4%| 106/2599 [23:38<8:11:43, 11.83s/it] Merge for (307, 2412) already exists in the vocabulary.
+ Training BPE: 4%| 110/2599 [24:24<8:05:58, 11.71s/it] Merge for (1023, 293) already exists in the vocabulary.
+ Training BPE: 4%| 111/2599 [24:36<8:04:27, 11.68s/it] Merge for (503, 282) already exists in the vocabulary.
+ Training BPE: 4%| 112/2599 [24:47<7:59:29, 11.57s/it] Merge for (311, 2438) already exists in the vocabulary.
+ Training BPE: 4%| 113/2599 [24:59<7:58:49, 11.56s/it] Merge for (279, 321) already exists in the vocabulary.
+ Training BPE: 4%| 115/2599 [25:22<7:56:27, 11.51s/it] Merge for (303, 318) already exists in the vocabulary.
+ Training BPE: 4%| 116/2599 [25:33<7:56:28, 11.51s/it] Merge for (312, 320) already exists in the vocabulary.
+ Training BPE: 5%| 117/2599 [25:45<7:55:38, 11.50s/it] Merge for (306, 327) already exists in the vocabulary.
+ Training BPE: 5%| 118/2599 [25:56<7:54:21, 11.47s/it] Merge for (296, 327) already exists in the vocabulary.
+ Training BPE: 5%| 121/2599 [26:32<8:06:55, 11.79s/it] Merge for (282, 326) already exists in the vocabulary.
+ Training BPE: 5%| 122/2599 [26:44<8:02:47, 11.69s/it] Merge for (298, 326) already exists in the vocabulary.
+ Training BPE: 5%| 124/2599 [27:06<7:55:34, 11.53s/it] Merge for (287, 320) already exists in the vocabulary.
+ Training BPE: 5%| 126/2599 [27:29<7:50:27, 11.41s/it] Merge for (304, 326) already exists in the vocabulary.
+ Training BPE: 5%| 127/2599 [27:40<7:47:41, 11.35s/it] Merge for (294, 327) already exists in the vocabulary.
+ Training BPE: 5%| 129/2599 [28:03<7:45:52, 11.32s/it] Merge for (312, 319) already exists in the vocabulary.
+ Training BPE: 5%| 133/2599 [28:49<7:50:25, 11.45s/it] Merge for (304, 331) already exists in the vocabulary.
+ Training BPE: 5%| 134/2599 [29:00<7:45:20, 11.33s/it] Merge for (703, 292) already exists in the vocabulary.
+ Training BPE: 5%| 137/2599 [29:33<7:42:08, 11.26s/it] Merge for (277, 327) already exists in the vocabulary.
+ Training BPE: 5%| 142/2599 [30:29<7:37:31, 11.17s/it] Merge for (306, 333) already exists in the vocabulary.
+ Training BPE: 6%| 144/2599 [30:51<7:34:32, 11.11s/it] Merge for (302, 319) already exists in the vocabulary.
+ Training BPE: 6%| 145/2599 [31:03<7:34:21, 11.11s/it] Merge for (310, 318) already exists in the vocabulary.
+ Training BPE: 6%| 148/2599 [31:36<7:29:56, 11.01s/it] Merge for (277, 2403) already exists in the vocabulary.
+ Training BPE: 6%| 149/2599 [31:47<7:29:45, 11.01s/it] Merge for (304, 322) already exists in the vocabulary.
+ Training BPE: 6%| 150/2599 [31:58<7:29:13, 11.01s/it] Merge for (302, 321) already exists in the vocabulary.
+ Training BPE: 6%| 152/2599 [32:20<7:29:14, 11.02s/it] Merge for (743, 294) already exists in the vocabulary.
+ Training BPE: 6%| 156/2599 [33:03<7:21:49, 10.85s/it] Merge for (294, 2414) already exists in the vocabulary.
+ Training BPE: 6%| 157/2599 [33:14<7:23:35, 10.90s/it] Merge for (403, 277) already exists in the vocabulary.
+ Training BPE: 6%| 158/2599 [33:25<7:23:33, 10.90s/it] Merge for (643, 289) already exists in the vocabulary.
+ Training BPE: 6%| 159/2599 [33:35<7:20:09, 10.82s/it] Merge for (306, 319) already exists in the vocabulary.
+ Training BPE: 6%| 162/2599 [34:08<7:19:58, 10.83s/it] Merge for (277, 322) already exists in the vocabulary.
+ Training BPE: 6%| 164/2599 [34:30<7:17:59, 10.79s/it] Merge for (703, 309) already exists in the vocabulary.
+ Training BPE: 6%| 166/2599 [34:51<7:16:49, 10.77s/it] Merge for (292, 2403) already exists in the vocabulary.
+ Training BPE: 6%| 168/2599 [35:13<7:15:41, 10.75s/it] Merge for (304, 327) already exists in the vocabulary.
+ Training BPE: 7%| 170/2599 [35:34<7:16:07, 10.77s/it] Merge for (403, 287) already exists in the vocabulary.
+ Training BPE: 7%| 174/2599 [36:17<7:13:15, 10.72s/it] Merge for (309, 326) already exists in the vocabulary.
+ Training BPE: 7%| 175/2599 [36:28<7:13:08, 10.72s/it] Merge for (301, 321) already exists in the vocabulary.
+ Training BPE: 7%| 179/2599 [37:10<7:10:36, 10.68s/it] Merge for (294, 319) already exists in the vocabulary.
+ Training BPE: 7%| 181/2599 [37:32<7:11:27, 10.71s/it] Merge for (284, 320) already exists in the vocabulary.
+ Training BPE: 7%| 189/2599 [38:58<7:19:21, 10.94s/it] Merge for (296, 318) already exists in the vocabulary.
+ Training BPE: 7%| 191/2599 [39:20<7:14:50, 10.83s/it] Merge for (302, 2537) already exists in the vocabulary.
+ Training BPE: 7%| 192/2599 [39:30<7:09:39, 10.71s/it] Merge for (302, 326) already exists in the vocabulary.
+ Training BPE: 7%| 193/2599 [39:41<7:07:00, 10.65s/it] Merge for (306, 321) already exists in the vocabulary.
+ Training BPE: 7%| 194/2599 [39:51<7:04:38, 10.59s/it] Merge for (279, 318) already exists in the vocabulary.
+ Training BPE: 8%| 195/2599 [40:02<7:03:07, 10.56s/it] Merge for (279, 2403) already exists in the vocabulary.
+ Training BPE: 8%| 196/2599 [40:12<7:03:23, 10.57s/it] Merge for (294, 318) already exists in the vocabulary.
+ Training BPE: 8%| 197/2599 [40:23<7:02:25, 10.55s/it] Merge for (284, 2414) already exists in the vocabulary.
+ Training BPE: 8%| 199/2599 [40:43<6:56:46, 10.42s/it]
+ Vocab size: 2600: ష్ట + ్ర = ష్ట్ర
+ Training BPE: 8%| 200/2599 [40:54<6:56:33, 10.42s/it] Merge for (294, 321) already exists in the vocabulary.
+ Training BPE: 8%| 202/2599 [41:15<6:55:18, 10.40s/it] Merge for (312, 326) already exists in the vocabulary.
+ Training BPE: 8%| 204/2599 [41:35<6:54:53, 10.39s/it] Merge for (313, 328) already exists in the vocabulary.
+ Training BPE: 8%| 205/2599 [41:46<6:54:28, 10.39s/it] Merge for (289, 318) already exists in the vocabulary.
+ Training BPE: 8%| 208/2599 [42:17<6:55:27, 10.43s/it] Merge for (292, 320) already exists in the vocabulary.
+ Training BPE: 8%| 214/2599 [43:19<6:49:01, 10.29s/it] Merge for (296, 320) already exists in the vocabulary.
+ Training BPE: 8%| 215/2599 [43:29<6:49:19, 10.30s/it] Merge for (294, 320) already exists in the vocabulary.
+ Training BPE: 8%| 216/2599 [43:40<6:49:14, 10.30s/it] Merge for (287, 319) already exists in the vocabulary.
+ Training BPE: 8%| 220/2599 [44:21<6:43:30, 10.18s/it] Merge for (309, 320) already exists in the vocabulary.
+ Training BPE: 9%| 222/2599 [44:41<6:43:40, 10.19s/it] Merge for (295, 2414) already exists in the vocabulary.
+ Training BPE: 9%| 230/2599 [46:03<6:43:24, 10.22s/it] Merge for (300, 320) already exists in the vocabulary.
+ Training BPE: 9%| 231/2599 [46:13<6:43:10, 10.22s/it] Merge for (310, 2403) already exists in the vocabulary.
+ Training BPE: 9%| 234/2599 [46:43<6:42:05, 10.20s/it] Merge for (783, 303) already exists in the vocabulary.
+ Training BPE: 9%| 240/2599 [47:45<6:38:23, 10.13s/it] Merge for (298, 327) already exists in the vocabulary.
+ Training BPE: 9%| 243/2599 [48:15<6:35:26, 10.07s/it] Merge for (310, 333) already exists in the vocabulary.
+ Training BPE: 9%| 246/2599 [48:45<6:33:10, 10.03s/it] Merge for (312, 318) already exists in the vocabulary.
+ Training BPE: 10%| 250/2599 [49:25<6:30:14, 9.97s/it] Merge for (306, 318) already exists in the vocabulary.
+ Training BPE: 10%| 251/2599 [49:35<6:31:08, 10.00s/it] Merge for (302, 328) already exists in the vocabulary.
+ Training BPE: 10%| 252/2599 [49:45<6:31:56, 10.02s/it] Merge for (309, 2414) already exists in the vocabulary.
+ Training BPE: 10%| 257/2599 [50:35<6:30:24, 10.00s/it] Merge for (298, 320) already exists in the vocabulary.
+ Training BPE: 10%| 258/2599 [50:45<6:30:24, 10.01s/it] Merge for (289, 321) already exists in the vocabulary.
+ Training BPE: 10%| 260/2599 [51:04<6:28:06, 9.96s/it] Merge for (300, 333) already exists in the vocabulary.
+ Training BPE: 10%| 263/2599 [51:34<6:26:37, 9.93s/it] Merge for (312, 321) already exists in the vocabulary.
+ Training BPE: 10%| 266/2599 [52:04<6:25:29, 9.91s/it] Merge for (311, 333) already exists in the vocabulary.
+ Training BPE: 10%| 268/2599 [52:24<6:24:38, 9.90s/it] Merge for (298, 321) already exists in the vocabulary.
+ Training BPE: 10%| 269/2599 [52:34<6:24:45, 9.91s/it] Merge for (312, 258) already exists in the vocabulary.
+ Training BPE: 10%| 271/2599 [52:54<6:31:20, 10.09s/it] Merge for (284, 318) already exists in the vocabulary.
+ Training BPE: 11%| 275/2599 [53:35<6:30:29, 10.08s/it] Merge for (302, 331) already exists in the vocabulary.
+ Training BPE: 11%| 278/2599 [54:04<6:19:09, 9.80s/it] Merge for (923, 310) already exists in the vocabulary.
+ Training BPE: 11%| 283/2599 [54:53<6:17:56, 9.79s/it] Merge for (743, 295) already exists in the vocabulary.
+ Training BPE: 11%| 286/2599 [55:22<6:16:20, 9.76s/it] Merge for (304, 320) already exists in the vocabulary.
+ Training BPE: 11%| 289/2599 [55:52<6:13:59, 9.71s/it] Merge for (309, 328) already exists in the vocabulary.
+ Training BPE: 11%| 291/2599 [56:11<6:13:59, 9.72s/it] Merge for (282, 319) already exists in the vocabulary.
+ Training BPE: 11%| 293/2599 [56:30<6:13:50, 9.73s/it] Merge for (279, 333) already exists in the vocabulary.
+ Training BPE: 11%| 297/2599 [57:09<6:09:01, 9.62s/it] Merge for (292, 2414) already exists in the vocabulary.
+ Training BPE: 12%| 299/2599 [57:28<6:09:03, 9.63s/it]
+ Vocab size: 2700: చ + ార = చార
+ Training BPE: 12%| 302/2599 [57:57<6:07:00, 9.59s/it] Merge for (302, 320) already exists in the vocabulary.
+ Training BPE: 12%| 308/2599 [58:54<6:05:50, 9.58s/it] Merge for (302, 327) already exists in the vocabulary.
+ Training BPE: 12%| 310/2599 [59:13<6:03:35, 9.53s/it] Merge for (304, 328) already exists in the vocabulary.
+ Training BPE: 12%| 321/2599 [1:00:58<5:58:43, 9.45s/it] Merge for (303, 2414) already exists in the vocabulary.
+ Training BPE: 12%| 322/2599 [1:01:07<5:58:57, 9.46s/it] Merge for (292, 318) already exists in the vocabulary.
+ Training BPE: 13%| 326/2599 [1:01:45<5:55:14, 9.38s/it] Merge for (289, 320) already exists in the vocabulary.
+ Training BPE: 13%| 331/2599 [1:02:32<5:54:40, 9.38s/it] Merge for (287, 333) already exists in the vocabulary.
+ Training BPE: 13%| 332/2599 [1:02:41<5:54:26, 9.38s/it] Merge for (287, 321) already exists in the vocabulary.
+ Training BPE: 13%| 341/2599 [1:04:05<5:50:36, 9.32s/it] Merge for (284, 327) already exists in the vocabulary.
+ Training BPE: 14%| 351/2599 [1:05:38<5:50:00, 9.34s/it] Merge for (277, 326) already exists in the vocabulary.
+ Training BPE: 14%| 353/2599 [1:05:57<5:48:10, 9.30s/it] Merge for (1023, 298) already exists in the vocabulary.
+ Training BPE: 14%| 361/2599 [1:07:11<5:45:45, 9.27s/it] Merge for (302, 322) already exists in the vocabulary.
+ Training BPE: 14%| 363/2599 [1:07:30<5:44:06, 9.23s/it] Merge for (287, 2403) already exists in the vocabulary.
+ Training BPE: 14%| 372/2599 [1:08:52<5:41:05, 9.19s/it] Merge for (295, 318) already exists in the vocabulary.
+ Training BPE: 15%| 377/2599 [1:09:38<5:38:03, 9.13s/it] Merge for (279, 330) already exists in the vocabulary.
+ Training BPE: 15%| 380/2599 [1:10:06<5:37:53, 9.14s/it] Merge for (298, 333) already exists in the vocabulary.
+ Training BPE: 15%| 385/2599 [1:10:51<5:34:22, 9.06s/it] Merge for (309, 333) already exists in the vocabulary.
+ Training BPE: 15%| 388/2599 [1:11:18<5:34:17, 9.07s/it] Merge for (302, 330) already exists in the vocabulary.
+ Training BPE: 15%| 391/2599 [1:11:45<5:32:07, 9.03s/it] Merge for (278, 2414) already exists in the vocabulary.
+ Training BPE: 15%| 392/2599 [1:11:54<5:30:20, 8.98s/it] Merge for (301, 318) already exists in the vocabulary.
+ Training BPE: 15%| 399/2599 [1:12:57<5:30:04, 9.00s/it]
+ Vocab size: 2800: ల + ీస = లీస
+ Training BPE: 15%| 400/2599 [1:13:06<5:30:41, 9.02s/it] Merge for (843, 300) already exists in the vocabulary.
+ Training BPE: 16%| 405/2599 [1:13:52<5:30:04, 9.03s/it] Merge for (298, 330) already exists in the vocabulary.
+ Training BPE: 16%| 420/2599 [1:16:05<5:24:36, 8.94s/it] Merge for (282, 322) already exists in the vocabulary.
+ Training BPE: 16%| 422/2599 [1:16:23<5:23:56, 8.93s/it] Merge for (923, 292) already exists in the vocabulary.
+ Training BPE: 16%| 423/2599 [1:16:32<5:23:55, 8.93s/it] Merge for (301, 2414) already exists in the vocabulary.
+ Training BPE: 16%| 425/2599 [1:16:50<5:22:38, 8.90s/it] Merge for (279, 331) already exists in the vocabulary.
+ Training BPE: 16%| 428/2599 [1:17:17<5:24:58, 8.98s/it] Merge for (303, 319) already exists in the vocabulary.
+ Training BPE: 17%| 446/2599 [1:19:56<5:15:42, 8.80s/it] Merge for (313, 318) already exists in the vocabulary.
+ Training BPE: 17%| 449/2599 [1:20:23<5:17:26, 8.86s/it] Merge for (301, 319) already exists in the vocabulary.
+ Training BPE: 17%| 450/2599 [1:20:31<5:16:23, 8.83s/it] Merge for (277, 319) already exists in the vocabulary.
+ Training BPE: 17%| 452/2599 [1:20:49<5:15:23, 8.81s/it] Merge for (312, 331) already exists in the vocabulary.
+ Training BPE: 17%| 453/2599 [1:20:58<5:15:21, 8.82s/it] Merge for (284, 319) already exists in the vocabulary.
+ Training BPE: 17%| 454/2599 [1:21:07<5:14:49, 8.81s/it] Merge for (312, 327) already exists in the vocabulary.
+ Training BPE: 18%| 460/2599 [1:21:59<5:12:18, 8.76s/it] Merge for (287, 326) already exists in the vocabulary.
+ Training BPE: 18%| 462/2599 [1:22:17<5:13:12, 8.79s/it] Merge for (313, 326) already exists in the vocabulary.
+ Training BPE: 18%| 465/2599 [1:22:43<5:13:17, 8.81s/it] Merge for (284, 326) already exists in the vocabulary.
+ Training BPE: 18%| 471/2599 [1:23:36<5:09:57, 8.74s/it] Merge for (277, 323) already exists in the vocabulary.
+ Training BPE: 18%| 475/2599 [1:24:11<5:05:57, 8.64s/it] Merge for (298, 319) already exists in the vocabulary.
+ Training BPE: 19%| 485/2599 [1:25:37<5:03:47, 8.62s/it] Merge for (310, 319) already exists in the vocabulary.
+ Training BPE: 19%| 489/2599 [1:26:11<5:03:59, 8.64s/it] Merge for (312, 322) already exists in the vocabulary.
+ Training BPE: 19%| 494/2599 [1:26:55<5:02:17, 8.62s/it] Merge for (301, 322) already exists in the vocabulary.
+ Training BPE: 19%| 499/2599 [1:27:38<5:01:00, 8.60s/it]
+ Vocab size: 2900: న + ున్న = నున్న
+ Training BPE: 19%| 503/2599 [1:28:12<4:57:10, 8.51s/it] Merge for (279, 319) already exists in the vocabulary.
+ Training BPE: 20%| 515/2599 [1:29:54<4:55:09, 8.50s/it] Merge for (300, 321) already exists in the vocabulary.
+ Training BPE: 20%| 520/2599 [1:30:37<4:55:10, 8.52s/it] Merge for (312, 328) already exists in the vocabulary.
+ Training BPE: 20%| 524/2599 [1:31:11<4:57:02, 8.59s/it] Merge for (303, 322) already exists in the vocabulary.
+ Training BPE: 20%| 525/2599 [1:31:20<4:57:28, 8.61s/it] Merge for (963, 309) already exists in the vocabulary.
+ Training BPE: 20%| 526/2599 [1:31:28<4:56:30, 8.58s/it] Merge for (299, 319) already exists in the vocabulary.
+ Training BPE: 20%| 527/2599 [1:31:37<4:56:00, 8.57s/it] Merge for (300, 326) already exists in the vocabulary.
+ Training BPE: 21%| 535/2599 [1:32:44<4:51:41, 8.48s/it] Merge for (443, 279) already exists in the vocabulary.
+ Training BPE: 21%| 536/2599 [1:32:53<4:51:43, 8.48s/it] Merge for (300, 331) already exists in the vocabulary.
+ Training BPE: 21%| 537/2599 [1:33:01<4:52:00, 8.50s/it] Merge for (306, 320) already exists in the vocabulary.
+ Training BPE: 21%| 539/2599 [1:33:19<4:53:04, 8.54s/it] Merge for (703, 312) already exists in the vocabulary.
+ Training BPE: 22%| 563/2599 [1:36:40<4:45:12, 8.41s/it] Merge for (1023, 303) already exists in the vocabulary.
+ Training BPE: 22%| 566/2599 [1:37:06<4:44:06, 8.38s/it] Merge for (292, 330) already exists in the vocabulary.
+ Training BPE: 22%| 568/2599 [1:37:22<4:44:37, 8.41s/it] Merge for (294, 2403) already exists in the vocabulary.
+ Training BPE: 22%| 579/2599 [1:38:55<4:43:24, 8.42s/it] Merge for (306, 328) already exists in the vocabulary.
+ Training BPE: 22%| 581/2599 [1:39:12<4:43:46, 8.44s/it] Merge for (923, 282) already exists in the vocabulary.
+ Training BPE: 23%| 597/2599 [1:41:26<4:38:58, 8.36s/it] Merge for (309, 323) already exists in the vocabulary.
+ Training BPE: 23%| 599/2599 [1:41:43<4:39:47, 8.39s/it]
+ Vocab size: 3000: (ఆంధ్రజ్యోతి) + : = (ఆంధ్రజ్యోతి):
+ Training BPE: 23%| 601/2599 [1:42:00<4:38:34, 8.37s/it] Merge for (923, 302) already exists in the vocabulary.
+ Training BPE: 23%| 609/2599 [1:43:06<4:36:56, 8.35s/it] Merge for (923, 293) already exists in the vocabulary.
+ Training BPE: 23%| 610/2599 [1:43:15<4:37:18, 8.37s/it] Merge for (296, 331) already exists in the vocabulary.
+ Training BPE: 24%| 612/2599 [1:43:32<4:37:01, 8.37s/it] Merge for (300, 319) already exists in the vocabulary.
+ Training BPE: 24%| 613/2599 [1:43:40<4:35:10, 8.31s/it] Merge for (289, 2403) already exists in the vocabulary.
+ Training BPE: 24%| 614/2599 [1:43:48<4:34:45, 8.31s/it] Merge for (296, 326) already exists in the vocabulary.
+ Training BPE: 24%| 616/2599 [1:44:05<4:34:12, 8.30s/it] Merge for (310, 321) already exists in the vocabulary.
+ Training BPE: 24%| 619/2599 [1:44:30<4:36:34, 8.38s/it] Merge for (292, 327) already exists in the vocabulary.
+ Training BPE: 24%| 626/2599 [1:45:28<4:33:09, 8.31s/it] Merge for (284, 333) already exists in the vocabulary.
+ Training BPE: 24%| 633/2599 [1:46:26<4:31:57, 8.30s/it] Merge for (1003, 291) already exists in the vocabulary.
+ Training BPE: 25%| 637/2599 [1:46:59<4:31:02, 8.29s/it] Merge for (295, 319) already exists in the vocabulary.
+ Training BPE: 25%| 649/2599 [1:48:38<4:28:15, 8.25s/it] Merge for (278, 318) already exists in the vocabulary.
+ Training BPE: 25%| 660/2599 [1:50:09<4:27:54, 8.29s/it] Merge for (282, 328) already exists in the vocabulary.
+ Training BPE: 26%| 663/2599 [1:50:34<4:26:33, 8.26s/it] Merge for (313, 319) already exists in the vocabulary.
+ Training BPE: 26%| 671/2599 [1:51:40<4:24:56, 8.24s/it] Merge for (292, 321) already exists in the vocabulary.
+ Training BPE: 26%| 677/2599 [1:52:30<4:23:41, 8.23s/it] Merge for (292, 331) already exists in the vocabulary.
+ Training BPE: 27%| 699/2599 [1:55:28<4:16:20, 8.09s/it]
+ Vocab size: 3100: , + నవంబర్ = , నవంబర్
+ Training BPE: 27%| 712/2599 [1:57:13<4:14:06, 8.08s/it] Merge for (306, 326) already exists in the vocabulary.
+ Training BPE: 28%| 716/2599 [1:57:46<4:13:56, 8.09s/it] Merge for (296, 322) already exists in the vocabulary.
+ Training BPE: 28%| 717/2599 [1:57:54<4:15:07, 8.13s/it] Merge for (277, 320) already exists in the vocabulary.
+ Training BPE: 29%| 749/2599 [2:02:11<4:06:19, 7.99s/it] Merge for (302, 333) already exists in the vocabulary.
+ Training BPE: 29%| 756/2599 [2:03:08<4:07:00, 8.04s/it] Merge for (287, 318) already exists in the vocabulary.
+ Training BPE: 29%| 761/2599 [2:03:48<4:06:58, 8.06s/it] Merge for (299, 331) already exists in the vocabulary.
+ Training BPE: 29%| 766/2599 [2:04:28<4:03:32, 7.97s/it] Merge for (292, 326) already exists in the vocabulary.
+ Training BPE: 30%| 771/2599 [2:05:08<4:01:41, 7.93s/it] Merge for (803, 292) already exists in the vocabulary.
+ Training BPE: 30%| 776/2599 [2:05:48<4:02:30, 7.98s/it] Merge for (306, 2457) already exists in the vocabulary.
+ Training BPE: 30%| 783/2599 [2:06:44<4:04:33, 8.08s/it] Merge for (403, 292) already exists in the vocabulary.
+ Training BPE: 31%| 794/2599 [2:08:14<4:04:44, 8.14s/it] Merge for (309, 2403) already exists in the vocabulary.
+ Training BPE: 31%| 799/2599 [2:08:54<4:04:07, 8.14s/it]
+ Vocab size: 3200: ో + జ = ోజ
+ Training BPE: 31%| 804/2599 [2:09:35<4:03:09, 8.13s/it] Merge for (291, 318) already exists in the vocabulary.
+ Training BPE: 31%| 807/2599 [2:09:59<4:02:25, 8.12s/it] Merge for (309, 321) already exists in the vocabulary.
+ Training BPE: 32%| 823/2599 [2:12:07<3:56:31, 7.99s/it] Merge for (923, 309) already exists in the vocabulary.
+ Training BPE: 32%| 837/2599 [2:14:00<3:54:25, 7.98s/it] Merge for (309, 331) already exists in the vocabulary.
+ Training BPE: 32%| 844/2599 [2:14:56<3:57:36, 8.12s/it] Merge for (289, 326) already exists in the vocabulary.
+ Training BPE: 33%| 856/2599 [2:16:33<3:52:33, 8.01s/it] Merge for (313, 331) already exists in the vocabulary.
+ Training BPE: 33%| 868/2599 [2:18:09<3:51:08, 8.01s/it] Merge for (298, 2438) already exists in the vocabulary.
+ Training BPE: 34%| 871/2599 [2:18:33<3:51:14, 8.03s/it] Merge for (295, 333) already exists in the vocabulary.
+ Training BPE: 34%| 882/2599 [2:20:01<3:50:30, 8.05s/it] Merge for (298, 322) already exists in the vocabulary.
+ Training BPE: 34%| 889/2599 [2:20:58<3:48:58, 8.03s/it] Merge for (287, 331) already exists in the vocabulary.
+ Training BPE: 35%| 899/2599 [2:22:18<3:48:23, 8.06s/it]
+ Vocab size: 3300: ుగ + ు = ుగు
+ Training BPE: 35%| 907/2599 [2:23:23<3:47:26, 8.07s/it] Merge for (299, 333) already exists in the vocabulary.
+ Training BPE: 35%| 914/2599 [2:24:18<3:42:35, 7.93s/it] Merge for (1023, 309) already exists in the vocabulary.
+ Training BPE: 36%| 933/2599 [2:26:51<3:42:47, 8.02s/it] Merge for (300, 2403) already exists in the vocabulary.
+ Training BPE: 36%| 948/2599 [2:28:52<3:40:22, 8.01s/it] Merge for (279, 332) already exists in the vocabulary.
+ Training BPE: 37%| 974/2599 [2:32:20<3:38:58, 8.09s/it] Merge for (284, 328) already exists in the vocabulary.
+ Training BPE: 38%| 978/2599 [2:32:52<3:37:19, 8.04s/it] Merge for (279, 2414) already exists in the vocabulary.
+ Training BPE: 38%| 996/2599 [2:35:16<3:34:13, 8.02s/it] Merge for (282, 318) already exists in the vocabulary.
+ Training BPE: 38%| 999/2599 [2:35:40<3:33:57, 8.02s/it]
+ Vocab size: 3400: వ + ు = వు
+ Training BPE: 38%| 1000/2599 [2:35:48<3:34:52, 8.06s/it] Merge for (284, 331) already exists in the vocabulary.
+ Training BPE: 39%| 1019/2599 [2:38:19<3:27:43, 7.89s/it] Merge for (299, 326) already exists in the vocabulary.
+ Training BPE: 39%| 1025/2599 [2:39:07<3:29:29, 7.99s/it] Merge for (307, 318) already exists in the vocabulary.
+ Training BPE: 40%| 1034/2599 [2:40:19<3:25:44, 7.89s/it] Merge for (313, 320) already exists in the vocabulary.
+ Training BPE: 40%| 1039/2599 [2:40:58<3:25:01, 7.89s/it] Merge for (983, 296) already exists in the vocabulary.
+ Training BPE: 40%| 1048/2599 [2:42:10<3:26:27, 7.99s/it] Merge for (289, 327) already exists in the vocabulary.
+ Training BPE: 41%| 1074/2599 [2:45:37<3:19:53, 7.86s/it] Merge for (287, 328) already exists in the vocabulary.
+ Training BPE: 42%| 1097/2599 [2:48:39<3:18:38, 7.94s/it] Merge for (289, 328) already exists in the vocabulary.
+ Training BPE: 42%| 1098/2599 [2:48:47<3:20:05, 8.00s/it] Merge for (302, 323) already exists in the vocabulary.
+ Training BPE: 42%| 1099/2599 [2:48:55<3:19:12, 7.97s/it]
+ Vocab size: 3500: మ + ృ = మృ
+ Training BPE: 43%| 1114/2599 [2:50:55<3:18:19, 8.01s/it] Merge for (311, 327) already exists in the vocabulary.
+ Training BPE: 44%| 1140/2599 [2:54:21<3:13:37, 7.96s/it] Merge for (294, 322) already exists in the vocabulary.
+ Training BPE: 45%| 1158/2599 [2:56:44<3:11:07, 7.96s/it] Merge for (299, 320) already exists in the vocabulary.
+ Training BPE: 45%| 1181/2599 [2:59:46<3:06:18, 7.88s/it] Merge for (983, 309) already exists in the vocabulary.
+ Training BPE: 46%| 1199/2599 [3:02:09<3:04:53, 7.92s/it]
+ Vocab size: 3600: స్ట + ే = స్టే
+ Training BPE: 47%| 1210/2599 [3:03:37<3:04:37, 7.98s/it] Merge for (311, 318) already exists in the vocabulary.
+ Training BPE: 47%| 1214/2599 [3:04:08<3:01:11, 7.85s/it] Merge for (300, 2412) already exists in the vocabulary.
+ Training BPE: 47%| 1224/2599 [3:05:26<2:59:35, 7.84s/it] Merge for (282, 331) already exists in the vocabulary.
+ Training BPE: 47%| 1226/2599 [3:05:42<3:00:33, 7.89s/it] Merge for (299, 328) already exists in the vocabulary.
+ Training BPE: 48%| 1241/2599 [3:07:40<2:57:38, 7.85s/it] Merge for (307, 333) already exists in the vocabulary.
+ Training BPE: 48%| 1254/2599 [3:09:23<2:57:35, 7.92s/it] Merge for (310, 327) already exists in the vocabulary.
+ Training BPE: 49%| 1268/2599 [3:11:12<2:53:54, 7.84s/it] Merge for (923, 279) already exists in the vocabulary.
+ Training BPE: 49%| 1274/2599 [3:11:59<2:50:48, 7.73s/it] Merge for (303, 321) already exists in the vocabulary.
+ Training BPE: 49%| 1279/2599 [3:12:39<2:53:24, 7.88s/it] Merge for (284, 321) already exists in the vocabulary.
+ Training BPE: 49%| 1280/2599 [3:12:47<2:53:37, 7.90s/it] Merge for (294, 331) already exists in the vocabulary.
+ Training BPE: 50%| 1298/2599 [3:15:08<2:49:45, 7.83s/it] Merge for (923, 298) already exists in the vocabulary.
+ Training BPE: 50%| 1299/2599 [3:15:16<2:49:19, 7.82s/it]
+ Vocab size: 3700: ర్ + ప = ర్ప
+ Training BPE: 50%| 1306/2599 [3:16:11<2:48:35, 7.82s/it] Merge for (284, 322) already exists in the vocabulary.
+ Training BPE: 50%| 1309/2599 [3:16:34<2:47:53, 7.81s/it] Merge for (310, 331) already exists in the vocabulary.
+ Training BPE: 51%| 1314/2599 [3:17:13<2:48:02, 7.85s/it] Merge for (287, 327) already exists in the vocabulary.
+ Training BPE: 51%| 1334/2599 [3:19:48<2:45:00, 7.83s/it] Merge for (282, 321) already exists in the vocabulary.
+ Training BPE: 52%| 1341/2599 [3:20:43<2:42:21, 7.74s/it] Merge for (277, 332) already exists in the vocabulary.
+ Training BPE: 52%| 1343/2599 [3:20:58<2:42:16, 7.75s/it] Merge for (277, 2412) already exists in the vocabulary.
+ Training BPE: 52%| 1346/2599 [3:21:22<2:42:56, 7.80s/it] Merge for (282, 320) already exists in the vocabulary.
+ Training BPE: 52%| 1349/2599 [3:21:45<2:41:49, 7.77s/it] Merge for (299, 2438) already exists in the vocabulary.
+ Training BPE: 54%| 1392/2599 [3:27:18<2:36:01, 7.76s/it] Merge for (289, 2412) already exists in the vocabulary.
+ Training BPE: 54%| 1399/2599 [3:28:12<2:33:45, 7.69s/it]
+ Vocab size: 3800: ఫిర్యా + దు = ఫిర్యాదు
+ Training BPE: 55%| 1421/2599 [3:31:01<2:29:57, 7.64s/it] Merge for (294, 330) already exists in the vocabulary.
+ Training BPE: 55%| 1442/2599 [3:33:44<2:29:12, 7.74s/it] Merge for (313, 333) already exists in the vocabulary.
+ Training BPE: 57%| 1478/2599 [3:38:20<2:23:52, 7.70s/it] Merge for (783, 312) already exists in the vocabulary.
+ Training BPE: 57%| 1483/2599 [3:38:57<2:20:01, 7.53s/it] Merge for (1003, 288) already exists in the vocabulary.
+ Training BPE: 57%| 1491/2599 [3:39:59<2:21:18, 7.65s/it] Merge for (703, 302) already exists in the vocabulary.
+ Training BPE: 58%| 1499/2599 [3:41:01<2:20:24, 7.66s/it]
+ Vocab size: 3900: మాట్లాడ + ారు. = మాట్లాడారు.
+ Training BPE: 58%| 1508/2599 [3:42:10<2:18:48, 7.63s/it] Merge for (312, 330) already exists in the vocabulary.
+ Training BPE: 60%| 1565/2599 [3:49:23<2:10:01, 7.55s/it] Merge for (298, 2412) already exists in the vocabulary.
+ Training BPE: 61%| 1585/2599 [3:51:55<2:07:34, 7.55s/it] Merge for (300, 328) already exists in the vocabulary.
+ Training BPE: 61%| 1597/2599 [3:53:26<2:04:38, 7.46s/it] Merge for (296, 328) already exists in the vocabulary.
+ Training BPE: 62%| 1599/2599 [3:53:41<2:05:58, 7.56s/it]
+ Vocab size: 4000: తి + ని = తిని
+ Training BPE: 62%| 1611/2599 [3:55:11<2:01:21, 7.37s/it] Merge for (284, 2403) already exists in the vocabulary.
+ Training BPE: 62%| 1613/2599 [3:55:25<2:00:58, 7.36s/it] Merge for (303, 331) already exists in the vocabulary.
+ Training BPE: 63%| 1630/2599 [3:57:33<2:00:49, 7.48s/it] Merge for (543, 286) already exists in the vocabulary.
+ Training BPE: 63%| 1639/2599 [3:58:40<1:59:28, 7.47s/it] Merge for (1023, 312) already exists in the vocabulary.
+ Training BPE: 64%| 1653/2599 [4:00:24<1:57:20, 7.44s/it] Merge for (300, 327) already exists in the vocabulary.
+ Training BPE: 64%| 1664/2599 [4:01:47<1:57:03, 7.51s/it] Merge for (295, 321) already exists in the vocabulary.
+ Training BPE: 65%| 1698/2599 [4:05:59<1:51:53, 7.45s/it] Merge for (313, 321) already exists in the vocabulary.
+ Training BPE: 65%| 1699/2599 [4:06:07<1:51:03, 7.40s/it]
+ Vocab size: 4100: హ + ు = హు
+ Training BPE: 67%| 1750/2599 [4:12:21<1:42:27, 7.24s/it] Merge for (277, 2414) already exists in the vocabulary.
+ Training BPE: 69%| 1799/2599 [4:18:21<1:36:27, 7.23s/it]
+ Vocab size: 4200: పార్ + టీ = పార్టీ
+ Training BPE: 70%| 1822/2599 [4:21:08<1:34:43, 7.31s/it] Merge for (294, 326) already exists in the vocabulary.
+ Training BPE: 70%| 1830/2599 [4:22:07<1:35:05, 7.42s/it] Merge for (300, 330) already exists in the vocabulary.
+ Training BPE: 72%| 1862/2599 [4:26:01<1:29:42, 7.30s/it] Merge for (923, 311) already exists in the vocabulary.
+ Training BPE: 73%| 1888/2599 [4:29:10<1:26:21, 7.29s/it] Merge for (299, 2403) already exists in the vocabulary.
+ Training BPE: 73%| 1899/2599 [4:30:30<1:24:22, 7.23s/it]
+ Vocab size: 4300: సమ + యంలో = సమయంలో
+ Training BPE: 74%| 1918/2599 [4:32:47<1:22:06, 7.23s/it] Merge for (310, 258) already exists in the vocabulary.
+ Training BPE: 74%| 1929/2599 [4:34:06<1:20:24, 7.20s/it] Merge for (300, 332) already exists in the vocabulary.
+ Training BPE: 77%| 1999/2599 [4:42:31<1:11:27, 7.15s/it]
+ Vocab size: 4400: ప్ర + సా = ప్రసా
+ Merge for (295, 320) already exists in the vocabulary.
+ Training BPE: 78%| 2016/2599 [4:44:34<1:10:21, 7.24s/it] Merge for (923, 306) already exists in the vocabulary.
+ Training BPE: 78%| 2019/2599 [4:44:55<1:08:38, 7.10s/it] Merge for (1064, 327) already exists in the vocabulary.
+ Training BPE: 79%| 2043/2599 [4:47:47<1:06:17, 7.15s/it] Merge for (300, 322) already exists in the vocabulary.
+ Training BPE: 80%| 2074/2599 [4:51:29<1:03:09, 7.22s/it] Merge for (943, 282) already exists in the vocabulary.
+ Training BPE: 81%| 2099/2599 [4:54:26<58:57, 7.08s/it]
+ Vocab size: 4500: అ + ధ్యక్షుడు = అధ్యక్షుడు
+ Training BPE: 81%| 2106/2599 [4:55:16<58:50, 7.16s/it] Merge for (299, 321) already exists in the vocabulary.
+ Training BPE: 82%| 2140/2599 [4:59:19<55:56, 7.31s/it] Merge for (291, 2414) already exists in the vocabulary.
+ Training BPE: 83%| 2158/2599 [5:01:26<52:14, 7.11s/it] Merge for (279, 327) already exists in the vocabulary.
+ Training BPE: 84%| 2182/2599 [5:04:16<49:28, 7.12s/it] Merge for (311, 2414) already exists in the vocabulary.
+ Training BPE: 85%| 2199/2599 [5:06:16<46:36, 6.99s/it]
+ Vocab size: 4600: దే + శా = దేశా
+ Training BPE: 85%| 2207/2599 [5:07:12<45:58, 7.04s/it] Merge for (312, 332) already exists in the vocabulary.
+ Training BPE: 86%| 2234/2599 [5:10:23<43:07, 7.09s/it] Merge for (503, 283) already exists in the vocabulary.
+ Training BPE: 87%| 2250/2599 [5:12:15<40:04, 6.89s/it] Merge for (310, 320) already exists in the vocabulary.
+ Training BPE: 87%| 2258/2599 [5:13:11<40:00, 7.04s/it] Merge for (299, 318) already exists in the vocabulary.
+ Training BPE: 88%| 2299/2599 [5:17:58<34:31, 6.91s/it]
+ Vocab size: 4700: ర్ + లో = ర్లో
+ Training BPE: 91%| 2367/2599 [5:25:54<26:50, 6.94s/it] Merge for (843, 294) already exists in the vocabulary.
+ Training BPE: 92%| 2399/2599 [5:29:38<23:13, 6.97s/it]
+ Vocab size: 4800: స + ద = సద
+ Training BPE: 93%| 2414/2599 [5:31:22<21:35, 7.01s/it] Merge for (280, 318) already exists in the vocabulary.
+ Training BPE: 96%| 2499/2599 [5:41:03<11:12, 6.73s/it]
+ Vocab size: 4900: భవిష + ్య = భవిష్య
+ Training BPE: 96%| 2505/2599 [5:41:43<10:27, 6.67s/it] Merge for (763, 309) already exists in the vocabulary.
+ Training BPE: 99%| 2574/2599 [5:49:29<02:50, 6.84s/it] Merge for (313, 332) already exists in the vocabulary.
+ Training BPE: 100%| 2598/2599 [5:52:10<00:08, 8.13s/it]
+
+ Final statistics:
+ Final vocabulary size: 4,999
+ Number of merges: 2,599
+ Final compression ratio: 8.63x
+ Training time: 21135.62 seconds
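A compression ratio in this style is typically raw text size divided by final token count. A small sketch, assuming the logged 8.63x is byte-based (the log does not state whether bytes or characters are the numerator):

```python
def compression_ratio(text: str, token_ids: list) -> float:
    # Raw UTF-8 bytes per BPE token; higher means better compression.
    return len(text.encode("utf-8")) / len(token_ids)

# Under the byte-based assumption, 283,496,279 bytes at 8.63x implies
# roughly 32.8 million tokens remaining after all 2,599 merges.
```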
+
+ Tokenizer mappings saved to telugu_tokenizer_vocab.json and telugu_tokenizer_merges.json
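These are the same two JSON files the Dockerfile copies into the Space image, so the app can rebuild the tokenizer without retraining. A hedged sketch of loading them back at inference time; the exact JSON schema (integer-ID keys, "a,b" merge keys) is an assumption:

```python
import json

def load_tokenizer_files(vocab_path="telugu_tokenizer_vocab.json",
                         merges_path="telugu_tokenizer_merges.json"):
    # Rebuild the ID -> token-string table and the learned merge rules.
    # Key formats below are assumptions; adjust to the files' real schema.
    with open(vocab_path, encoding="utf-8") as f:
        vocab = {int(k): v for k, v in json.load(f).items()}
    with open(merges_path, encoding="utf-8") as f:
        merges = {tuple(int(x) for x in k.split(",")): v
                  for k, v in json.load(f).items()}
    return vocab, merges
```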
+
+ Test Results:
+ Original: తెలుగు భాష
+ Encoded: [4149, 4717]
+ Decoded: తెలుగు భాష
+ Matches original: True
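The smoke test above is a lossless encode/decode roundtrip. The same check can be expressed as a helper, assuming the trained `BPETokenizer` exposes `encode()` and `decode()` methods (an assumption about its API):

```python
def roundtrip_matches(tokenizer, text: str) -> bool:
    # Encode to IDs, decode back, and compare with the original string.
    ids = tokenizer.encode(text)
    return tokenizer.decode(ids) == text

# usage, mirroring the log's smoke test:
# roundtrip_matches(tok, "తెలుగు భాష")  # expected: True
```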