Add application file

- README.md +100 -14
- app.py +247 -0
- requirements.txt +5 -0
README.md
CHANGED
@@ -1,14 +1,100 @@

# Turkish Tiktokenizer Web App

A Streamlit-based web interface for the Turkish Morphological Tokenizer. The app provides an interactive way to tokenize Turkish text, with real-time visualization and color-coded token display.

## Features

- Turkish text tokenization with morphological analysis
- Color-coded token visualization
- Token count and ID display
- Special token highlighting (uppercase, space, newline, etc.)
- Version selection from the GitHub commit history
- Direct integration with the GitHub repository

## Demo

You can try the live demo at [Hugging Face Spaces](https://huggingface.co/spaces/YOUR_USERNAME/turkish-tiktokenizer) (replace with your actual Space URL).

## Installation

1. Clone the repository:
```bash
git clone https://github.com/malibayram/tokenizer.git
cd tokenizer/streamlit_app
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

1. Run the Streamlit app:
```bash
streamlit run app.py
```

2. Open your browser and navigate to http://localhost:8501

3. Enter Turkish text in the input area and click "Tokenize"

## How It Works

1. **Text Input**: Enter Turkish text in the left panel
2. **Tokenization**: Click the "Tokenize" button to process the text (the expected result shape is sketched after this list)
3. **Visualization**:
   - The token count is displayed at the top
   - Tokens are shown with color coding:
     - Special tokens (uppercase, space, etc.) have predefined colors
     - Regular tokens get unique colors for easy identification
   - Token IDs are displayed below the visualization
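
For reference, `app.py` only assumes that the tokenizer returns a dictionary with `tokens` and `ids` keys. A minimal sketch of that shape (the tokens and IDs below are illustrative placeholders, not real tokenizer output):

```python
# Hypothetical result shape, as consumed by app.py; the actual tokens and IDs
# depend on the vocabulary files in the repository.
result = {
    "tokens": ["<uppercase>", "merhaba", "<space>", "dünya"],
    "ids": [0, 123, 1, 456],
}

token_count = len(result["tokens"])            # shown as "Token count"
ids_line = ", ".join(map(str, result["ids"]))  # shown under "Token IDs"
print(token_count, ids_line)
```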

## Code Structure

- `app.py`: Main Streamlit application
  - UI components and layout
  - GitHub integration
  - Tokenization logic
  - Color generation and visualization
- `requirements.txt`: Python dependencies

## Technical Details

- **Tokenizer Source**: Fetched directly from the GitHub repository
- **Caching**: Uses Streamlit's caching for better performance
- **Color Generation**: HSV-based algorithm for visually distinct colors (see the sketch after this list)
- **Session State**: Maintains text and results between interactions
- **Error Handling**: Graceful handling of GitHub API and tokenization errors
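
The color scheme can be previewed outside the app. The following is a minimal standalone sketch of the same HSV spacing used by `generate_colors` in `app.py` (shown later in this diff):

```python
import colorsys

def distinct_colors(n):
    """Space hues evenly around the HSV wheel, varying saturation and value slightly."""
    colors = []
    for i in range(n):
        hue = i / n
        saturation = 0.3 + (i % 3) * 0.1  # cycles through 0.3, 0.4, 0.5
        value = 0.95 - (i % 2) * 0.1      # alternates between 0.95 and 0.85
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
        colors.append("#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255)))
    return colors

print(distinct_colors(5))  # five visually distinct hex colors
```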

## Deployment to Hugging Face Spaces

1. Create a new Space:
   - Go to https://huggingface.co/spaces
   - Click "Create new Space"
   - Select "Streamlit" as the SDK
   - Choose a name for your Space

2. Upload files:
   - `app.py`
   - `requirements.txt`

3. The app will automatically deploy and be available at your Space's URL

## Contributing

1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request

## License

MIT License - see the [LICENSE](../LICENSE) file for details

## Acknowledgments

- Built by dqbd
- Created with the generous help from Diagram
- Based on the [Turkish Morphological Tokenizer](https://github.com/malibayram/tokenizer)
app.py
ADDED
@@ -0,0 +1,247 @@

import streamlit as st
import sys
from datetime import datetime
from pathlib import Path
import base64
import colorsys
import shutil
import atexit
import requests
import importlib.util

# Set page config - MUST BE FIRST STREAMLIT COMMAND
st.set_page_config(
    page_title="Turkish Tiktokenizer",
    page_icon="🇹🇷",
    layout="wide"
)

# Initialize session state
if 'text' not in st.session_state:
    st.session_state.text = ""
if 'token_results' not in st.session_state:
    st.session_state.token_results = None

# Constants
GITHUB_REPO = "malibayram/tokenizer"
GITHUB_BRANCH = "main"

# Special tokens and their IDs
SPECIAL_TOKENS = {
    "<uppercase>": 0,  # Uppercase letter marker
    "<space>": 1,      # Space character
    "<newline>": 2,    # Newline character
    "<tab>": 3,        # Tab character
    "<unknown>": 4     # Unknown token
}

# Special token display symbols
SPECIAL_TOKEN_SYMBOLS = {
    "<uppercase>": "[uppercase]",  # Shown in place of the uppercase marker
    "<space>": "[space]",          # Shown in place of a space token
    "<newline>": "[newline]",      # Shown in place of a newline token
    "<tab>": "[tab]",              # Shown in place of a tab token
    "<unknown>": "[unknown]"       # Shown in place of an unknown token
}

# Colors for special tokens
SPECIAL_COLORS = {
    "<uppercase>": "#FF9999",  # Light red for uppercase markers
    "<space>": "#CCCCCC",      # Gray for spaces
    "<newline>": "#CCCCCC",    # Gray for newlines
    "<tab>": "#CCCCCC",        # Gray for tabs
    "<unknown>": "#FF0000"     # Red for unknown tokens
}

# Required files mapping (local name -> path in the GitHub repository)
REQUIRED_FILES = {
    'tokenizer.py': 'turkish_tokenizer/turkish_tokenizer.py',
    'kokler_v05.json': 'turkish_tokenizer/kokler_v05.json',
    'ekler_v05.json': 'turkish_tokenizer/ekler_v05.json',
    'bpe_v05.json': 'turkish_tokenizer/bpe_v05.json'
}

# Token ID ranges
TOKEN_RANGES = {
    'special': (0, 4),          # Special tokens
    'root_words': (5, 20000),   # Root words
    'suffixes': (22268, 22767), # Suffixes
    'bpe': (20000, None)        # BPE tokens (20000+)
}

def generate_colors(n):
    """Generate n visually distinct colors."""
    colors = []
    for i in range(n):
        hue = i / n
        saturation = 0.3 + (i % 3) * 0.1  # Vary saturation between 0.3-0.5
        value = 0.95 - (i % 2) * 0.1      # Vary value between 0.85-0.95
        rgb = colorsys.hsv_to_rgb(hue, saturation, value)
        hex_color = "#{:02x}{:02x}{:02x}".format(
            int(rgb[0] * 255),
            int(rgb[1] * 255),
            int(rgb[2] * 255)
        )
        colors.append(hex_color)
    return colors

def fetch_github_file(path, ref=GITHUB_BRANCH):
    """Fetch file content from the GitHub repository."""
    url = f"https://api.github.com/repos/{GITHUB_REPO}/contents/{path}?ref={ref}"
    response = requests.get(url)
    if response.status_code == 200:
        content = base64.b64decode(response.json()['content']).decode('utf-8')
        return content
    else:
        st.error(f"Could not fetch {path} from GitHub: {response.status_code}")
        return None

@st.cache_resource
def load_tokenizer():
    """Load and initialize the tokenizer from GitHub."""
    temp_dir = Path("temp_tokenizer")
    temp_dir.mkdir(exist_ok=True)

    # Fetch required files
    for local_name, github_path in REQUIRED_FILES.items():
        content = fetch_github_file(github_path)
        if content is None:
            return None

        with open(temp_dir / local_name, 'w', encoding='utf-8') as f:
            f.write(content)

    # Modify the tokenizer so it loads its JSON data from temp_dir
    # (assumes the fetched module defines load_json and imports os and json)
    tokenizer_path = temp_dir / "tokenizer.py"
    with open(tokenizer_path, 'r', encoding='utf-8') as f:
        tokenizer_code = f.read()

    modified_code = tokenizer_code.replace(
        'def load_json(filename):',
        f'''def load_json(filename):
    full_path = os.path.join("{temp_dir.absolute()}", filename)
    with open(full_path, 'r', encoding='utf-8') as file:
        return json.load(file)'''
    )

    with open(tokenizer_path, 'w', encoding='utf-8') as f:
        f.write(modified_code)

    # Load module
    spec = importlib.util.spec_from_file_location("tokenizer", str(temp_dir / "tokenizer.py"))
    module = importlib.util.module_from_spec(spec)
    sys.modules["tokenizer"] = module
    spec.loader.exec_module(module)

    return module.tokenize

@st.cache_data(ttl=3600)
def get_commit_history():
    """Fetch commit history from GitHub."""
    url = f"https://api.github.com/repos/{GITHUB_REPO}/commits"
    try:
        response = requests.get(url)
        if response.status_code == 200:
            commits = response.json()
            versions = []
            for commit in commits[:10]:
                date = datetime.strptime(commit['commit']['author']['date'], '%Y-%m-%dT%H:%M:%SZ').strftime('%Y-%m-%d')
                sha = commit['sha'][:7]
                message = commit['commit']['message'].split('\n')[0][:50]
                versions.append(f"{date} - {sha} - {message}")
            return versions
        return ["latest"]
    except Exception as e:
        st.warning(f"Could not fetch commit history: {str(e)}")
        return ["latest"]

def render_tokens(tokens, token_colors):
    """Render colored token visualization."""
    html_tokens = []
    for token in tokens:
        color = token_colors[token]
        display_text = SPECIAL_TOKEN_SYMBOLS.get(token, token)  # Use symbol for special tokens
        html_tokens.append(
            f'<span style="background-color: {color}; padding: 2px 4px; margin: 2px; border-radius: 3px;" title="{token}">{display_text}</span>'
        )
    return " ".join(html_tokens)

# Load tokenizer
tokenize = load_tokenizer()
if tokenize is None:
    st.error("Failed to load tokenizer from GitHub")
    st.stop()

# UI Layout
st.title("🇹🇷 Turkish Tiktokenizer")

# Model selection
versions = get_commit_history()
model = st.selectbox("", versions, key="model_selection", label_visibility="collapsed")

# Main layout
col1, col2 = st.columns([0.4, 0.6])

# Input column
with col1:
    text = st.text_area(
        "Enter Turkish text to tokenize",
        value=st.session_state.text,
        height=200,
        key="text_input",
        label_visibility="collapsed",
        placeholder="Enter Turkish text to tokenize"
    )

    if st.button("Tokenize", type="primary"):
        st.session_state.text = text
        if text.strip():
            try:
                st.session_state.token_results = tokenize(text)
            except Exception as e:
                st.session_state.token_results = None
                st.error(f"Error tokenizing text: {str(e)}")
        else:
            st.session_state.token_results = None

# Results column
with col2:
    st.markdown("Token count")
    if st.session_state.token_results is not None:
        result = st.session_state.token_results
        token_count = len(result["tokens"])
        st.markdown(f"### {token_count}")

        st.markdown("Tokenized text")

        # Generate token colors
        regular_tokens = [t for t in result["tokens"] if t not in SPECIAL_COLORS]
        regular_token_colors = dict(zip(regular_tokens, generate_colors(len(regular_tokens))))
        token_colors = {**SPECIAL_COLORS, **regular_token_colors}

        # Render tokens
        with st.container():
            st.markdown(render_tokens(result["tokens"], token_colors), unsafe_allow_html=True)

        st.markdown("Token IDs")
        st.code(", ".join(map(str, result["ids"])), language=None)
    else:
        st.markdown("### 0")
        st.markdown("Tokenized text")
        st.markdown("")
        st.markdown("Token IDs")
        st.text("")

# Footer
st.markdown("""
<div style="position: fixed; bottom: 0; width: 100%; text-align: center; padding: 10px; background-color: white;">
    <a href="https://github.com/malibayram/tokenizer" target="_blank">View on GitHub</a>
</div>
""", unsafe_allow_html=True)

# Cleanup the temporary tokenizer files when the process exits
def cleanup():
    if Path("temp_tokenizer").exists():
        shutil.rmtree("temp_tokenizer")

atexit.register(cleanup)
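
Since the app fetches everything it needs from the GitHub contents API at startup, a quick standalone connectivity check can be useful before deploying. This is a minimal sketch (not part of `app.py`) that mirrors `fetch_github_file` for a single file:

```python
# Standalone check of the GitHub contents API call used by app.py.
# Run with plain Python; it fetches one of the required tokenizer files.
import base64
import requests

repo = "malibayram/tokenizer"
path = "turkish_tokenizer/turkish_tokenizer.py"
url = f"https://api.github.com/repos/{repo}/contents/{path}?ref=main"

response = requests.get(url, timeout=10)
response.raise_for_status()
source = base64.b64decode(response.json()["content"]).decode("utf-8")
print(f"Fetched {path}: {len(source.splitlines())} lines")
```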
requirements.txt
ADDED
@@ -0,0 +1,5 @@

streamlit>=1.24.0
numpy>=1.21.0
json5>=0.9.0
requests>=2.31.0
pathlib>=1.0.1