alibayram committed on
Commit 26ddb6c
·
1 Parent(s): 91f52d5

Add application file

Files changed (3):
  1. README.md +100 -14
  2. app.py +247 -0
  3. requirements.txt +5 -0
README.md CHANGED
@@ -1,14 +1,100 @@
- ---
- title: Turkish Tiktokenizer
- emoji: 👍
- colorFrom: red
- colorTo: red
- sdk: streamlit
- sdk_version: 1.41.1
- app_file: app.py
- pinned: false
- license: cc-by-nc-nd-4.0
- short_description: Turkish Morphological Tokenizer
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Turkish Tiktokenizer Web App
+
+ A Streamlit-based web interface for the Turkish Morphological Tokenizer. The app provides an interactive way to tokenize Turkish text, with real-time, color-coded token visualization.
+
+ ## Features
+
+ - 🔀 Turkish text tokenization with morphological analysis
+ - 🎨 Color-coded token visualization
+ - 🔒 Token count and ID display
+ - 📊 Special token highlighting (uppercase, space, newline, etc.)
+ - 🔄 Version selection from GitHub commit history
+ - 🌐 Direct integration with the GitHub repository
+
+ ## Demo
+
+ You can try the live demo at [Hugging Face Spaces](https://huggingface.co/spaces/YOUR_USERNAME/turkish-tiktokenizer) (replace with your actual Spaces URL).
+
+ ## Installation
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/malibayram/tokenizer.git
+ cd tokenizer/streamlit_app
+ ```
+
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## Usage
+
+ 1. Run the Streamlit app:
+ ```bash
+ streamlit run app.py
+ ```
+
+ 2. Open your browser and navigate to http://localhost:8501
+
+ 3. Enter Turkish text in the input area and click "Tokenize"
+
+ ## How It Works
+
+ 1. **Text Input**: Enter Turkish text in the left panel
+ 2. **Tokenization**: Click the "Tokenize" button to process the text
+ 3. **Visualization** (the tokenizer result shape the app consumes is sketched after this list):
+    - The token count is displayed at the top
+    - Tokens are shown with color coding:
+      - Special tokens (uppercase, space, etc.) have predefined colors
+      - Regular tokens get unique colors for easy identification
+    - Token IDs are displayed below the visualization
+
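The visualization code assumes the tokenizer returns a dict with parallel `tokens` and `ids` lists, which is the shape `app.py` reads below. A minimal sketch of that shape, with made-up values:

```python
# Illustrative only: real tokens/ids depend on the vocabulary files
# (kokler_v05.json, ekler_v05.json, bpe_v05.json) fetched from GitHub.
result = {
    "tokens": ["<uppercase>", "merhaba", "<space>", "dünya"],
    "ids": [0, 123, 1, 456],
}
assert len(result["tokens"]) == len(result["ids"])  # rendered pairwise
```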
+ ## Code Structure
+
+ - `app.py`: Main Streamlit application
+   - UI components and layout
+   - GitHub integration
+   - Tokenization logic
+   - Color generation and visualization
+ - `requirements.txt`: Python dependencies
+
+ ## Technical Details
+
+ - **Tokenizer Source**: Fetched directly from the GitHub repository
+ - **Caching**: Uses Streamlit's caching for better performance
+ - **Color Generation**: HSV-based algorithm for visually distinct colors (see the sketch after this list)
+ - **Session State**: Maintains text and results between interactions
+ - **Error Handling**: Graceful handling of GitHub API and tokenization errors
+
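The color generation mentioned above mirrors `generate_colors` in `app.py` below: hues are spaced evenly around the HSV circle while saturation and value are jittered so adjacent tokens stay distinguishable. A condensed standalone version of the same idea:

```python
import colorsys

def distinct_colors(n):
    """Return n visually distinct pastel colors as hex strings."""
    colors = []
    for i in range(n):
        rgb = colorsys.hsv_to_rgb(
            i / n,                # hue: evenly spaced around the circle
            0.3 + (i % 3) * 0.1,  # saturation cycles through 0.3, 0.4, 0.5
            0.95 - (i % 2) * 0.1, # value alternates between 0.95 and 0.85
        )
        colors.append("#{:02x}{:02x}{:02x}".format(*(int(c * 255) for c in rgb)))
    return colors

print(distinct_colors(3))  # three pastel hex colors, one per token
```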
+ ## Deployment to Hugging Face Spaces
+
+ 1. Create a new Space:
+    - Go to https://huggingface.co/spaces
+    - Click "Create new Space"
+    - Select "Streamlit" as the SDK
+    - Choose a name for your Space
+
+ 2. Upload files (or script the upload, as sketched after this list):
+    - `app.py`
+    - `requirements.txt`
+
+ 3. The app will automatically deploy and be available at your Space's URL
+
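If you prefer to script step 2, the `huggingface_hub` client can push the files. A sketch, assuming you are already logged in via `huggingface-cli login` and reusing the placeholder Space id from the Demo section:

```python
from huggingface_hub import HfApi

api = HfApi()
for filename in ["app.py", "requirements.txt"]:
    api.upload_file(
        path_or_fileobj=filename,  # local file to upload
        path_in_repo=filename,     # same name at the Space root
        repo_id="YOUR_USERNAME/turkish-tiktokenizer",  # placeholder Space id
        repo_type="space",
    )
```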
+ ## Contributing
+
+ 1. Fork the repository
+ 2. Create your feature branch
+ 3. Commit your changes
+ 4. Push to the branch
+ 5. Create a Pull Request
+
+ ## License
+
+ MIT License - see the [LICENSE](../LICENSE) file for details
+
+ ## Acknowledgments
+
+ - UI inspired by [Tiktokenizer](https://github.com/dqbd/tiktokenizer), built by dqbd with generous help from Diagram
+ - Based on the [Turkish Morphological Tokenizer](https://github.com/malibayram/tokenizer)
app.py ADDED
@@ -0,0 +1,247 @@
+ import streamlit as st
+ import sys
+ from datetime import datetime
+ from pathlib import Path
+ import base64
+ import colorsys
+ import shutil
+ import atexit
+ import requests
+ import importlib.util
+
+ # Set page config - MUST BE THE FIRST STREAMLIT COMMAND
+ st.set_page_config(
+     page_title="Turkish Tiktokenizer",
+     page_icon="🇹🇷",
+     layout="wide"
+ )
+
+ # Initialize session state
+ if 'text' not in st.session_state:
+     st.session_state.text = ""
+ if 'token_results' not in st.session_state:
+     st.session_state.token_results = None
+
+ # Constants
+ GITHUB_REPO = "malibayram/tokenizer"
+ GITHUB_BRANCH = "main"
+
+ # Special tokens and their IDs
+ SPECIAL_TOKENS = {
+     "<uppercase>": 0,  # Uppercase letter marker
+     "<space>": 1,      # Space character
+     "<newline>": 2,    # Newline character
+     "<tab>": 3,        # Tab character
+     "<unknown>": 4     # Unknown token
+ }
+
+ # Plain-text labels shown in the visualization for special tokens
+ SPECIAL_TOKEN_SYMBOLS = {
+     "<uppercase>": "[uppercase]",
+     "<space>": "[space]",
+     "<newline>": "[newline]",
+     "<tab>": "[tab]",
+     "<unknown>": "[unknown]"
+ }
+
+ # Colors for special tokens
+ SPECIAL_COLORS = {
+     "<uppercase>": "#FF9999",  # Light red for uppercase markers
+     "<space>": "#CCCCCC",      # Gray for spaces
+     "<newline>": "#CCCCCC",    # Gray for newlines
+     "<tab>": "#CCCCCC",        # Gray for tabs
+     "<unknown>": "#FF0000"     # Red for unknown tokens
+ }
+
+ # Required files mapping: local name -> path in the GitHub repository
+ REQUIRED_FILES = {
+     'tokenizer.py': 'turkish_tokenizer/turkish_tokenizer.py',
+     'kokler_v05.json': 'turkish_tokenizer/kokler_v05.json',
+     'ekler_v05.json': 'turkish_tokenizer/ekler_v05.json',
+     'bpe_v05.json': 'turkish_tokenizer/bpe_v05.json'
+ }
+
+ # Token ID ranges (documentation only; not referenced elsewhere in this app)
+ TOKEN_RANGES = {
+     'special': (0, 4),          # Special tokens
+     'root_words': (5, 20000),   # Root words
+     'suffixes': (22268, 22767), # Suffixes
+     'bpe': (20000, None)        # BPE tokens (20000+)
+ }
+
+ def generate_colors(n):
+     """Generate n visually distinct colors."""
+     colors = []
+     for i in range(n):
+         hue = i / n
+         saturation = 0.3 + (i % 3) * 0.1  # Vary saturation between 0.3 and 0.5
+         value = 0.95 - (i % 2) * 0.1      # Vary value between 0.85 and 0.95
+         rgb = colorsys.hsv_to_rgb(hue, saturation, value)
+         hex_color = "#{:02x}{:02x}{:02x}".format(
+             int(rgb[0] * 255),
+             int(rgb[1] * 255),
+             int(rgb[2] * 255)
+         )
+         colors.append(hex_color)
+     return colors
+
+ def fetch_github_file(path, ref=GITHUB_BRANCH):
+     """Fetch file content from the GitHub repository via the contents API."""
+     url = f"https://api.github.com/repos/{GITHUB_REPO}/contents/{path}?ref={ref}"
+     response = requests.get(url, timeout=10)
+     if response.status_code == 200:
+         # The contents API returns the file body base64-encoded
+         content = base64.b64decode(response.json()['content']).decode('utf-8')
+         return content
+     else:
+         st.error(f"Could not fetch {path} from GitHub: {response.status_code}")
+         return None
+
+ @st.cache_resource
+ def load_tokenizer():
+     """Load and initialize the tokenizer from GitHub."""
+     temp_dir = Path("temp_tokenizer")
+     temp_dir.mkdir(exist_ok=True)
+
+     # Fetch required files
+     for local_name, github_path in REQUIRED_FILES.items():
+         content = fetch_github_file(github_path)
+         if content is None:
+             return None
+
+         with open(temp_dir / local_name, 'w', encoding='utf-8') as f:
+             f.write(content)
+
+     # Patch the tokenizer so load_json() resolves files inside temp_dir.
+     # The injected body ends with a return, so the original body that
+     # follows it in the source becomes unreachable dead code.
+     tokenizer_path = temp_dir / "tokenizer.py"
+     with open(tokenizer_path, 'r', encoding='utf-8') as f:
+         tokenizer_code = f.read()
+
+     modified_code = tokenizer_code.replace(
+         'def load_json(filename):',
+         f'''def load_json(filename):
+     full_path = os.path.join("{temp_dir.absolute()}", filename)
+     with open(full_path, 'r', encoding='utf-8') as file:
+         return json.load(file)'''
+     )
+     # The injected code relies on os and json; prepending the imports is a
+     # no-op if tokenizer.py already imports them.
+     modified_code = "import os\nimport json\n" + modified_code
+
+     with open(tokenizer_path, 'w', encoding='utf-8') as f:
+         f.write(modified_code)
+
+     # Import the patched file as a regular Python module
+     spec = importlib.util.spec_from_file_location("tokenizer", str(temp_dir / "tokenizer.py"))
+     module = importlib.util.module_from_spec(spec)
+     sys.modules["tokenizer"] = module
+     spec.loader.exec_module(module)
+
+     return module.tokenize
+
+ @st.cache_data(ttl=3600)
+ def get_commit_history():
+     """Fetch commit history from GitHub."""
+     url = f"https://api.github.com/repos/{GITHUB_REPO}/commits"
+     try:
+         response = requests.get(url, timeout=10)
+         if response.status_code == 200:
+             commits = response.json()
+             versions = []
+             for commit in commits[:10]:
+                 date = datetime.strptime(commit['commit']['author']['date'], '%Y-%m-%dT%H:%M:%SZ').strftime('%Y-%m-%d')
+                 sha = commit['sha'][:7]
+                 message = commit['commit']['message'].split('\n')[0][:50]
+                 versions.append(f"{date} - {sha} - {message}")
+             return versions
+         return ["latest"]
+     except Exception as e:
+         st.warning(f"Could not fetch commit history: {str(e)}")
+         return ["latest"]
+
+ def render_tokens(tokens, token_colors):
+     """Render colored token visualization as inline HTML."""
+     html_tokens = []
+     for token in tokens:
+         color = token_colors[token]
+         display_text = SPECIAL_TOKEN_SYMBOLS.get(token, token)  # Use label for special tokens
+         html_tokens.append(
+             f'<span style="background-color: {color}; padding: 2px 4px; margin: 2px; border-radius: 3px;" title="{token}">{display_text}</span>'
+         )
+     return " ".join(html_tokens)
+
+ # Load tokenizer
+ tokenize = load_tokenizer()
+ if tokenize is None:
+     st.error("Failed to load tokenizer from GitHub")
+     st.stop()
+
+ # UI Layout
+ st.title("🇹🇷 Turkish Tiktokenizer")
+
+ # Model selection (note: the choice is displayed only; load_tokenizer()
+ # always fetches GITHUB_BRANCH, not the selected commit)
+ versions = get_commit_history()
+ model = st.selectbox("", versions, key="model_selection", label_visibility="collapsed")
+
+ # Main layout
+ col1, col2 = st.columns([0.4, 0.6])
+
+ # Input column
+ with col1:
+     text = st.text_area(
+         "Enter Turkish text to tokenize",
+         value=st.session_state.text,
+         height=200,
+         key="text_input",
+         label_visibility="collapsed",
+         placeholder="Enter Turkish text to tokenize"
+     )
+
+     if st.button("Tokenize", type="primary"):
+         st.session_state.text = text
+         if text.strip():
+             try:
+                 st.session_state.token_results = tokenize(text)
+             except Exception as e:
+                 st.session_state.token_results = None
+                 st.error(f"Error tokenizing text: {str(e)}")
+         else:
+             st.session_state.token_results = None
+
+ # Results column
+ with col2:
+     st.markdown("Token count")
+     if st.session_state.token_results is not None:
+         result = st.session_state.token_results
+         token_count = len(result["tokens"])
+         st.markdown(f"### {token_count}")
+
+         st.markdown("Tokenized text")
+
+         # Generate token colors
+         regular_tokens = [t for t in result["tokens"] if t not in SPECIAL_COLORS]
+         regular_token_colors = dict(zip(regular_tokens, generate_colors(len(regular_tokens))))
+         token_colors = {**SPECIAL_COLORS, **regular_token_colors}
+
+         # Render tokens
+         with st.container():
+             st.markdown(render_tokens(result["tokens"], token_colors), unsafe_allow_html=True)
+
+         st.markdown("Token IDs")
+         st.code(", ".join(map(str, result["ids"])), language=None)
+     else:
+         st.markdown("### 0")
+         st.markdown("Tokenized text")
+         st.markdown("")
+         st.markdown("Token IDs")
+         st.text("")
+
+ # Footer
+ st.markdown("""
+ <div style="position: fixed; bottom: 0; width: 100%; text-align: center; padding: 10px; background-color: white;">
+ <a href="https://github.com/malibayram/tokenizer" target="_blank">View on GitHub</a>
+ </div>
+ """, unsafe_allow_html=True)
+
+ # Cleanup the temp directory on process exit
+ def cleanup():
+     if Path("temp_tokenizer").exists():
+         shutil.rmtree("temp_tokenizer")
+
+ atexit.register(cleanup)
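The dynamic import at the heart of `load_tokenizer` is standard `importlib` usage: write the file to disk, build a module spec from its path, and execute it. A minimal, self-contained sketch of the same pattern (names and paths are illustrative):

```python
import importlib.util
import sys
from pathlib import Path

def import_from_path(name, path):
    """Execute a Python file and return it as a module object."""
    spec = importlib.util.spec_from_file_location(name, str(path))
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module       # register before exec so imports resolve
    spec.loader.exec_module(module)  # run the file's top-level code
    return module

# Hypothetical usage mirroring app.py:
# tokenizer = import_from_path("tokenizer", Path("temp_tokenizer/tokenizer.py"))
# print(tokenizer.tokenize("Merhaba dünya"))
```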
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ streamlit>=1.24.0
+ numpy>=1.21.0
+ json5>=0.9.0
+ requests>=2.31.0
+ pathlib>=1.0.1