# Turkish Tiktokenizer Web App
A Streamlit-based web interface for the Turkish Morphological Tokenizer. This app provides an interactive way to tokenize Turkish text with real-time visualization and color-coded token display.
## Features
- 🔤 Turkish text tokenization with morphological analysis
- 🎨 Color-coded token visualization
- 🔢 Token count and ID display
- 📊 Special token highlighting (uppercase, space, newline, etc.)
- 🔄 Version selection from GitHub commit history
- 🌐 Direct integration with GitHub repository
## Demo
You can try the live demo on [Hugging Face Spaces](https://huggingface.co/spaces/YOUR_USERNAME/turkish-tiktokenizer) (replace `YOUR_USERNAME` with the account that hosts your Space).
## Installation
1. Clone the repository:
```bash
git clone https://github.com/malibayram/tokenizer.git
cd tokenizer/streamlit_app
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
## Usage
1. Run the Streamlit app:
```bash
streamlit run app.py
```
2. Open your browser and navigate to http://localhost:8501
3. Enter Turkish text in the input area and click "Tokenize"
## How It Works
1. **Text Input**: Enter Turkish text in the left panel
2. **Tokenization**: Click the "Tokenize" button to process the text
3. **Visualization**:
   - Token count is displayed at the top
   - Tokens are shown with color-coding:
     - Special tokens (uppercase, space, etc.) have predefined colors
     - Regular tokens get unique colors for easy identification
   - Token IDs are displayed below the visualization
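
The color-coding described above could be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: the special-token names, the palette, and the helpers `color_for_id` and `render_tokens` are assumptions made for the example. It uses fixed colors for special tokens and an HSV hue step for everything else.

```python
import colorsys
import html

# Hypothetical special-token names and colors; the real app's values may differ.
SPECIAL_COLORS = {
    "<uppercase>": "#FF9999",
    "<space>": "#CCCCCC",
    "<newline>": "#99CCFF",
}

GOLDEN_RATIO_STEP = 0.618033988749895  # spreads consecutive hues far apart


def color_for_id(token_id: int) -> str:
    """Map a token ID to a pastel hex color by stepping the hue in HSV space."""
    hue = (token_id * GOLDEN_RATIO_STEP) % 1.0
    r, g, b = colorsys.hsv_to_rgb(hue, 0.5, 0.95)
    return "#{:02X}{:02X}{:02X}".format(round(r * 255), round(g * 255), round(b * 255))


def render_tokens(tokens: list[str], ids: list[int]) -> str:
    """Build an HTML string with one colored <span> per token."""
    spans = []
    for token, token_id in zip(tokens, ids):
        # Special tokens keep their predefined color; others derive one from the ID.
        color = SPECIAL_COLORS.get(token) or color_for_id(token_id)
        spans.append(
            f'<span style="background-color: {color}" title="ID {token_id}">'
            f"{html.escape(token)}</span>"
        )
    return "".join(spans)
```

In a Streamlit app, a string like this would typically be passed to `st.markdown(..., unsafe_allow_html=True)`. Deriving the color from the token ID keeps the same token the same color across runs.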
## Code Structure
- `app.py`: Main Streamlit application
  - UI components and layout
  - GitHub integration
  - Tokenization logic
  - Color generation and visualization
- `requirements.txt`: Python dependencies
## Technical Details
- **Tokenizer Source**: Fetched directly from GitHub repository
- **Caching**: Uses Streamlit's caching for better performance
- **Color Generation**: HSV-based algorithm for visually distinct colors
- **Session State**: Maintains text and results between interactions
- **Error Handling**: Graceful handling of GitHub API and tokenization errors
## Deployment to Hugging Face Spaces
1. Create a new Space:
   - Go to https://huggingface.co/spaces
   - Click "Create new Space"
   - Select "Streamlit" as the SDK
   - Choose a name for your Space
2. Upload files:
   - `app.py`
   - `requirements.txt`
3. The app will automatically deploy and be available at your Space's URL
## Contributing
1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## License
MIT License - see the [LICENSE](../LICENSE) file for details
## Acknowledgments
- Built by dqbd
- Created with the generous help from Diagram
- Based on the [Turkish Morphological Tokenizer](https://github.com/malibayram/tokenizer)