<html>
<head>
<meta charset="utf-8">
<meta name="description" content="Distributed Translation System for translating the DataTonic/dark_thoughts_case_study_merged dataset across multiple languages using RunPod and Ollama.">
<meta name="keywords" content="Distributed Translation, RunPod, Ollama, Dark Thoughts Dataset">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Distributed Translation System for Dark Thoughts Dataset</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/favicon.svg">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">Distributed Translation System for Dark Thoughts Dataset</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">Your Name or Team</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<span class="link-block">
<a href="https://github.com/yourusername/distributed-translation" target="_blank" class="external-link button is-normal is-rounded is-dark">
<span class="icon"><i class="fab fa-github"></i></span>
<span>Code</span>
</a>
</span>
<span class="link-block">
<a href="https://huggingface.co/datasets/DataTonic/dark_thoughts_case_study_merged" target="_blank" class="external-link button is-normal is-rounded is-dark">
<span class="icon"><i class="far fa-images"></i></span>
<span>Data</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Overview</h2>
<div class="content has-text-justified">
<p>
This project implements a distributed translation system that uses RunPod and Ollama to translate the <a href="https://huggingface.co/datasets/DataTonic/dark_thoughts_case_study_merged" target="_blank">DataTonic/dark_thoughts_case_study_merged</a> dataset into multiple languages. The system separates the thinking content from the rest of each response and translates the two parts separately.
</p>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Architecture</h2>
<div class="content has-text-justified">
<p>The system consists of several components:</p>
<ol>
<li><strong>RunPod API Client</strong> (<code>runpodapi.py</code>): Handles communication with the RunPod API for creating, managing, and monitoring pods.</li>
<li><strong>RunPod Command Executor</strong> (<code>runcommandsrunpod.py</code>): Executes commands on RunPod instances and checks their readiness.</li>
<li><strong>RunPod Launcher</strong> (<code>runpodlauncher.py</code>): Manages the launching and coordination of multiple RunPod instances.</li>
<li><strong>RunPod Manager</strong> (<code>runpodmanager.py</code>): High-level manager for the RunPod instances used for distributed translation.</li>
<li><strong>Ollama Client</strong> (<code>ollamaclient.py</code>): Async client for interacting with the Ollama API and distributing translation tasks.</li>
<li><strong>Translation Coordinator</strong> (<code>translationcoordinator.py</code>): Orchestrates the translation process across dataset splits and languages.</li>
<li><strong>Data Processor</strong> (<code>dataprocessor.py</code>): Handles loading, processing, and saving the translated dataset.</li>
<li><strong>Main Script</strong> (<code>translate.py</code>): Entry point for running the distributed translation process.</li>
<li><strong>Test Scripts</strong> (<code>test_translation.py</code>, <code>test_parsing.py</code>): Test the functionality of the distributed translation system.</li>
</ol>
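<p>As a rough illustration of what the Ollama Client sends to each pod, the sketch below builds a request for Ollama's <code>/api/generate</code> endpoint, using <code>num_predict</code> to cap the generated tokens. It is not the code in <code>ollamaclient.py</code>: the helper names are hypothetical, the standard-library <code>urllib</code> stands in for the project's async client, and the prompt template is an assumption about what the translation model expects.</p>

```python
import json
import urllib.request


def build_prompt(text: str, source_language: str, target_language: str) -> str:
    """Format a translation prompt (the template itself is an assumption)."""
    return (
        f"Translate this from {source_language} to {target_language}:\n"
        f"{source_language}: {text}\n"
        f"{target_language}:"
    )


def build_generate_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 100
) -> urllib.request.Request:
    """Build (but do not send) a POST request for Ollama's /api/generate."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

<p>Passing the returned object to <code>urllib.request.urlopen</code> would send it; the translated text is then expected in the <code>response</code> field of the JSON reply.</p>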
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Requirements</h2>
<div class="content has-text-justified">
<ul>
<li>Python 3.8+</li>
<li>RunPod API key</li>
<li>Access to RunPod GPU instances</li>
<li>The following Python packages: <code>aiohttp</code>, <code>datasets</code>, <code>pandas</code>, <code>tqdm</code>, <code>requests</code>, <code>pydantic</code> (<code>asyncio</code> is also used, but it ships with the Python standard library)</li>
</ul>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Installation</h2>
<div class="content has-text-justified">
<ol>
<li>Clone the repository:
<pre><code>git clone https://github.com/yourusername/distributed-translation.git
cd distributed-translation</code></pre>
</li>
<li>Install the required packages:
<pre><code>pip install -r requirements.txt</code></pre>
</li>
<li>Set up your RunPod API key:
<pre><code>export RUNPOD_API_KEY=your_runpod_api_key</code></pre>
</li>
</ol>
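<p>The key can come either from the environment variable set above or from the <code>--api-key</code> option. A minimal sketch of that lookup order follows; <code>resolve_api_key</code> is a hypothetical helper, not a function from the repository.</p>

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_value: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Return the RunPod API key, preferring an explicit value over the environment."""
    key = (cli_value or environ.get("RUNPOD_API_KEY", "")).strip()
    if not key:
        raise RuntimeError(
            "No RunPod API key found: pass --api-key or export RUNPOD_API_KEY."
        )
    return key
```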
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Dataset Structure</h2>
<div class="content has-text-justified">
<p>The system works with the DataTonic/dark_thoughts_case_study_merged dataset, which contains:</p>
<ul>
<li>English split: 20,711 examples</li>
<li>Chinese split: 20,204 examples</li>
</ul>
<p>The system parses the thinking content (the text before <code>&lt;/think&gt;</code>) out of each response and translates the thinking and the remaining response separately.</p>
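<p>The split on <code>&lt;/think&gt;</code> can be sketched as follows. This is an illustration rather than the project's parser: <code>split_thinking</code> is a hypothetical name, and stripping a leading <code>&lt;think&gt;</code> tag is an assumption about how the raw responses are formatted.</p>

```python
from typing import Tuple

THINK_END = "</think>"
THINK_START = "<think>"  # assumed opening tag; removed if present


def split_thinking(text: str) -> Tuple[str, str]:
    """Split raw text into (thinking, response) at the first closing think tag."""
    head, separator, tail = text.partition(THINK_END)
    if not separator:
        # No closing tag: treat the whole text as the response.
        return "", text.strip()
    thinking = head.strip()
    if thinking.startswith(THINK_START):
        thinking = thinking[len(THINK_START):].strip()
    return thinking, tail.strip()
```

<p>A response without the marker yields an empty thinking string, which is the case the translation step can skip.</p>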
<p>The final dataset structure follows this model:</p>
<pre><code>from pydantic import BaseModel


class Feature(BaseModel):
    id: int
    thinking: str
    response: str
    thinking_translated: str
    response_translated: str
    query: str
    source_data: str
    category: str
    endpoint: str
    source: str</code></pre>
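<p>To make the expected shape of a record concrete, the sketch below checks a plain dictionary against the field names and types of the model above. It is a standard-library stand-in for validating with the pydantic model, not the project's validator; <code>validate_record</code> is a hypothetical helper, and unlike pydantic, which may coerce some values, it only accepts exact types.</p>

```python
from typing import Any, Dict, List

# Field names and types copied from the Feature model.
FEATURE_FIELDS: Dict[str, type] = {
    "id": int,
    "thinking": str,
    "response": str,
    "thinking_translated": str,
    "response_translated": str,
    "query": str,
    "source_data": str,
    "category": str,
    "endpoint": str,
    "source": str,
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record fits the model."""
    problems: List[str] = []
    for name, expected in FEATURE_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    for name in record:
        if name not in FEATURE_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems
```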
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Usage</h2>
<h3 class="title is-4">Running the Translation Process</h3>
<div class="content has-text-justified">
<p>To run the full translation process:</p>
<pre><code>python translate.py --pod-count 40 --batch-size 16 --max-tokens 100</code></pre>
<p>Additional options:</p>
<pre><code>--api-key TEXT        RunPod API key (defaults to RUNPOD_API_KEY environment variable)
--pod-count INTEGER   Number of RunPod instances to launch (default: 40)
--dataset TEXT        Dataset name or path (default: DataTonic/dark_thoughts_case_study_merged)
--output-dir TEXT     Output directory for translated data (default: translated_dataset)
--batch-size INTEGER  Batch size for translation (default: 16)
--max-tokens INTEGER  Maximum number of tokens to generate (default: 100)
--gpu-type TEXT       GPU type ID for RunPod instances (default: NVIDIA RTX A5000)
--image TEXT          Docker image name (default: tonic01/ollama-gemmax2)
--model TEXT          Model name for translation (default: gemmax2)
--cleanup             Terminate all pods after completion
--prepare-only        Only prepare the dataset without translating
--process-only        Only process the translated dataset
--validate            Validate dataset structure after processing</code></pre>
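<p>For readers who want to see how these options fit together, here is a sketch of an equivalent parser built with the standard-library <code>argparse</code> module. The flags and defaults are taken from the list above; the use of <code>argparse</code> and the decision to make <code>--prepare-only</code> and <code>--process-only</code> mutually exclusive are assumptions, since the listing does not show how <code>translate.py</code> defines them.</p>

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Distributed translation (sketch)")
    parser.add_argument("--api-key", default=os.environ.get("RUNPOD_API_KEY"))
    parser.add_argument("--pod-count", type=int, default=40)
    parser.add_argument("--dataset", default="DataTonic/dark_thoughts_case_study_merged")
    parser.add_argument("--output-dir", default="translated_dataset")
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--max-tokens", type=int, default=100)
    parser.add_argument("--gpu-type", default="NVIDIA RTX A5000")
    parser.add_argument("--image", default="tonic01/ollama-gemmax2")
    parser.add_argument("--model", default="gemmax2")
    parser.add_argument("--cleanup", action="store_true")
    # Assumption: the two "only" modes cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--prepare-only", action="store_true")
    mode.add_argument("--process-only", action="store_true")
    parser.add_argument("--validate", action="store_true")
    return parser
```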
</div>
<h3 class="title is-4">Testing the System</h3>
<div class="content has-text-justified">
<p>To test the system components:</p>
<pre><code>python test_translation.py --test all</code></pre>
<p>To test the parsing functionality:</p>
<pre><code>python test_parsing.py --test all</code></pre>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Translation Process</h2>
<div class="content has-text-justified">
<p>The translation process follows these steps:</p>
<ol>
<li><strong>Preparation</strong>: Parse the dataset to separate thinking content from responses.</li>
<li><strong>Setup</strong>: Launch the RunPod instances (40 by default) with the <code>tonic01/ollama-gemmax2</code> Docker image.</li>
<li><strong>Readiness Check</strong>: Wait for all pods to be ready and for Ollama to be initialized with the required model.</li>
<li><strong>Translation</strong>: For each dataset split (English and Chinese):
<ul>
<li>Translate the thinking and response fields separately into all target languages.</li>
<li>Skip empty thinking content to avoid unnecessary translation requests.</li>
<li>Save intermediate results periodically.</li>
</ul>
</li>
<li><strong>Processing</strong>: Merge translations and create a Hugging Face dataset structure.</li>
<li><strong>Validation</strong>: Ensure the dataset structure matches the required Feature model.</li>
<li><strong>Cleanup</strong>: Terminate all pods if requested.</li>
</ol>
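<p>The translation step above can be pictured as three small operations: build one job per non-empty field, group jobs into batches, and spread the batches over the pods. The sketch below shows one way to do that. All function names are hypothetical, and round-robin assignment is an assumption; the coordinator may balance work differently.</p>

```python
from typing import Dict, Iterator, List, Sequence, Tuple

Job = Tuple[int, str, str]  # (example id, field name, text to translate)


def translation_jobs(
    rows: Sequence[Dict[str, object]],
    fields: Sequence[str] = ("thinking", "response"),
) -> List[Job]:
    """Create one job per non-empty field, skipping empty thinking content."""
    jobs: List[Job] = []
    for row in rows:
        for field in fields:
            text = str(row.get(field, ""))
            if text.strip():
                jobs.append((int(row["id"]), field, text))
    return jobs


def batched(items: Sequence[Job], batch_size: int) -> Iterator[Sequence[Job]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def assign_round_robin(
    batches: Sequence[Sequence[Job]], pod_count: int
) -> Dict[int, List[Sequence[Job]]]:
    """Assign batch i to pod i % pod_count."""
    if pod_count < 1:
        raise ValueError("pod_count must be at least 1")
    assignment: Dict[int, List[Sequence[Job]]] = {pod: [] for pod in range(pod_count)}
    for index, batch in enumerate(batches):
        assignment[index % pod_count].append(batch)
    return assignment
```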
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Supported Languages</h2>
<div class="content has-text-justified">
<p>The system supports translation between the following 28 languages:</p>
<p>Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, Polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.</p>
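<p>To give a sense of scale, the sketch below maps the languages to their ISO 639-1 codes and computes an upper bound on the number of translation requests. It assumes that each split is translated into every language except its own and that both fields are translated for every example; skipped empty thinking fields lower the real count. Keying the split sizes by language code is also an assumption made for the example.</p>

```python
from typing import Dict, List

# ISO 639-1 codes for the 28 languages listed above.
LANGUAGES: Dict[str, str] = {
    "ar": "Arabic", "bn": "Bengali", "cs": "Czech", "de": "German",
    "en": "English", "es": "Spanish", "fa": "Persian", "fr": "French",
    "he": "Hebrew", "hi": "Hindi", "id": "Indonesian", "it": "Italian",
    "ja": "Japanese", "km": "Khmer", "ko": "Korean", "lo": "Lao",
    "ms": "Malay", "my": "Burmese", "nl": "Dutch", "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian", "th": "Thai", "tl": "Tagalog",
    "tr": "Turkish", "ur": "Urdu", "vi": "Vietnamese", "zh": "Chinese",
}

# Split sizes from the Dataset Structure section (keys are assumed).
SPLIT_SIZES: Dict[str, int] = {"en": 20711, "zh": 20204}
FIELDS_PER_EXAMPLE = 2  # thinking and response


def target_languages(source_code: str) -> List[str]:
    """Return every supported language code except the source language."""
    if source_code not in LANGUAGES:
        raise KeyError(f"unsupported language code: {source_code}")
    return [code for code in LANGUAGES if code != source_code]


def max_translation_requests() -> int:
    """Upper bound: examples x target languages x fields, summed over splits."""
    return sum(
        size * len(target_languages(code)) * FIELDS_PER_EXAMPLE
        for code, size in SPLIT_SIZES.items()
    )
```

<p>Under these assumptions the bound is (20,711 + 20,204) × 27 × 2 = 2,209,410 requests, which is why the work is spread across many pods.</p>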
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Error Handling and Recovery</h2>
<div class="content has-text-justified">
<p>The system includes several error handling and recovery mechanisms:</p>
<ul>
<li><strong>Retry Logic</strong>: Failed translations are automatically retried.</li>
<li><strong>Checkpointing</strong>: Intermediate results are saved periodically to allow resuming from failures.</li>
<li><strong>Health Checks</strong>: Pod and Ollama health are checked before starting translation.</li>
<li><strong>Empty Content Handling</strong>: Empty thinking content is skipped rather than sent for translation.</li>
<li><strong>Graceful Termination</strong>: Resources are properly cleaned up on completion or failure.</li>
</ul>
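<p>Two of the mechanisms above, retry logic and checkpointing, are easy to illustrate in isolation. The sketch below retries a callable with exponential backoff and writes results to a JSON checkpoint through a temporary file. The retry count, the backoff schedule, the JSON format, and all function names are assumptions; the page does not specify how the project implements either mechanism.</p>

```python
import json
import os
import tempfile
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def save_checkpoint(path: str, results: Dict[str, str]) -> None:
    """Write results to a temp file, then rename it over the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(results, handle, ensure_ascii=False)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Dict[str, str]:
    """Return saved results, or an empty dict if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

<p>Writing to a temporary file first means an interrupted save leaves the previous checkpoint intact, so a resumed run never reads a half-written file.</p>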
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Docker Image Requirements</h2>
<div class="content has-text-justified">
<p>The <code>tonic01/ollama-gemmax2</code> Docker image, and the pod it runs on, should provide:</p>
<ol>
<li>Ollama installed and configured to run on port 11434</li>
<li>The GemmaX2-28-2B-v0.1 model pre-loaded or configured to load automatically</li>
<li>Sufficient GPU memory (at least 24 GB recommended)</li>
</ol>
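<p>One way to confirm the first two requirements on a running pod is to ask Ollama which models it has available through its <code>/api/tags</code> endpoint. The sketch below is a hedged example of such a check, not the project's readiness code: the function names are hypothetical, and matching a model name with or without its tag suffix is an assumption about how the model is registered in the image.</p>

```python
import json
import urllib.request
from typing import Any, Dict


def fetch_tags(base_url: str, timeout: float = 5.0) -> Dict[str, Any]:
    """Query Ollama's /api/tags endpoint, which lists locally available models."""
    url = f"{base_url.rstrip('/')}/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def model_is_available(tags: Dict[str, Any], model: str) -> bool:
    """Check whether model appears in a /api/tags payload, ignoring the tag suffix."""
    for entry in tags.get("models", []):
        name = entry.get("name", "")
        if name == model or name.split(":", 1)[0] == model:
            return True
    return False
```

<p>For example, <code>model_is_available(fetch_tags("http://localhost:11434"), "gemmax2")</code> would return <code>True</code> only once Ollama is reachable on port 11434 and reports the model.</p>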
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Example Workflow</h2>
<div class="content has-text-justified">
<ol>
<li><strong>Prepare Dataset</strong>:
<pre><code>python translate.py --prepare-only</code></pre>
</li>
<li><strong>Run Translation</strong>:
<pre><code>python translate.py --pod-count 40</code></pre>
</li>
<li><strong>Process Results Only</strong>:
<pre><code>python translate.py --process-only --validate</code></pre>
</li>
<li><strong>Cleanup</strong>:
<pre><code>python test_translation.py --test termination</code></pre>
</li>
</ol>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Troubleshooting</h2>
<div class="content has-text-justified">
<ul>
<li><strong>API Key Issues</strong>: Ensure your RunPod API key is correctly set in the <code>RUNPOD_API_KEY</code> environment variable or passed with <code>--api-key</code>.</li>
<li><strong>GPU Availability</strong>: Check RunPod for GPU availability if pod creation fails.</li>
<li><strong>Model Loading</strong>: If the Ollama readiness check times out, the model may be too large for the selected GPU type.</li>
<li><strong>Translation Errors</strong>: Check the logs for specific error messages. Most translation errors are automatically retried.</li>
<li><strong>Dataset Structure</strong>: Run with the <code>--validate</code> flag to ensure the dataset structure matches the required Feature model.</li>
</ul>
</div>
</div>
</div>
</div>
</section>
<section class="section" id="License">
<div class="container is-max-desktop content">
<h2 class="title">License</h2>
<div class="content has-text-justified">
<p>This project is licensed under the Apache 2.0 License; see the <a href="LICENSE" target="_blank">LICENSE</a> file for details.</p>
</div>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="content has-text-centered">
<a class="icon-link" href="https://github.com/yourusername/distributed-translation" target="_blank">
<i class="fab fa-github"></i>
</a>
</div>
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
<p>
This means you are free to borrow the <a href="https://github.com/yourusername/distributed-translation" target="_blank">source code</a> of this website; we just ask that you link back to this page in the footer.
</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>