<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description" content="Distributed Translation System for translating the DataTonic/dark_thoughts_case_study_merged dataset across multiple languages using RunPod and Ollama.">
<meta name="keywords" content="Distributed Translation, RunPod, Ollama, Dark Thoughts Dataset">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Distributed Translation System for Dark Thoughts Dataset</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/favicon.svg">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">Distributed Translation System for Dark Thoughts Dataset</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">Your Name or Team</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<span class="link-block">
<a href="https://github.com/yourusername/distributed-translation" target="_blank" class="external-link button is-normal is-rounded is-dark">
<span class="icon"><i class="fab fa-github"></i></span>
<span>Code</span>
</a>
</span>
<span class="link-block">
<a href="https://huggingface.co/datasets/DataTonic/dark_thoughts_case_study_merged" target="_blank" class="external-link button is-normal is-rounded is-dark">
<span class="icon"><i class="far fa-images"></i></span>
<span>Data</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Overview</h2>
<div class="content has-text-justified">
<p>
This project implements a distributed translation system that uses RunPod and Ollama to translate the <a href="https://huggingface.co/datasets/DataTonic/dark_thoughts_case_study_merged" target="_blank">DataTonic/dark_thoughts_case_study_merged</a> dataset into multiple languages. The system separates the thinking content from each response and translates the two components independently.
</p>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Architecture</h2>
<div class="content has-text-justified">
<p>The system consists of several components:</p>
<ol>
<li><strong>RunPod API Client</strong> (<code>runpodapi.py</code>): Handles communication with the RunPod API for creating, managing, and monitoring pods.</li>
<li><strong>RunPod Command Executor</strong> (<code>runcommandsrunpod.py</code>): Executes commands on RunPod instances and checks their readiness.</li>
<li><strong>RunPod Launcher</strong> (<code>runpodlauncher.py</code>): Manages the launching and coordination of multiple RunPod instances.</li>
<li><strong>RunPod Manager</strong> (<code>runpodmanager.py</code>): High-level manager for RunPod instances used for distributed translation.</li>
<li><strong>Ollama Client</strong> (<code>ollamaclient.py</code>): Async client for interacting with Ollama API and distributing translation tasks.</li>
<li><strong>Translation Coordinator</strong> (<code>translationcoordinator.py</code>): Orchestrates the translation process across dataset splits and languages.</li>
<li><strong>Data Processor</strong> (<code>dataprocessor.py</code>): Handles loading, processing, and saving the translated dataset.</li>
<li><strong>Main Script</strong> (<code>translate.py</code>): Entry point for running the distributed translation process.</li>
<li><strong>Test Scripts</strong> (<code>test_translation.py</code>, <code>test_parsing.py</code>): Test the distributed translation system and the thinking-content parsing.</li>
</ol>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Requirements</h2>
<div class="content has-text-justified">
<ul>
<li>Python 3.8+</li>
<li>RunPod API key</li>
<li>Access to RunPod GPU instances</li>
<li>The following Python packages: <code>aiohttp</code>, <code>datasets</code>, <code>pandas</code>, <code>tqdm</code>, <code>requests</code>, <code>pydantic</code> (<code>asyncio</code> is also used, but it ships with the Python standard library and needs no separate installation)</li>
</ul>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Installation</h2>
<div class="content has-text-justified">
<ol>
<li>Clone the repository:
<pre><code>git clone https://github.com/yourusername/distributed-translation.git
cd distributed-translation</code></pre>
</li>
<li>Install the required packages:
<pre><code>pip install -r requirements.txt</code></pre>
</li>
<li>Set up your RunPod API key:
<pre><code>export RUNPOD_API_KEY=your_runpod_api_key</code></pre>
</li>
</ol>
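<p>As a rough illustration of how the key from step 3 might be picked up, the sketch below prefers a value passed explicitly (for example through <code>--api-key</code>) and falls back to the <code>RUNPOD_API_KEY</code> environment variable. The function name <code>resolve_api_key</code> is hypothetical and is not taken from the project's code.</p>
<pre><code>import os
from typing import Optional


def resolve_api_key(cli_value: Optional[str] = None) -> str:
    """Prefer an explicit value, then fall back to RUNPOD_API_KEY."""
    key = cli_value or os.environ.get("RUNPOD_API_KEY", "")
    if not key:
        # Fail early with a message that names both ways to supply the key.
        raise RuntimeError(
            "RunPod API key missing: set RUNPOD_API_KEY or pass --api-key"
        )
    return key</code></pre>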
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Dataset Structure</h2>
<div class="content has-text-justified">
<p>The system works with the DataTonic/dark_thoughts_case_study_merged dataset, which contains:</p>
<ul>
<li>English split: 20,711 examples</li>
<li>Chinese split: 20,204 examples</li>
</ul>
<p>The system parses thinking content (text before <code>&lt;/think&gt;</code>) from responses and translates both components separately.</p>
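<p>A minimal sketch of this parsing step, assuming the closing marker appears at most once per response. The names <code>THINK_END</code> and <code>split_thinking</code> are hypothetical, and stripping a matching opening tag is an added assumption that the description above does not state.</p>
<pre><code>from typing import Tuple

THINK_END = "&lt;/think&gt;"  # closing marker that ends the thinking content


def split_thinking(raw: str, marker: str = THINK_END) -> Tuple[str, str]:
    """Return (thinking, response) for one raw model output."""
    before, found, after = raw.partition(marker)
    if not found:
        # No marker: treat the whole text as the response.
        return "", raw.strip()
    # Assumption: drop an optional opening tag so only the thinking text remains.
    opening = marker.replace("/", "", 1)
    thinking = before.replace(opening, "", 1).strip()
    return thinking, after.strip()</code></pre>
<p>A response without the marker yields an empty thinking string, which lets later stages skip translating it.</p>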
<p>The final dataset structure follows this model:</p>
<pre><code>from pydantic import BaseModel


class Feature(BaseModel):
    id: int
    thinking: str             # text before &lt;/think&gt;
    response: str             # remainder of the original response
    thinking_translated: str
    response_translated: str
    query: str
    source_data: str
    category: str
    endpoint: str
    source: str</code></pre>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Usage</h2>
<h3 class="title is-4">Running the Translation Process</h3>
<div class="content has-text-justified">
<p>To run the full translation process:</p>
<pre><code>python translate.py --pod-count 40 --batch-size 16 --max-tokens 100</code></pre>
<p>Additional options:</p>
<pre><code>--api-key TEXT        RunPod API key (defaults to RUNPOD_API_KEY environment variable)
--pod-count INTEGER   Number of RunPod instances to launch (default: 40)
--dataset TEXT        Dataset name or path (default: DataTonic/dark_thoughts_case_study_merged)
--output-dir TEXT     Output directory for translated data (default: translated_dataset)
--batch-size INTEGER  Batch size for translation (default: 16)
--max-tokens INTEGER  Maximum number of tokens to generate (default: 100)
--gpu-type TEXT       GPU type ID for RunPod instances (default: NVIDIA RTX A5000)
--image TEXT          Docker image name (default: tonic01/ollama-gemmax2)
--model TEXT          Model name for translation (default: gemmax2)
--cleanup             Terminate all pods after completion
--prepare-only        Only prepare the dataset without translating
--process-only        Only process the translated dataset
--validate            Validate dataset structure after processing</code></pre>
</div>
<h3 class="title is-4">Testing the System</h3>
<div class="content has-text-justified">
<p>To test the system components:</p>
<pre><code>python test_translation.py --test all</code></pre>
<p>To test the parsing functionality:</p>
<pre><code>python test_parsing.py --test all</code></pre>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Translation Process</h2>
<div class="content has-text-justified">
<p>The translation process follows these steps:</p>
<ol>
<li><strong>Preparation</strong>: Parse the dataset to separate thinking content from responses.</li>
<li><strong>Setup</strong>: Launch the RunPod instances (40 by default, set with <code>--pod-count</code>) using the <code>tonic01/ollama-gemmax2</code> Docker image.</li>
<li><strong>Readiness Check</strong>: Wait for all pods to be ready and for Ollama to be initialized with the required model.</li>
<li><strong>Translation</strong>: For each dataset split (English and Chinese):
<ul>
<li>Translate the thinking and response fields separately into all target languages.</li>
<li>Skip empty thinking content to avoid unnecessary translation requests.</li>
<li>Save intermediate results periodically.</li>
</ul>
</li>
<li><strong>Processing</strong>: Merge translations and create a Hugging Face dataset structure.</li>
<li><strong>Validation</strong>: Ensure the dataset structure matches the required Feature model.</li>
<li><strong>Cleanup</strong>: Terminate all pods if requested.</li>
</ol>
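<p>The translation step above can be sketched roughly as follows: texts are assigned to pods round-robin, each pod handles a bounded number of requests at a time, and empty texts are skipped. This is an illustration under stated assumptions, not the project's coordinator. The names <code>translate_all</code>, <code>fake_translate</code>, and <code>per_pod</code> are hypothetical, and <code>fake_translate</code> stands in for a real call to a pod.</p>
<pre><code>import asyncio
from typing import Awaitable, Callable, List

# Hypothetical signature: (endpoint, text) -> translated text.
TranslateFn = Callable[[str, str], Awaitable[str]]


async def translate_all(texts: List[str], endpoints: List[str],
                        translate: TranslateFn, per_pod: int = 4) -> List[str]:
    """Translate texts concurrently, assigning them to pods round-robin."""
    # One semaphore per pod caps the number of in-flight requests it receives.
    limits = {ep: asyncio.Semaphore(per_pod) for ep in endpoints}

    async def one(index: int, text: str) -> str:
        if not text.strip():
            return ""  # skip empty thinking content without calling a pod
        endpoint = endpoints[index % len(endpoints)]
        async with limits[endpoint]:
            return await translate(endpoint, text)

    # gather() returns the results in the same order as the inputs.
    return await asyncio.gather(*(one(i, t) for i, t in enumerate(texts)))


async def fake_translate(endpoint: str, text: str) -> str:
    """Stand-in for a network call to a pod; only tags the text."""
    await asyncio.sleep(0)
    return f"[{endpoint}] {text}"</code></pre>
<p>It can be tried without any pods by running <code>asyncio.run(translate_all(["one", "", "three"], ["pod-a", "pod-b"], fake_translate))</code>.</p>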
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Supported Languages</h2>
<div class="content has-text-justified">
<p>The system supports translation between the following 28 languages:</p>
<p>Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, Polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.</p>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Error Handling and Recovery</h2>
<div class="content has-text-justified">
<p>The system includes several error handling and recovery mechanisms:</p>
<ul>
<li><strong>Retry Logic</strong>: Failed translations are automatically retried.</li>
<li><strong>Checkpointing</strong>: Intermediate results are saved periodically to allow resuming from failures.</li>
<li><strong>Health Checks</strong>: Pod and Ollama health are checked before starting translation.</li>
<li><strong>Empty Content Handling</strong>: Empty thinking content is handled efficiently to avoid unnecessary translations.</li>
<li><strong>Graceful Termination</strong>: Resources are properly cleaned up on completion or failure.</li>
</ul>
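<p>The retry behaviour is not specified in detail here, so the following is only one plausible shape for it: a helper that retries a failing coroutine with exponentially growing delays. The name <code>with_retries</code> and its parameters are hypothetical, and real code would likely catch a narrower exception type than <code>Exception</code>.</p>
<pre><code>import asyncio
from typing import Awaitable, Callable


async def with_retries(call: Callable[[], Awaitable[str]], attempts: int = 3,
                       base_delay: float = 1.0) -> str:
    """Await call(), retrying failures with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before trying again.
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")</code></pre>
<p>Wrapping each translation request this way keeps transient pod errors from failing a whole batch, while persistent errors still propagate to the caller.</p>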
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Docker Image Requirements</h2>
<div class="content has-text-justified">
<p>The <code>tonic01/ollama-gemmax2</code> Docker image should have:</p>
<ol>
<li>Ollama installed and configured to run on port 11434</li>
<li>The GemmaX2-28-2B-v0.1 model pre-loaded or configured to load automatically</li>
<li>Access to a GPU with sufficient memory (at least 24 GB recommended); this depends on the pod's GPU type rather than on the image itself</li>
</ol>
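<p>To show what a request to such an image might look like, the sketch below builds (without sending) a POST request for Ollama's <code>/api/generate</code> endpoint on port 11434. The helper names are hypothetical, and the prompt wording is an assumption that should be checked against the model card for GemmaX2.</p>
<pre><code>import json
import urllib.request

OLLAMA_PORT = 11434  # port named in the requirements above


def build_generate_request(host: str, model: str, prompt: str,
                           max_tokens: int = 100) -> urllib.request.Request:
    """Build (but do not send) a POST request for Ollama's /api/generate."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON reply instead of a stream
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }
    return urllib.request.Request(
        f"http://{host}:{OLLAMA_PORT}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translation_prompt(text: str, source: str, target: str) -> str:
    # Assumed prompt wording; check the model card for the exact template.
    return f"Translate this from {source} to {target}:\n{source}: {text}\n{target}:"</code></pre>
<p>Passing the request to <code>urllib.request.urlopen</code> and reading the <code>response</code> field of the returned JSON would yield the generated text. A GET request to <code>/api/tags</code> on the same port lists the models Ollama has available, which can serve as a simple readiness probe.</p>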
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Example Workflow</h2>
<div class="content has-text-justified">
<ol>
<li><strong>Prepare Dataset</strong>:
<pre><code>python translate.py --prepare-only</code></pre>
</li>
<li><strong>Run Translation</strong>:
<pre><code>python translate.py --pod-count 40</code></pre>
</li>
<li><strong>Process Results Only</strong>:
<pre><code>python translate.py --process-only --validate</code></pre>
</li>
<li><strong>Cleanup</strong>:
<pre><code>python test_translation.py --test termination</code></pre>
</li>
</ol>
</div>
</div>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Troubleshooting</h2>
<div class="content has-text-justified">
<ul>
<li><strong>API Key Issues</strong>: Ensure your RunPod API key is correctly set in the environment variable or passed as a parameter.</li>
<li><strong>GPU Availability</strong>: Check RunPod for GPU availability if pod creation fails.</li>
<li><strong>Model Loading</strong>: If the Ollama readiness check times out, the model may be too large for the selected GPU type.</li>
<li><strong>Translation Errors</strong>: Check the logs for specific error messages. Most translation errors are automatically retried.</li>
<li><strong>Dataset Structure</strong>: Run with the <code>--validate</code> flag to ensure the dataset structure matches the required Feature model.</li>
</ul>
</div>
</div>
</div>
</div>
</section>
<section class="section" id="License">
<div class="container is-max-desktop content">
<h2 class="title">License</h2>
<div class="content has-text-justified">
<p>This project is licensed under the Apache 2.0 License; see the <a href="LICENSE" target="_blank">LICENSE</a> file for details.</p>
</div>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="content has-text-centered">
<a class="icon-link" href="https://github.com/yourusername/distributed-translation" target="_blank">
<i class="fab fa-github"></i>
</a>
</div>
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
<p>
This means you are free to borrow the <a href="https://github.com/yourusername/distributed-translation" target="_blank">source code</a> of this website; we just ask that you link back to this page in the footer.
</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>