---
title: Semantic Deduplication
emoji: 🧹
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Deduplicate HuggingFace datasets in seconds
hf_oauth: true
hf_oauth_scopes:
  - write-repos
  - manage-repos
---

# Semantic Text Deduplication Using SemHash

This Gradio application performs **semantic deduplication** on HuggingFace datasets using [SemHash](https://github.com/MinishLab/semhash) with [Model2Vec](https://github.com/MinishLab/model2vec) embeddings.

## Features

- **Two deduplication modes**:
  - **Single dataset**: Find and remove duplicates within one dataset
  - **Cross-dataset**: Remove entries from Dataset 2 that are similar to entries in Dataset 1
- **Customizable similarity threshold**: Control how strict the deduplication should be (0.0 = very loose, 1.0 = exact matches only)
- **Detailed results**: View statistics and examples of found duplicates, with word-level differences highlighted
- **Hub integration**: 🆕 **Push deduplicated datasets directly to the Hugging Face Hub** after logging in

## How to Use

### 1. Choose a Deduplication Type

- **Cross-dataset**: Useful for removing training-data contamination from test sets
- **Single dataset**: Clean up duplicate entries within a single dataset

### 2. Configure Datasets

- Enter the HuggingFace dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
- Specify the dataset splits (e.g., `train`, `test`, `validation`)
- Set the text column name (usually `text`, `sentence`, or `content`)

### 3. Set the Similarity Threshold

- **0.9** (default): Good balance between precision and recall
- **Higher values** (0.95-0.99): More conservative; only removes very similar texts
- **Lower values** (0.7-0.85): More aggressive; may remove texts that are semantically similar but not true duplicates

### 4. Run Deduplication

Click **"Deduplicate"** to start the process.
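Outside the app, the same single-dataset run can be sketched with the SemHash library directly. This is a minimal sketch assuming SemHash's documented `from_records` / `self_deduplicate` API; the function name and its defaults are illustrative (not taken from `app.py`), and the body is not executed here because it downloads a dataset and an embedding model.

```python
# Sketch only: mirrors the app's single-dataset mode with the SemHash API.
# `deduplicate_split` is a hypothetical helper, not a function from app.py.
def deduplicate_split(dataset_name: str, split: str, text_column: str,
                      threshold: float = 0.9):
    from datasets import load_dataset  # requires `datasets`
    from semhash import SemHash        # requires `semhash`

    ds = load_dataset(dataset_name, split=split)
    # Index the chosen text column, then deduplicate the dataset against
    # itself at the given similarity threshold.
    semhash = SemHash.from_records(records=ds[text_column])
    return semhash.self_deduplicate(threshold=threshold).selected
```

Cross-dataset mode corresponds to building the index from Dataset 1's records and then calling `deduplicate` with Dataset 2's records instead of `self_deduplicate`.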
You'll see:

- Loading progress for datasets
- Deduplication progress
- Results with statistics and example duplicates

### 5. Push to Hub (New!)

After deduplication completes:

1. **Log in** with your Hugging Face account using the login button
2. Enter a **dataset name** for your cleaned dataset
3. Click **"Push to Hub"** to upload the deduplicated dataset

The dataset will be saved as `your-username/dataset-name` and will be publicly available.

## Notes

- The app preserves all original columns from the datasets
- Only text similarity is used for deduplication decisions
- Deduplicated datasets keep the same structure as the originals
- OAuth login is required only for pushing to the Hub, not for deduplication
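Since only text similarity drives the decision, the cross-dataset filtering rule is easy to illustrate with a stdlib-only toy. The hand-made vectors below stand in for real Model2Vec embeddings, and the texts and threshold are hypothetical; the app itself uses SemHash, not this code.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical pre-computed embeddings: dataset 1 is the reference set,
# dataset 2 is filtered against it (the app's cross-dataset mode).
dataset1 = {
    "the cat sat":        [0.9, 0.1, 0.0],
    "stock prices rose":  [0.1, 0.9, 0.1],
}
dataset2 = {
    "a cat was sitting":  [0.88, 0.12, 0.02],  # near-duplicate of "the cat sat"
    "rain tomorrow":      [0.0, 0.1, 0.95],    # unrelated to dataset 1
}

THRESHOLD = 0.9  # the app's default similarity threshold

# Keep only dataset-2 entries that are NOT similar to any dataset-1 entry.
kept = {
    text: vec
    for text, vec in dataset2.items()
    if all(cosine(vec, ref) < THRESHOLD for ref in dataset1.values())
}
# "a cat was sitting" is dropped; "rain tomorrow" survives.
```

Raising `THRESHOLD` toward 0.99 keeps more of dataset 2 (only near-exact matches are dropped); lowering it toward 0.7 removes entries that are merely related.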