|
--- |
|
title: Semantic Deduplication |
|
emoji: 🧹 |
|
colorFrom: green |
|
colorTo: green |
|
sdk: gradio |
|
sdk_version: 5.0.2 |
|
app_file: app.py |
|
pinned: false |
|
license: mit |
|
short_description: Deduplicate HuggingFace datasets in seconds |
|
hf_oauth: true |
|
hf_oauth_scopes: |
|
- write-repo |
|
- manage-repo |
|
--- |
|
|
|
# Semantic Text Deduplication Using SemHash |
|
|
|
This Gradio application performs **semantic deduplication** on Hugging Face datasets using [SemHash](https://github.com/MinishLab/semhash) with [Model2Vec](https://github.com/MinishLab/model2vec) embeddings.
|
|
|
## Features |
|
|
|
- **Two deduplication modes**: |
|
- **Single dataset**: Find and remove duplicates within one dataset |
|
- **Cross-dataset**: Remove entries from Dataset 2 that are similar to entries in Dataset 1 |
|
|
|
- **Customizable similarity threshold**: Control how strict the deduplication should be (0.0 = very loose, 1.0 = exact matches only) |
|
|
|
- **Detailed results**: View statistics and examples of found duplicates with word-level differences highlighted |
|
|
|
- **Hub Integration**: 🆕 **Push deduplicated datasets directly to the Hugging Face Hub** after logging in |
|
|
|
## How to Use |
|
|
|
### 1. Choose Deduplication Type |
|
- **Cross-dataset**: Useful for removing training data contamination from test sets |
|
- **Single dataset**: Clean up duplicate entries within a single dataset |
|
|
|
### 2. Configure Datasets |
|
- Enter the Hugging Face dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
|
- Specify the dataset splits (e.g., `train`, `test`, `validation`) |
|
- Set the text column name (usually `text`, `sentence`, or `content`) |
|
|
|
### 3. Set Similarity Threshold |
|
- **0.9** (default): Good balance between precision and recall |
|
- **Higher values** (0.95-0.99): More conservative, only removes very similar texts |
|
- **Lower values** (0.7-0.85): More aggressive, may remove semantically similar but different texts |
|
|
|
### 4. Run Deduplication |
|
Click **"Deduplicate"** to start the process. You'll see: |
|
- Loading progress for datasets |
|
- Deduplication progress |
|
- Results with statistics and example duplicates |
|
|
|
### 5. Push to Hub (New!) |
|
After deduplication completes: |
|
1. **Log in** with your Hugging Face account using the login button |
|
2. Enter a **dataset name** for your cleaned dataset |
|
3. Click **"Push to Hub"** to upload the deduplicated dataset |
|
|
|
The dataset will be saved as `your-username/dataset-name` and will be publicly available.
|
|
|
## Technical Details |
|
|
|
- **Embedding Model**: Uses `minishlab/potion-base-8M` (Model2Vec) for fast, efficient text embeddings |
|
- **Deduplication Algorithm**: SemHash for scalable semantic similarity detection |
|
- **Backend**: Runs on CPU (may be slow for large datasets on free tier) |
|
|
|
## Local Usage |
|
|
|
For faster processing of large datasets, run locally: |
|
|
|
```bash |
|
git clone <repository-url> |
|
cd semantic-deduplication |
|
pip install -r requirements.txt |
|
python app.py |
|
``` |
|
|
|
## Examples |
|
|
|
### Cross-dataset Deduplication |
|
Remove test set contamination: |
|
- **Dataset 1**: `your-org/training-data` (split: `train`) |
|
- **Dataset 2**: `your-org/test-data` (split: `test`) |
|
- **Result**: Clean test set with training examples removed |
|
|
|
### Single Dataset Cleaning |
|
Remove duplicates from a dataset: |
|
- **Dataset 1**: `common_voice` (split: `train`) |
|
- **Result**: Training set with duplicate audio transcriptions removed |
|
|
|
## Notes |
|
|
|
- The app preserves all original columns from the datasets |
|
- Only the text similarity is used for deduplication decisions |
|
- Deduplicated datasets maintain the same structure as the original |
|
- OAuth login is required only for pushing to the Hub, not for deduplication |
|
|