|
--- |
|
title: Semantic Deduplication |
|
emoji: 🧹 |
|
colorFrom: green |
|
colorTo: green |
|
sdk: gradio |
|
sdk_version: 5.0.2 |
|
app_file: app.py |
|
pinned: false |
|
license: mit |
|
short_description: Deduplicate HuggingFace datasets in seconds |
|
hf_oauth: true |
|
hf_oauth_scopes: |
|
- write-repo |
|
- manage-repo |
|
--- |
|
|
|
# Semantic Text Deduplication Using SemHash |
|
|
|
This Gradio application performs **semantic deduplication** on Hugging Face datasets using [SemHash](https://github.com/MinishLab/semhash) with [Model2Vec](https://github.com/MinishLab/model2vec) embeddings.
|
|
|
## Features |
|
|
|
- **Two deduplication modes**: |
|
- **Single dataset**: Find and remove duplicates within one dataset |
|
- **Cross-dataset**: Remove entries from Dataset 2 that are similar to entries in Dataset 1 |
|
|
|
- **Customizable similarity threshold**: Control how strict the deduplication should be (0.0 = very loose, 1.0 = exact matches only) |
|
|
|
- **Detailed results**: View statistics and examples of found duplicates with word-level differences highlighted |
|
|
|
- **Hub Integration**: 🆕 **Push deduplicated datasets directly to the Hugging Face Hub** after logging in |
|
|
|
## How to Use |
|
|
|
### 1. Choose Deduplication Type |
|
- **Cross-dataset**: Useful for removing training data contamination from test sets |
|
- **Single dataset**: Clean up duplicate entries within a single dataset |
|
|
|
### 2. Configure Datasets |
|
- Enter the Hugging Face dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
|
- Specify the dataset splits (e.g., `train`, `test`, `validation`) |
|
- Set the text column name (usually `text`, `sentence`, or `content`) |
|
|
|
### 3. Set Similarity Threshold |
|
- **0.9** (default): Good balance between precision and recall |
|
- **Higher values** (0.95-0.99): More conservative, only removes very similar texts |
|
- **Lower values** (0.7-0.85): More aggressive, may remove semantically similar but different texts |
|
|
|
### 4. Run Deduplication |
|
Click **"Deduplicate"** to start the process. You'll see: |
|
- Loading progress for datasets |
|
- Deduplication progress |
|
- Results with statistics and example duplicates |
|
|
|
### 5. Push to Hub (New!) |
|
After deduplication completes: |
|
1. **Log in** with your Hugging Face account using the login button |
|
2. Enter a **dataset name** for your cleaned dataset |
|
3. Click **"Push to Hub"** to upload the deduplicated dataset |
|
|
|
The dataset will be saved as `your-username/dataset-name` and will be publicly available.
|
|
|
## Technical Details |
|
|
|
- **Embedding Model**: Uses `minishlab/potion-base-8M` (Model2Vec) for fast, efficient text embeddings |
|
- **Deduplication Algorithm**: SemHash for scalable semantic similarity detection |
|
- **Backend**: Runs on CPU (may be slow for large datasets on free tier) |
|
|
|
## Local Usage |
|
|
|
For faster processing of large datasets, run locally: |
|
|
|
```bash |
|
git clone <repository-url> |
|
cd semantic-deduplication |
|
pip install -r requirements.txt |
|
python app.py |
|
``` |
|
|
|
## Examples |
|
|
|
### Cross-dataset Deduplication |
|
Remove test set contamination: |
|
- **Dataset 1**: `your-org/training-data` (split: `train`) |
|
- **Dataset 2**: `your-org/test-data` (split: `test`) |
|
- **Result**: Clean test set with training examples removed |
|
|
|
### Single Dataset Cleaning |
|
Remove duplicates from a dataset: |
|
- **Dataset 1**: `common_voice` (split: `train`) |
|
- **Result**: Training set with duplicate audio transcriptions removed |
|
|
|
## Notes |
|
|
|
- The app preserves all original columns from the datasets |
|
- Only the text similarity is used for deduplication decisions |
|
- Deduplicated datasets maintain the same structure as the original |
|
- OAuth login is required only for pushing to the Hub, not for deduplication |
|
|