Spaces: C10X/Dataset-Quality-Scorer

C10X committed on Jun 18

Commit dc235ba · verified · 1 Parent(s): 1dd398d

Upload readme (4).md

Files changed (1)

readme (4).md +74 -0

readme (4).md ADDED

	@@ -0,0 +1,74 @@

+---
+title: Dataset Quality Scorer
+emoji: 🎯
+colorFrom: purple
+colorTo: blue
+sdk: gradio
+sdk_version: 4.20.0
+app_file: app.py
+pinned: true
+license: apache-2.0
+models:
+  - openbmb/Ultra-FineWeb-classifier
+---
+# Dataset Quality Scorer 🎯
+Score your text datasets using the Ultra-FineWeb classifier for quality assessment.
+## Features
+- 📊 **Fast Quality Scoring**: Process thousands of samples quickly using FastText
+- 🤗 **Hub Integration**: Direct search and load from Hugging Face datasets
+- 📈 **Visual Analytics**: Quality distribution plots and detailed statistics
+- ☁️ **One-Click Upload**: Share your scored datasets on Hugging Face Hub
+- 🔒 **Private Repos**: Option to create private scored datasets
+- 📱 **Mobile Friendly**: Responsive design works on all devices
+## How It Works
+1. **Select Dataset**: Search and select any text dataset from Hugging Face Hub
+2. **Configure**: Choose split, text column, and sample size
+3. **Score**: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
+4. **Analyze**: View distribution plots and quality statistics
+5. **Share**: Upload scored dataset to your Hugging Face account
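The scoring and analysis steps above could look roughly like the following. This is a minimal sketch under stated assumptions, not the Space's actual `app.py`: `score_texts`, `summarize`, the `quality_score` field name, and the `scorer` callable are hypothetical names chosen for illustration, and any function that returns a value in [0, 1] can stand in for the real classifier.

```python
from __future__ import annotations

from itertools import islice
from statistics import mean, median
from typing import Callable, Iterable

MAX_SAMPLES = 100_000  # per-run cap stated under Limitations


def score_texts(texts: Iterable[str], scorer: Callable[[str], float],
                sample_size: int = 1_000) -> list[dict]:
    """Score up to `sample_size` texts; `scorer` is assumed to return a value in [0, 1]."""
    limit = min(sample_size, MAX_SAMPLES)
    rows = []
    for text in islice(texts, limit):
        score = scorer(text)
        # Clamp defensively so downstream statistics stay on the 0-1 scale.
        rows.append({"text": text, "quality_score": max(0.0, min(1.0, score))})
    return rows


def summarize(rows: list[dict]) -> dict:
    """Basic statistics for the 'Analyze' step (hypothetical field names)."""
    scores = [row["quality_score"] for row in rows]
    if not scores:
        return {"count": 0}
    return {
        "count": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
    }
```

Passing the scorer in as a callable keeps the sampling and statistics code independent of how the model is loaded, so it can be exercised with a toy scorer before the real classifier is wired in.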
+## Quality Score Interpretation
+- 🟢 **High Quality (≥0.8)**: Well-written, coherent, informative text
+- 🟡 **Medium Quality (0.5 to <0.8)**: Acceptable quality with some issues
+- 🔴 **Low Quality (<0.5)**: Poor quality, may contain errors or low coherence
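The tiers above can be expressed as a small lookup. The sketch below is illustrative only: `quality_tier` is a hypothetical helper, and it treats a score of exactly 0.8 as high quality because the high tier is stated as ≥0.8.

```python
def quality_tier(score: float) -> str:
    """Map a 0-1 quality score to the tiers listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"
```

For example, `quality_tier(0.65)` returns `"medium"`, and a score outside the 0-1 range raises `ValueError` rather than being silently assigned to a tier.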
+## Model Information
+This space uses the [Ultra-FineWeb classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier), a FastText model trained to assess text quality based on the FineWeb dataset standards.
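A fastText classifier returns a predicted label together with a probability, so some mapping is needed to get the single 0-1 quality score described in this README. The sketch below shows one plausible mapping. The label names `__label__pos` and `__label__neg` are assumptions that should be checked against the model card, and `prepare_text` and `to_quality_score` are hypothetical helpers, not code from the Space.

```python
from typing import Sequence

POSITIVE_LABEL = "__label__pos"  # assumed label name, check the model card
NEGATIVE_LABEL = "__label__neg"  # assumed label name, check the model card


def prepare_text(text: str) -> str:
    """Collapse all whitespace; fastText's predict() rejects text containing newlines."""
    return " ".join(text.split())


def to_quality_score(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Convert a top-1 fastText prediction into a 0-1 quality score."""
    # Cap at 1.0 in case the reported probability slightly exceeds it.
    label, prob = labels[0], min(float(probs[0]), 1.0)
    if label == POSITIVE_LABEL:
        return prob
    if label == NEGATIVE_LABEL:
        # A confident "negative" prediction means a low quality score.
        return 1.0 - prob
    raise ValueError(f"unexpected label: {label!r}")
```

With the `fasttext` Python package, the inputs would typically come from `labels, probs = model.predict(prepare_text(text))` after `model = fasttext.load_model(path)`, where `path` points to the downloaded model file.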
+## API Usage
+You can also use this scorer programmatically:
+```python
+from datasets import load_dataset
+import requests
+# Load your dataset ("your-dataset" is a placeholder for a Hub dataset ID)
+dataset = load_dataset("your-dataset")
+# ... scoring logic
+```
+## Limitations
+- Maximum 100,000 samples per run
+- Text-only datasets supported
+- Optimized for English text
+- The first run downloads the ~350 MB model
+## Privacy & Security
+- Login required only for uploading to Hub
+- Datasets are processed locally in the Space
+- No data is stored permanently
+## Credits
+Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.