metadata
title: Dataset_Quality_Scorer
emoji: 🎯
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 4.20.0
app_file: app.py
pinned: true
license: apache-2.0
models:
- openbmb/Ultra-FineWeb-classifier
Score your text datasets using the Ultra-FineWeb classifier for quality assessment.
Features
- Fast Quality Scoring: Process thousands of samples quickly using FastText
- 🤗 Hub Integration: Search and load datasets directly from the Hugging Face Hub
- Visual Analytics: Quality distribution plots and detailed statistics
- ☁️ One-Click Upload: Share your scored datasets on the Hugging Face Hub
- Private Repos: Option to create private scored datasets
- 📱 Mobile Friendly: Responsive design works on all devices
How It Works
- Select Dataset: Search and select any text dataset from Hugging Face Hub
- Configure: Choose split, text column, and sample size
- Score: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
- Analyze: View distribution plots and quality statistics
- Share: Upload scored dataset to your Hugging Face account
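The Configure, Score, and Analyze steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Space's actual code: `score_sample`, `summarize`, `toy_scorer`, and the `quality_score` column name are all hypothetical, and `toy_scorer` is only a stand-in for the real classifier. The 100,000 cap comes from the Limitations section.

```python
import random
import statistics
from typing import Callable, Dict, Iterable, List

MAX_SAMPLES = 100_000  # per-run cap listed under Limitations


def score_sample(
    texts: Iterable[str],
    scorer: Callable[[str], float],
    sample_size: int,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Draw a random sample of texts and attach a score to each one."""
    pool = list(texts)
    # Never sample more than the cap or more than the data available.
    n = max(0, min(sample_size, MAX_SAMPLES, len(pool)))
    sample = random.Random(seed).sample(pool, n)
    return [{"text": t, "quality_score": scorer(t)} for t in sample]


def summarize(rows: List[Dict[str, object]]) -> Dict[str, float]:
    """Compute basic statistics over the scored rows."""
    scores = [float(r["quality_score"]) for r in rows]
    if not scores:
        return {"count": 0}
    return {
        "count": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
    }


def toy_scorer(text: str) -> float:
    """Hypothetical stand-in scorer: longer texts score higher, capped at 1.0."""
    return min(1.0, len(text.split()) / 10.0)


if __name__ == "__main__":
    rows = score_sample(["a b", "one two three four five", "x"], toy_scorer, sample_size=2)
    print(summarize(rows))
```

Passing the scorer in as a function keeps the sampling and statistics independent of how the model is loaded, so the stand-in can be swapped for a real classifier call.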
Quality Score Interpretation
- 🟢 High Quality (≥0.8): Well-written, coherent, informative text
- 🟡 Medium Quality (0.5 to <0.8): Acceptable quality with some issues
- 🔴 Low Quality (<0.5): Poor quality, may contain errors or low coherence
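The thresholds above translate directly into a small lookup. This is a sketch, with the function name `quality_bucket` and the returned strings chosen here for illustration; it assumes a score of exactly 0.8 counts as high quality, as the "≥0.8" bound suggests.

```python
def quality_bucket(score: float) -> str:
    """Map a 0-1 quality score to 'high', 'medium', or 'low'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score!r}")
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"
```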
Model Information
This Space uses the Ultra-FineWeb classifier (openbmb/Ultra-FineWeb-classifier), a FastText model trained to assess text quality based on the FineWeb dataset standards.
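FastText classifiers return a tuple of labels and a matching sequence of probabilities, so a 0-1 quality score has to be read off the positive label. The helpers below are a hedged sketch of that conversion. The label name `__label__pos` is an assumption that should be checked against the model's own labels, and the classifier's model card may prescribe extra preprocessing that this sketch does not reproduce.

```python
from typing import Sequence

# Assumption: the "good quality" class uses this label name.
POSITIVE_LABEL = "__label__pos"


def prepare_text(text: str) -> str:
    """Collapse all whitespace runs to single spaces.

    fastText's predict() rejects input containing newlines, so they
    must be removed before scoring.
    """
    return " ".join(text.split())


def positive_probability(
    labels: Sequence[str],
    probs: Sequence[float],
    positive_label: str = POSITIVE_LABEL,
) -> float:
    """Return the probability of the positive label, clamped to [0, 1].

    If the positive label is absent from the prediction, the score is
    treated as 0.0.
    """
    for label, prob in zip(labels, probs):
        if label == positive_label:
            # fastText probabilities can overshoot 1.0 by a tiny amount.
            return min(1.0, max(0.0, float(prob)))
    return 0.0


# Usage with a loaded fastText model (not executed here):
#   labels, probs = model.predict(prepare_text(raw_text), k=-1)
#   score = positive_probability(labels, probs)
```

Requesting all labels with `k=-1` means the positive label is normally present in the prediction, so the 0.0 fallback should rarely apply.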
API Usage
You can also use this scorer programmatically:
from datasets import load_dataset
import requests
# Load and score your dataset
dataset = load_dataset("your-dataset")
# ... scoring logic
Limitations
- Maximum 100,000 samples per run
- Only text datasets are supported
- Optimized for English text
- The first run downloads a ~350 MB model
Privacy & Security
- Login required only for uploading to Hub
- Datasets are processed locally in the Space
- No data is stored permanently
Credits
Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.