title: Dataset_Quality_Scorer
emoji: 🎯
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 4.20.0
app_file: app.py
pinned: true
license: apache-2.0
models:
  - openbmb/Ultra-FineWeb-classifier

Score your text datasets using the Ultra-FineWeb classifier for quality assessment.

Features

📊 Fast Quality Scoring: Process thousands of samples quickly using fastText
🤗 Hub Integration: Search and load datasets directly from the Hugging Face Hub
📈 Visual Analytics: Quality distribution plots and detailed statistics
☁️ One-Click Upload: Share your scored datasets on the Hugging Face Hub
🔒 Private Repos: Option to upload scored datasets to a private repository
📱 Mobile Friendly: Responsive design that works on desktop and mobile devices
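
The statistics behind a quality distribution plot can be sketched with the standard library alone. This is a minimal illustration, not the Space's actual code: the function name `summarize_scores`, the returned keys, and the ten-bin histogram are all assumptions.

```python
from statistics import mean, median


def summarize_scores(scores, bins=10):
    """Return basic statistics and a fixed-width histogram for scores in [0, 1]."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * bins
    for s in scores:
        # Clamp the index so 1.0 lands in the last bin and stray
        # negative values land in the first one.
        idx = min(max(int(s * bins), 0), bins - 1)
        counts[idx] += 1
    return {
        "count": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
        "histogram": counts,  # counts[i] covers [i / bins, (i + 1) / bins)
    }


summary = summarize_scores([0.05, 0.45, 0.55, 0.95, 1.0])
```

The `histogram` list can be passed to any plotting library as bar heights, one bar per 0.1-wide score range.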

How It Works

Select Dataset: Search for and select any text dataset from the Hugging Face Hub
Configure: Choose the split, text column, and sample size
Score: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
Analyze: View distribution plots and quality statistics
Share: Upload the scored dataset to your Hugging Face account
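
The configure and score steps above can be sketched as a small function that takes rows, a text column, and a sample size. This is an illustration under assumptions, not the Space's implementation: `score_rows`, the `quality_score` field name, and the toy scorer are hypothetical, and the scorer stands in for the real classifier.

```python
from itertools import islice

MAX_SAMPLES = 100_000  # per-run cap stated under Limitations


def score_rows(rows, text_column, score_fn, sample_size=1000):
    """Score up to `sample_size` rows and return copies with a score field added."""
    limit = min(sample_size, MAX_SAMPLES)
    scored = []
    for row in islice(rows, limit):
        text = row[text_column]
        # "quality_score" is an assumed field name, not taken from the Space.
        scored.append({**row, "quality_score": score_fn(text)})
    return scored


def toy_score(text):
    """Hypothetical stand-in scorer: more words score higher, capped at 1.0."""
    return min(len(text.split()) / 10, 1.0)


rows = [
    {"text": "short"},
    {"text": "a somewhat longer example sentence with ten words in it"},
    {"text": "never reached when sample_size is 2"},
]
scored = score_rows(rows, "text", toy_score, sample_size=2)
```

With the `datasets` library, `rows` could be the result of `load_dataset("your-dataset", split="train")`, since iterating over a loaded split yields one dictionary per row.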

Quality Score Interpretation

🟢 High Quality (≥0.8): Well-written, coherent, informative text
🟡 Medium Quality (≥0.5 and <0.8): Acceptable quality with some issues
🔴 Low Quality (<0.5): Poor quality; may contain errors or lack coherence
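
The thresholds above translate directly into a small helper. This is a sketch: the function name `quality_band` and the band labels are illustrative, and a score of exactly 0.8 is treated as high, matching the ≥0.8 rule.

```python
from collections import Counter


def quality_band(score):
    """Map a 0-1 quality score to the band names used above."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


# Count how many example scores fall into each band.
bands = Counter(quality_band(s) for s in [0.92, 0.8, 0.79, 0.5, 0.31])
```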

Model Information

This Space uses the Ultra-FineWeb classifier (openbmb/Ultra-FineWeb-classifier), a fastText model trained to assess text quality based on the FineWeb dataset standards.
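
A fastText classifier returns labels with probabilities, so a 0-1 quality score is typically the probability of the positive label. The sketch below shows that conversion with a stub in place of the real model. The label name `__label__pos` is an assumption about this classifier, `StubModel` is hypothetical, and the real model may expect additional text preprocessing that is not shown here.

```python
def score_text(model, text, positive_label="__label__pos"):
    """Return the probability the model assigns to the positive label."""
    # fastText predicts one line at a time, so newlines are collapsed first.
    single_line = " ".join(text.split())
    labels, probs = model.predict(single_line, k=-1)  # k=-1 requests all labels
    for label, prob in zip(labels, probs):
        if label == positive_label:
            # Clamp defensively so the result stays on the 0-1 scale.
            return min(max(float(prob), 0.0), 1.0)
    return 0.0  # positive label not returned at all


class StubModel:
    """Hypothetical stand-in with the same predict() shape as a fastText model."""

    def predict(self, text, k=1):
        assert "\n" not in text  # mirrors fastText's one-line requirement
        return ("__label__neg", "__label__pos"), [0.25, 0.75]


score = score_text(StubModel(), "A clear,\nwell-formed paragraph.")
```

With the `fasttext` package installed, a real model object would come from `fasttext.load_model(path)`, where `path` points to the downloaded classifier file.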

API Usage

You can also use this scorer programmatically:

from datasets import load_dataset
import requests

# Load and score your dataset
dataset = load_dataset("your-dataset")
# ... scoring logic

Limitations

Maximum of 100,000 samples per run
Only text datasets are supported
Optimized for English-language text
The first run downloads a ~350 MB model

Privacy & Security

Login is required only for uploading to the Hub
Datasets are processed within the Space
No data is stored permanently

Credits

Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.