title: Dataset_Quality_Scorer
emoji: 🎯
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 4.20.0
app_file: app.py
pinned: true
license: apache-2.0
models:
  - openbmb/Ultra-FineWeb-classifier

Score your text datasets using the Ultra-FineWeb classifier for quality assessment.

Features

  • 📊 Fast Quality Scoring: Process thousands of samples quickly using fastText
  • 🤗 Hub Integration: Search and load datasets directly from the Hugging Face Hub
  • 📈 Visual Analytics: Quality distribution plots and detailed statistics
  • ☁️ One-Click Upload: Share your scored datasets on the Hugging Face Hub
  • 🔒 Private Repos: Option to create private scored datasets
  • 📱 Mobile Friendly: Responsive design works on all devices

How It Works

  1. Select Dataset: Search and select any text dataset from Hugging Face Hub
  2. Configure: Choose split, text column, and sample size
  3. Score: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
  4. Analyze: View distribution plots and quality statistics
  5. Share: Upload scored dataset to your Hugging Face account
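Steps 2 to 4 above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the Space's actual code: `score_rows`, `summarize`, and `toy_scorer` are hypothetical names, rows are plain dictionaries, and `toy_scorer` is a stand-in heuristic used where the real app would call the Ultra-FineWeb classifier.

```python
import random
import statistics
from typing import Callable, Dict, List, Optional


def score_rows(
    rows: List[Dict],
    text_column: str,
    scorer: Callable[[str], float],
    sample_size: Optional[int] = None,
    seed: int = 0,
) -> List[Dict]:
    """Return copies of the (optionally sampled) rows with a `quality_score` field."""
    rows = list(rows)
    if sample_size is not None and sample_size < len(rows):
        # Fixed seed so that repeated runs pick the same sample.
        rows = random.Random(seed).sample(rows, sample_size)
    return [{**row, "quality_score": float(scorer(row[text_column]))} for row in rows]


def summarize(scored: List[Dict]) -> Dict[str, float]:
    """Basic statistics over the `quality_score` field."""
    scores = [row["quality_score"] for row in scored]
    if not scores:
        return {"count": 0}
    return {
        "count": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
    }


def toy_scorer(text: str) -> float:
    """Hypothetical stand-in scorer: share of letters and spaces in the text."""
    if not text:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)


rows = [{"text": "A clear and complete sentence"}, {"text": "$$$ 1234 ###"}]
scored = score_rows(rows, text_column="text", scorer=toy_scorer)
print(summarize(scored))
```

Passing the scorer as a callable keeps the sampling and statistics independent of the model, so the stand-in can be swapped for a real classifier call without touching the rest.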

Quality Score Interpretation

  • 🟒 High Quality (β‰₯0.8): Well-written, coherent, informative text
  • 🟑 Medium Quality (0.5-0.8): Acceptable quality with some issues
  • πŸ”΄ Low Quality (<0.5): Poor quality, may contain errors or low coherence
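The bands above translate directly into a threshold check. The sketch below is only an illustration: `quality_band` and the constant names are hypothetical, and the boundaries follow the list above (0.8 counts as high, 0.5 counts as medium).

```python
from collections import Counter

HIGH_THRESHOLD = 0.8    # scores at or above this are "high"
MEDIUM_THRESHOLD = 0.5  # scores at or above this (and below 0.8) are "medium"


def quality_band(score: float) -> str:
    """Map a 0-1 quality score to "high", "medium", or "low"."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


# Count how many scores fall into each band.
scores = [0.93, 0.8, 0.61, 0.5, 0.12]
print(Counter(quality_band(s) for s in scores))
```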

Model Information

This Space uses the Ultra-FineWeb classifier (openbmb/Ultra-FineWeb-classifier), a fastText model trained to assess text quality in line with the standards of the FineWeb dataset.
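A fastText classifier returns labels with probabilities rather than a single number, so some conversion is needed to get a 0-1 quality score. The sketch below shows one way to do that, under assumptions that should be checked against the model card: the positive label name `__label__pos` is a guess, the helper names are hypothetical, and the official classifier may expect its own text preprocessing, which is not reproduced here. The only library call relied on is fastText's `predict(text, k=-1)`, which returns a `(labels, probabilities)` pair.

```python
from typing import Sequence

POSITIVE_LABEL = "__label__pos"  # assumption: verify against the model card


def to_quality_score(
    labels: Sequence[str],
    probs: Sequence[float],
    positive_label: str = POSITIVE_LABEL,
) -> float:
    """Return the probability of the positive label, or 0.0 if it is absent."""
    for label, prob in zip(labels, probs):
        if label == positive_label:
            # fastText can report values marginally above 1.0, so clamp.
            return min(1.0, max(0.0, float(prob)))
    return 0.0


def score_text(model, text: str) -> float:
    """Score one text with any object exposing a fastText-style `predict`."""
    # fastText predicts one line at a time, so collapse all whitespace
    # (including newlines) into single spaces first.
    single_line = " ".join(text.split())
    labels, probs = model.predict(single_line, k=-1)  # k=-1 returns every label
    return to_quality_score(labels, probs)


# Usage with the real library (needs the `fasttext` package and a local model file):
#   import fasttext
#   model = fasttext.load_model("path/to/classifier.bin")  # hypothetical path
#   print(score_text(model, "Some text to score."))
```

Requesting every label with `k=-1` means the positive-label probability is always present for a binary model, so the `0.0` fallback only triggers if the label name is wrong.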

API Usage

You can also use this scorer programmatically:

from datasets import load_dataset
import requests

# Load the dataset to score; "your-dataset" is a placeholder for a Hub dataset ID
dataset = load_dataset("your-dataset")
# ... scoring logic

Limitations

  • Maximum of 100,000 samples per run
  • Only text datasets are supported
  • Optimized for English text
  • The first run downloads the model (~350 MB)
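For datasets larger than the per-run limit, one option is to score them in consecutive slices. The sketch below only computes the slice boundaries; `run_ranges` and `MAX_SAMPLES_PER_RUN` are hypothetical names, and splitting a dataset this way is a suggested workaround rather than a feature of the Space.

```python
from typing import Iterator, Tuple

MAX_SAMPLES_PER_RUN = 100_000  # per-run limit stated above


def run_ranges(
    total_rows: int, max_per_run: int = MAX_SAMPLES_PER_RUN
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) row ranges, each at most `max_per_run` long."""
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if max_per_run <= 0:
        raise ValueError("max_per_run must be positive")
    for start in range(0, total_rows, max_per_run):
        yield start, min(start + max_per_run, total_rows)


# A 250,000-row dataset needs three runs.
print(list(run_ranges(250_000)))
```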

Privacy & Security

  • Login required only for uploading to Hub
  • Datasets are processed locally in the Space
  • No data is stored permanently

Credits

Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.