C10X committed
Commit dc235ba · verified · 1 Parent(s): 1dd398d

Upload readme (4).md

Files changed (1)
  1. readme (4).md +74 -0
readme (4).md ADDED
@@ -0,0 +1,74 @@
+ ---
+ title: Dataset Quality Scorer
+ emoji: 🎯
+ colorFrom: purple
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 4.20.0
+ app_file: app.py
+ pinned: true
+ license: apache-2.0
+ models:
+ - openbmb/Ultra-FineWeb-classifier
+ ---
+
+ # Dataset Quality Scorer 🎯
+
+ Score the quality of your text datasets with the Ultra-FineWeb classifier.
+
+ ## Features
+
+ - 📊 **Fast Quality Scoring**: Process thousands of samples quickly using fastText
+ - 🤗 **Hub Integration**: Search for and load datasets directly from the Hugging Face Hub
+ - 📈 **Visual Analytics**: Quality distribution plots and detailed statistics
+ - ☁️ **One-Click Upload**: Share your scored datasets on the Hugging Face Hub
+ - 🔒 **Private Repos**: Option to create private scored datasets
+ - 📱 **Mobile Friendly**: Responsive design works on all devices
+
+ ## How It Works
+
+ 1. **Select Dataset**: Search for and select any text dataset on the Hugging Face Hub
+ 2. **Configure**: Choose the split, text column, and sample size
+ 3. **Score**: The Ultra-FineWeb classifier scores each text on a 0-1 quality scale
+ 4. **Analyze**: View distribution plots and quality statistics
+ 5. **Share**: Upload the scored dataset to your Hugging Face account
+
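The configure, score, and analyze steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Space's actual code: `score_samples`, `summarize`, and `toy_score` are hypothetical names, rows are assumed to be dictionaries, and `toy_score` is a stand-in that does not reproduce the Ultra-FineWeb classifier.

```python
import random
import statistics


def score_samples(rows, text_column, score_fn, sample_size, seed=0):
    """Sample up to `sample_size` rows and attach a quality score to each.

    `score_fn` is any callable mapping a string to a float in [0, 1];
    in the Space this role is played by the classifier (assumption).
    """
    rows = list(rows)
    if sample_size < len(rows):
        # Seeded sampling keeps repeated runs comparable.
        rows = random.Random(seed).sample(rows, sample_size)
    return [
        {**row, "quality_score": float(score_fn(row[text_column]))}
        for row in rows
    ]


def summarize(scored):
    """Return basic statistics over the `quality_score` field."""
    scores = [row["quality_score"] for row in scored]
    if not scores:
        return {"count": 0}
    return {
        "count": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
    }


def toy_score(text):
    """Hypothetical placeholder scorer: NOT the Ultra-FineWeb classifier."""
    return min(len(text.split()) / 10, 1.0)


rows = [
    {"text": "a b c d e"},
    {"text": "one two three four five six seven eight nine ten"},
]
scored = score_samples(rows, "text", toy_score, sample_size=10)
stats = summarize(scored)
```

Passing the scorer in as a callable keeps the sampling and statistics independent of whichever model produces the scores.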
+ ## Quality Score Interpretation
+
+ - 🟢 **High Quality (≥0.8)**: Well-written, coherent, informative text
+ - 🟡 **Medium Quality (0.5-0.8)**: Acceptable quality with some issues
+ - 🔴 **Low Quality (<0.5)**: Poor quality, may contain errors or low coherence
+
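As a small illustration, the 0.8 and 0.5 thresholds listed above can be turned into a labeling helper. The function name `quality_label` and the returned strings are hypothetical and are not taken from the Space's code. The sketch assumes a score of exactly 0.8 counts as high and exactly 0.5 as medium, which is how the ranges above read.

```python
def quality_label(score: float) -> str:
    """Map a 0-1 quality score to a bucket using the README's thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


labels = [quality_label(s) for s in (0.95, 0.8, 0.65, 0.5, 0.2)]
```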
+ ## Model Information
+
+ This Space uses the [Ultra-FineWeb classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier), a fastText model trained to assess text quality based on the FineWeb dataset standards.
+
+ ## API Usage
+
+ You can also use this scorer programmatically:
+
+ ```python
+ from datasets import load_dataset
+ import requests
+
+ # Load and score your dataset
+ dataset = load_dataset("your-dataset")
+ # ... scoring logic
+ ```
+
+ ## Limitations
+
+ - Maximum of 100,000 samples per run
+ - Only text datasets are supported
+ - Optimized for English text
+ - The first run downloads a ~350 MB model
+
+ ## Privacy & Security
+
+ - Login is required only for uploading to the Hub
+ - Datasets are processed locally in the Space
+ - No data is stored permanently
+
+ ## Credits
+
+ Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.