Commit 8f1d1e1 (verified) by orionweller · 1 parent: 7e7cce3

Update README.md

  1. README.md +40 -39
README.md CHANGED
@@ -11,44 +11,45 @@ license: mit
  short_description: See if you can predict the masked tokens / next token!
  ---
 
- ## MLM and NTP Testing App
- This Hugging Face Gradio space tests users on two fundamental NLP tasks:
-
- Masked Language Modeling (MLM) - Guess the masked words in a text
- Next Token Prediction (NTP) - Predict how a text continues
-
- #### Features
-
- Switch between MLM and NTP tasks with a simple radio button
- Adjust masking/cutting ratio to control difficulty
- Sample texts from the cc_news dataset (100 samples)
- Track and display user accuracy for both tasks
- Detailed feedback on answers
-
- #### How to Use
- ##### For MLM Task
-
- Select "mlm" in the Task Type radio button
- Adjust mask ratio as desired (higher = more difficult)
- Click "New Sample" to get a text with [MASK] tokens
- Enter your guesses for the masked words, separated by spaces or commas
- Click "Check Answer" to see your accuracy
 
- ##### For NTP Task
-
- Select "ntp" in the Task Type radio button
- Adjust cut ratio as desired (higher = more text is hidden)
- Click "New Sample" to get a partial text
- Type your prediction of how the text continues
- Click "Check Answer" to see your accuracy and the actual continuation
-
- #### Statistics
-
- The app keeps track of your accuracy for both tasks
- Click "Reset Stats" to start fresh
-
- #### Technical Details
 
- Uses HuggingFace's cc_news dataset (vblagoje/cc_news)
- Employs streaming to efficiently sample 100 documents
- Uses BERT tokenizer for consistent tokenization
+ # MLM and NTP Testing App
 
+ This Hugging Face Gradio space tests users on two fundamental NLP tasks:
 
+ 1. **Masked Language Modeling (MLM)** - Guess the masked words in a text
+ 2. **Next Token Prediction (NTP)** - Predict how a text continues
+
+ ## Features
+
+ - Switch between MLM and NTP tasks with a simple radio button
+ - Adjust masking/cutting ratio to control difficulty
+ - Sample texts from the cc_news dataset (100 samples, limited to 2 sentences)
+ - Track and display user accuracy for both tasks
+ - Detailed feedback on answers
+ - Token-by-token prediction for NTP task with immediate feedback
+
+ ## How to Use
+
+ ### For MLM Task
+ 1. Select "mlm" in the Task Type radio button
+ 2. Adjust mask ratio as desired (higher = more difficult)
+ 3. Click "New Sample" to get a text with [MASK] tokens
+ 4. Enter your guesses for the masked words, separated by spaces or commas
+ 5. Click "Check Answer" to see your accuracy
+
+ ### For NTP Task
+ 1. Select "ntp" in the Task Type radio button
+ 2. Adjust cut ratio as desired (higher = more text is hidden)
+ 3. Click "New Sample" to get a partial text
+ 4. Type your prediction for the next token/word
+ 5. Click "Check Answer" to see if you're correct
+ 6. Continue predicting the next tokens one by one
+
+ ## Statistics
+ - The app keeps track of your accuracy for both tasks
+ - Click "Reset Stats" to start fresh
+
+ ## Technical Details
+ - Uses HuggingFace's mlfoundations/dclm-baseline-1.0-parquet dataset
+ - Employs streaming to efficiently sample 100 documents
+ - Uses BERT tokenizer for consistent tokenization
+ - Limits samples to two sentences for better user experience
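
The README in this commit describes a two-sentence limit, a mask ratio for MLM, and a cut ratio for NTP, but the diff does not show the app code. The following is a minimal sketch of how those three steps could work, not the Space's actual implementation. All function names here (`limit_sentences`, `mask_tokens`, `cut_tokens`, `accuracy`) are hypothetical, and plain whitespace splitting stands in for the BERT tokenizer so the example needs only the standard library.

```python
import random
import re

MASK = "[MASK]"


def limit_sentences(text, max_sentences=2):
    """Keep the first `max_sentences` sentences (naive split on ., ! or ?)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(parts[:max_sentences])


def mask_tokens(tokens, mask_ratio, rng):
    """Replace about `mask_ratio` of the tokens with [MASK].

    Returns the masked token list and the hidden answers in reading order.
    """
    if not tokens:
        return [], []
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    answers = []
    for pos in positions:
        answers.append(tokens[pos])
        masked[pos] = MASK
    return masked, answers


def cut_tokens(tokens, cut_ratio):
    """Hide the trailing `cut_ratio` share of tokens, keeping at least one visible."""
    if not tokens:
        return [], []
    n_hidden = min(len(tokens) - 1, max(1, round(len(tokens) * cut_ratio)))
    split = len(tokens) - n_hidden
    return tokens[:split], tokens[split:]


def accuracy(guesses, answers):
    """Share of answers matched position by position, ignoring case."""
    if not answers:
        return 0.0
    correct = sum(g.lower() == a.lower() for g, a in zip(guesses, answers))
    return correct / len(answers)


if __name__ == "__main__":
    text = limit_sentences("One two three. Four five six! Seven eight.")
    tokens = text.split()  # assumption: whitespace split instead of BERT tokens
    masked, answers = mask_tokens(tokens, 0.5, random.Random(0))
    print(" ".join(masked), answers)
    visible, hidden = cut_tokens(tokens, 0.5)
    print(" ".join(visible), "|", hidden)
```

In this sketch a higher `mask_ratio` hides more words and a higher `cut_ratio` hides a longer tail, which matches the "higher = more difficult" and "higher = more text is hidden" notes in the README. For the token-by-token NTP flow, one could compare each guess against `hidden[0]`, then move that token to the visible side and repeat.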