orionweller committed
Commit d1414a2 · verified · 1 Parent(s): ee1d16c

Update README.md

Files changed (1)
  1. README.md +41 -1
README.md CHANGED
@@ -11,4 +11,44 @@ license: mit
  short_description: See if you can predict the masked tokens / next token!
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ## MLM and NTP Testing App
+ This Hugging Face Gradio space tests users on two fundamental NLP tasks:
+
+ - Masked Language Modeling (MLM) - guess the masked words in a text
+ - Next Token Prediction (NTP) - predict how a text continues
+
+ #### Features
+
+ - Switch between MLM and NTP tasks with a simple radio button
+ - Adjust the masking/cutting ratio to control difficulty
+ - Sample texts from the cc_news dataset (100 samples)
+ - Track and display user accuracy for both tasks
+ - Detailed feedback on answers
+
+ #### How to Use
+ ##### For MLM Task
+
+ 1. Select "mlm" in the Task Type radio button
+ 2. Adjust the mask ratio as desired (higher = more difficult)
+ 3. Click "New Sample" to get a text with [MASK] tokens
+ 4. Enter your guesses for the masked words, separated by spaces or commas
+ 5. Click "Check Answer" to see your accuracy
+
+ ##### For NTP Task
+
+ 1. Select "ntp" in the Task Type radio button
+ 2. Adjust the cut ratio as desired (higher = more text is hidden)
+ 3. Click "New Sample" to get a partial text
+ 4. Type your prediction of how the text continues
+ 5. Click "Check Answer" to see your accuracy and the actual continuation
+
+ #### Statistics
+
+ - The app keeps track of your accuracy for both tasks
+ - Click "Reset Stats" to start fresh
+
+ #### Technical Details
+
+ - Uses Hugging Face's cc_news dataset (vblagoje/cc_news)
+ - Employs streaming to efficiently sample 100 documents
+ - Uses the BERT tokenizer for consistent tokenization
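
For readers curious how the pieces under Technical Details fit together, here is a minimal sketch of the sampling and scoring mechanics, assuming the `datasets` and `transformers` libraries. It is not the Space's actual source: the function names (`make_mlm_sample`, `make_ntp_sample`, `score_guesses`), the token budget, and the exact masking strategy are illustrative assumptions.

```python
# A minimal sketch of the mechanics described above, assuming the `datasets`
# and `transformers` libraries. Not the Space's actual source; the function
# names and masking strategy are illustrative assumptions.
import random
from itertools import islice

from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the dataset so only the first 100 documents are fetched,
# rather than downloading the full cc_news corpus.
stream = load_dataset("vblagoje/cc_news", split="train", streaming=True)
samples = [row["text"] for row in islice(stream, 100) if row["text"].strip()]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def make_mlm_sample(text, mask_ratio=0.15, max_tokens=64):
    """Replace a random fraction of tokens with [MASK]; return (text, answers)."""
    tokens = tokenizer.tokenize(text)[:max_tokens]
    n_masks = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(random.sample(range(len(tokens)), n_masks))
    answers = [tokens[i] for i in positions]
    for i in positions:
        tokens[i] = tokenizer.mask_token  # "[MASK]" for BERT
    return tokenizer.convert_tokens_to_string(tokens), answers

def make_ntp_sample(text, cut_ratio=0.3, max_tokens=64):
    """Hide the trailing cut_ratio of tokens; return (prefix, continuation)."""
    tokens = tokenizer.tokenize(text)[:max_tokens]
    keep = max(1, int(len(tokens) * (1 - cut_ratio)))
    prefix = tokenizer.convert_tokens_to_string(tokens[:keep])
    continuation = tokenizer.convert_tokens_to_string(tokens[keep:])
    return prefix, continuation

def score_guesses(guesses_text, answers):
    """Position-wise accuracy; guesses may be separated by spaces or commas."""
    guesses = [g.lower() for g in guesses_text.replace(",", " ").split()]
    correct = sum(g == a.lower() for g, a in zip(guesses, answers))
    return correct / len(answers)

masked_text, answers = make_mlm_sample(random.choice(samples), mask_ratio=0.2)
print(masked_text)
print("accuracy:", score_guesses("the of and", answers))
```

Streaming means only the first 100 documents are fetched instead of the whole corpus, and using the same BERT tokenizer both to place masks and to recover the hidden answers keeps the scoring consistent with the text the user sees.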