---
title: Human Mlm Clm Predictor
emoji: 📉
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.20.0
app_file: app.py
pinned: false
license: mit
short_description: See if you can predict the masked tokens / next token!
---

# MLM and NTP Testing App

This Hugging Face Space, built with Gradio, tests users on two fundamental NLP tasks:

1. **Masked Language Modeling (MLM)** - Guess the masked words in a text
2. **Next Token Prediction (NTP)** - Predict how a text continues

## Features

- Switch between MLM and NTP tasks with a simple radio button
- Adjust masking/cutting ratio to control difficulty
- Sample texts from the cc_news dataset (100 samples, limited to 2 sentences)
- Track and display user accuracy for both tasks
- Detailed feedback on answers
- Token-by-token prediction for NTP task with immediate feedback

## How to Use

### For MLM Task
1. Select "mlm" in the Task Type radio button
2. Adjust mask ratio as desired (higher = more difficult)
3. Click "New Sample" to get a text with [MASK] tokens
4. Enter your guesses for the masked words, separated by spaces or commas
5. Click "Check Answer" to see your accuracy
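
The masking and scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: it masks whitespace-separated words rather than BERT tokens, and the names `mask_words`, `parse_guesses`, and `score_guesses` are hypothetical.

```python
import random
import re

MASK = "[MASK]"


def mask_words(text: str, mask_ratio: float, seed=None):
    """Replace a random share of whitespace-separated words with [MASK].

    Returns the masked text and the hidden words in reading order.
    Assumption: the real app works on tokenizer output, not plain words.
    """
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return "", []
    # Always mask at least one word so there is something to guess.
    n_mask = min(len(words), max(1, round(len(words) * mask_ratio)))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    answers = [words[i] for i in positions]
    for i in positions:
        words[i] = MASK
    return " ".join(words), answers


def parse_guesses(raw: str):
    """Split user input on commas and/or whitespace, dropping empty pieces."""
    return [g for g in re.split(r"[,\s]+", raw.strip()) if g]


def score_guesses(guesses, answers):
    """Return (correct, total), comparing position by position, ignoring case."""
    correct = sum(g.lower() == a.lower() for g, a in zip(guesses, answers))
    return correct, len(answers)
```

A higher `mask_ratio` hides more words, which is why the task gets harder as the slider goes up. Missing guesses simply count as wrong, because the total is always the number of masks.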

### For NTP Task
1. Select "ntp" in the Task Type radio button
2. Adjust cut ratio as desired (higher = more text is hidden)
3. Click "New Sample" to get a partial text
4. Type your prediction for the next token/word
5. Click "Check Answer" to see if you're correct
6. Continue predicting the next tokens one by one
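
The token-by-token flow above might look something like the sketch below. It assumes the text is already split into tokens and that a guess counts as correct on a case-insensitive exact match; `cut_text` and `NtpRound` are hypothetical names, not taken from the app.

```python
def cut_text(tokens, cut_ratio):
    """Split tokens into a visible prefix and a hidden continuation."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to cut a text")
    # Keep at least one token visible and at least one hidden.
    n_hidden = min(len(tokens) - 1, max(1, round(len(tokens) * cut_ratio)))
    split = len(tokens) - n_hidden
    return tokens[:split], tokens[split:]


class NtpRound:
    """One NTP sample: the user predicts the hidden tokens one at a time."""

    def __init__(self, tokens, cut_ratio):
        self.visible, self.hidden = cut_text(tokens, cut_ratio)
        self.position = 0  # index of the next hidden token to predict

    def done(self):
        return self.position >= len(self.hidden)

    def guess(self, prediction):
        """Check one prediction, then reveal the true token either way."""
        target = self.hidden[self.position]
        is_correct = prediction.strip().lower() == target.lower()
        self.visible.append(target)
        self.position += 1
        return is_correct, target
```

Revealing the true token after every guess, right or wrong, is what lets the user keep going with the correct context for the next prediction.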

## Statistics
- The app keeps track of your accuracy for both tasks
- Click "Reset Stats" to start fresh
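
One simple way to keep per-task accuracy is a pair of counters for each task, as in this sketch. The class names and the `"mlm"` / `"ntp"` keys mirror the radio button values above, but the structure itself is an assumption about how such tracking could work.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    correct: int = 0
    total: int = 0

    @property
    def accuracy(self) -> float:
        # Report 0.0 before any answer has been checked.
        return self.correct / self.total if self.total else 0.0


class Stats:
    """Running accuracy for both tasks, with a reset."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.tasks = {"mlm": TaskStats(), "ntp": TaskStats()}

    def record(self, task: str, correct: int, total: int):
        self.tasks[task].correct += correct
        self.tasks[task].total += total

    def accuracy(self, task: str) -> float:
        return self.tasks[task].accuracy
```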

## Technical Details
- Uses the mlfoundations/dclm-baseline-1.0-parquet dataset from the Hugging Face Hub
- Employs streaming to efficiently sample 100 documents
- Uses a BERT tokenizer for consistent tokenization
- Limits samples to two sentences for better user experience
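
The sampling and truncation steps can be illustrated without any dataset dependency. The sketch below takes the first `n` usable texts from any iterable of records and trims each to two sentences. It assumes records are dicts with a `"text"` field and uses a naive punctuation-based sentence split; in the app the iterable would presumably be a dataset opened in streaming mode, so that only the documents actually consumed are downloaded.

```python
import re
from itertools import islice


def first_sentences(text: str, max_sentences: int = 2) -> str:
    """Keep only the first few sentences of a text.

    Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


def take_samples(records, n: int = 100, text_key: str = "text"):
    """Lazily take the first n non-empty texts from an iterable of dicts."""
    texts = (
        first_sentences(r[text_key])
        for r in records
        if r.get(text_key, "").strip()
    )
    # islice stops pulling from the source once n items have been produced.
    return list(islice(texts, n))
```

Because `take_samples` only iterates as far as it needs to, it works the same way on an in-memory list and on a lazy stream.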