---
title: Human Mlm Clm Predictor
emoji: π
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.20.0
app_file: app.py
pinned: false
license: mit
short_description: See if you can predict the masked tokens / next token!
---

# MLM and NTP Testing App

This Hugging Face Gradio space tests users on two fundamental NLP tasks:

1. **Masked Language Modeling (MLM)** - Guess the masked words in a text
2. **Next Token Prediction (NTP)** - Predict how a text continues

## Features

- Switch between the MLM and NTP tasks with a radio button
- Adjust the masking/cutting ratio to control difficulty
- Sample texts from the cc_news dataset (100 samples, each limited to 2 sentences)
- Track and display user accuracy for both tasks
- Get detailed feedback on your answers
- Predict token by token in the NTP task, with immediate feedback after each guess

## How to Use

### For MLM Task

1. Select "mlm" in the Task Type radio button
2. Adjust the mask ratio as desired (higher = more difficult)
3. Click "New Sample" to get a text with [MASK] tokens
4. Enter your guesses for the masked words in order, separated by spaces or commas
5. Click "Check Answer" to see your accuracy
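
The MLM steps above can be sketched roughly as follows. This is only an illustration, not the code in `app.py`: it masks whitespace-separated words instead of BERT tokens, and the helper names `mask_words`, `parse_guesses`, and `score_guesses` are hypothetical.

```python
import random
import re

MASK = "[MASK]"


def mask_words(text, mask_ratio, seed=None):
    """Replace about mask_ratio of the words with [MASK].

    Returns the masked text and the hidden words in reading order.
    Assumption: words are split on whitespace, not with a BERT tokenizer.
    """
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return "", []
    # Always mask at least one word, never more than there are.
    n_mask = min(max(1, round(len(words) * mask_ratio)), len(words))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    answers = [words[i] for i in positions]
    for i in positions:
        words[i] = MASK
    return " ".join(words), answers


def parse_guesses(raw):
    """Split user input on commas and/or whitespace."""
    return [g for g in re.split(r"[,\s]+", raw.strip()) if g]


def score_guesses(guesses, answers):
    """Fraction of masked positions guessed correctly (case-insensitive)."""
    if not answers:
        return 0.0
    correct = sum(g.lower() == a.lower() for g, a in zip(guesses, answers))
    return correct / len(answers)
```

Missing guesses simply count as wrong here, because the score is divided by the number of masked positions rather than the number of guesses entered.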

### For NTP Task

1. Select "ntp" in the Task Type radio button
2. Adjust the cut ratio as desired (higher = more text is hidden)
3. Click "New Sample" to get a partial text
4. Type your prediction for the next token/word
5. Click "Check Answer" to see if you're correct
6. Continue predicting the next tokens one by one
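
A minimal sketch of the NTP loop above, assuming whitespace-separated words stand in for tokens. The names `cut_text` and `NTPRound` are hypothetical and are not taken from `app.py`.

```python
def cut_text(text, cut_ratio):
    """Split a text into a visible prefix and a hidden continuation.

    A higher cut_ratio hides more words. At least one word stays visible
    and, when possible, at least one word is hidden.
    """
    words = text.split()
    n_hidden = round(len(words) * cut_ratio)
    n_hidden = min(max(n_hidden, 1), max(len(words) - 1, 0))
    split = len(words) - n_hidden
    return words[:split], words[split:]


class NTPRound:
    """Tracks one NTP sample and reveals one hidden word per guess."""

    def __init__(self, visible, hidden):
        self.visible = list(visible)
        self.hidden = list(hidden)

    @property
    def finished(self):
        return not self.hidden

    def guess(self, token):
        """Compare a guess to the next hidden word, then reveal that word.

        Returns (is_correct, revealed_word). The comparison ignores case.
        """
        if self.finished:
            raise ValueError("no tokens left to predict")
        target = self.hidden.pop(0)
        self.visible.append(target)
        return token.strip().lower() == target.lower(), target
```

Revealing the true word after every guess, right or wrong, keeps the context identical to the original text, so each later prediction is judged against the same prefix.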

## Statistics

- The app tracks your accuracy separately for the MLM and NTP tasks
- Click "Reset Stats" to start fresh
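
One way to keep per-task accuracy is a small counter per task, sketched below. The `TaskStats` class and `new_stats` helper are hypothetical names used for illustration; the app may store its statistics differently.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Running tally of correct answers for one task."""

    correct: int = 0
    total: int = 0

    def record(self, n_correct, n_total):
        """Add the result of one checked answer to the tally."""
        self.correct += n_correct
        self.total += n_total

    @property
    def accuracy(self):
        """Share of correct answers so far, or 0.0 before any answer."""
        return self.correct / self.total if self.total else 0.0


def new_stats():
    """Fresh statistics for both tasks; calling this again resets them."""
    return {"mlm": TaskStats(), "ntp": TaskStats()}
```

For example, `stats["mlm"].record(2, 3)` would log two correct guesses out of three masked words, and replacing `stats` with `new_stats()` mirrors what "Reset Stats" does.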

## Technical Details

- Uses the `mlfoundations/dclm-baseline-1.0-parquet` dataset from the Hugging Face Hub
- Streams the dataset to sample 100 documents without downloading it in full
- Uses a BERT tokenizer for consistent tokenization
- Limits each sample to two sentences for a better user experience
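
The sampling and two-sentence limit could look roughly like the sketch below. It is written against any iterable of records so it runs without network access. The function names, the `"text"` field, and the regex-based sentence splitter are assumptions, not details confirmed by the app.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def first_sentences(text, max_sentences=2):
    """Keep only the first max_sentences sentences of a text."""
    sentences = _SENTENCE_END.split(text.strip())
    return " ".join(sentences[:max_sentences])


def sample_documents(stream, n=100, text_key="text"):
    """Take the first n non-empty documents from a stream of records.

    `stream` can be any iterable of dicts, for example a streaming
    dataset. Assumption: each record stores its text under `text_key`.
    """
    samples = []
    for record in stream:
        text = first_sentences(record.get(text_key, ""))
        if text:
            samples.append(text)
        if len(samples) >= n:
            break
    return samples
```

With the `datasets` library, the stream would presumably come from something like `load_dataset("mlfoundations/dclm-baseline-1.0-parquet", split="train", streaming=True)`. Because the loop stops after `n` documents, only the beginning of the dataset is ever read.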