metadata
title: Human Mlm Clm Predictor
emoji: π
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.20.0
app_file: app.py
pinned: false
license: mit
short_description: See if you can predict the masked tokens / next token!
MLM and NTP Testing App
This Hugging Face Gradio space tests users on two fundamental NLP tasks:
- Masked Language Modeling (MLM) - Guess the masked words in a text
- Next Token Prediction (NTP) - Predict how a text continues
Features
- Switch between MLM and NTP tasks with a simple radio button
- Adjust masking/cutting ratio to control difficulty
- Sample texts from the cc_news dataset (100 samples, limited to 2 sentences)
- Track and display user accuracy for both tasks
- Detailed feedback on answers
- Token-by-token prediction for NTP task with immediate feedback
How to Use
For MLM Task
- Select "mlm" in the Task Type radio button
- Adjust mask ratio as desired (higher = more difficult)
- Click "New Sample" to get a text with [MASK] tokens
- Enter your guesses for the masked words, separated by spaces or commas
- Click "Check Answer" to see your accuracy
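The MLM flow above can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: it assumes whitespace-split tokens instead of the BERT tokenizer, and the helper names (`mask_tokens`, `parse_guesses`, `score_guesses`) are hypothetical.

```python
import random
import re

MASK = "[MASK]"


def mask_tokens(tokens, mask_ratio, rng):
    """Hide roughly mask_ratio of the tokens; return (masked tokens, answers in order)."""
    # Always mask at least one token, and never more than exist.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK
    answers = [tokens[pos] for pos in positions]
    return masked, answers


def parse_guesses(raw):
    """Split user input on commas and/or whitespace, dropping empty pieces."""
    return [g for g in re.split(r"[,\s]+", raw.strip()) if g]


def score_guesses(guesses, answers):
    """Fraction of masked positions guessed correctly (case-insensitive)."""
    correct = sum(g.lower() == a.lower() for g, a in zip(guesses, answers))
    return correct / len(answers)


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    masked, answers = mask_tokens(tokens, 0.5, random.Random(0))
    print(" ".join(masked))
    print(score_guesses(parse_guesses("cat, sat mat"), answers))
```

Missing guesses count as wrong here because the score divides by the number of masked positions, not by the number of guesses entered.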
For NTP Task
- Select "ntp" in the Task Type radio button
- Adjust cut ratio as desired (higher = more text is hidden)
- Click "New Sample" to get a partial text
- Type your prediction for the next token/word
- Click "Check Answer" to see if you're correct
- Continue predicting the next tokens one by one
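A token-by-token NTP round like the one described above might look like this. It is a sketch under assumptions: tokens are whitespace-split words (the app uses a BERT tokenizer), the text has at least two tokens, and `cut_text` and `NTPRound` are hypothetical names, not taken from app.py.

```python
def cut_text(tokens, cut_ratio):
    """Split tokens into a visible prefix and a hidden remainder.

    A higher cut_ratio hides more text. At least one token stays
    visible and at least one is hidden (requires len(tokens) >= 2).
    """
    n_hidden = min(len(tokens) - 1, max(1, round(len(tokens) * cut_ratio)))
    split = len(tokens) - n_hidden
    return list(tokens[:split]), list(tokens[split:])


class NTPRound:
    """Tracks one sample while the user predicts hidden tokens one at a time."""

    def __init__(self, tokens, cut_ratio):
        self.visible, self.hidden = cut_text(tokens, cut_ratio)
        self.position = 0  # index of the next hidden token to predict

    @property
    def finished(self):
        return self.position >= len(self.hidden)

    def check(self, guess):
        """Compare the guess with the next hidden token, then reveal it."""
        target = self.hidden[self.position]
        correct = guess.strip().lower() == target.lower()
        self.visible.append(target)  # immediate feedback: the token is revealed
        self.position += 1
        return correct, target


if __name__ == "__main__":
    game = NTPRound("the cat sat on the mat".split(), cut_ratio=0.5)
    print(" ".join(game.visible))
    print(game.check("on"))
```

Revealing the correct token after every guess, right or wrong, keeps the visible context accurate for the next prediction.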
Statistics
- The app keeps track of your accuracy for both tasks
- Click "Reset Stats" to start fresh
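One simple way to keep per-task accuracy with a reset is sketched below. The class names and the `"mlm"` / `"ntp"` keys mirror the task labels in this README, but the structure itself is an assumption, not the app's actual bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Running counts for a single task."""
    correct: int = 0
    total: int = 0

    @property
    def accuracy(self):
        # Report 0.0 before any answer has been checked.
        return self.correct / self.total if self.total else 0.0


class Stats:
    """Accuracy tracker for both tasks, with a reset like the "Reset Stats" button."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.tasks = {"mlm": TaskStats(), "ntp": TaskStats()}

    def record(self, task, n_correct, n_total):
        entry = self.tasks[task]
        entry.correct += n_correct
        entry.total += n_total


if __name__ == "__main__":
    stats = Stats()
    stats.record("mlm", 2, 3)  # 2 of 3 masked tokens guessed
    stats.record("ntp", 1, 1)  # 1 next-token guess, correct
    for name, entry in stats.tasks.items():
        print(name, f"{entry.accuracy:.0%}")
```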
Technical Details
- Uses the mlfoundations/dclm-baseline-1.0-parquet dataset from the Hugging Face Hub
- Employs streaming to efficiently sample 100 documents
- Uses a BERT tokenizer for consistent tokenization
- Limits samples to two sentences for better user experience
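The streaming and two-sentence limit can be illustrated with a small sketch. It assumes the stream is any iterable of records with a `"text"` field (for example, what `datasets.load_dataset(..., streaming=True)` yields for a text dataset) and uses a naive punctuation-based sentence splitter; the function names and the splitter are hypothetical, not the app's implementation.

```python
import re
from itertools import islice


def limit_sentences(text, max_sentences=2):
    """Keep only the first max_sentences sentences (naive split on . ! ?)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


def sample_documents(stream, n=100, text_key="text"):
    """Take the first n records from a lazy stream without reading the rest."""
    return [limit_sentences(record[text_key]) for record in islice(stream, n)]


if __name__ == "__main__":
    # Stand-in for a streamed dataset: a generator that is never fully consumed.
    fake_stream = (
        {"text": f"Doc {i} one. Doc {i} two. Doc {i} three."}
        for i in range(1_000_000)
    )
    samples = sample_documents(fake_stream, n=100)
    print(len(samples), samples[0])
```

Because `islice` stops after `n` items, only the first 100 records are ever pulled from the source, which is the point of streaming a large dataset.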