---
title: Human Mlm Clm Predictor
emoji: 📉
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.20.0
app_file: app.py
pinned: false
license: mit
short_description: See if you can predict the masked tokens / next token!
---

# MLM and NTP Testing App

This Hugging Face Gradio space tests users on two fundamental NLP tasks:

1. **Masked Language Modeling (MLM)** - Guess the masked words in a text
2. **Next Token Prediction (NTP)** - Predict how a text continues

## Features

- Switch between MLM and NTP tasks with a simple radio button
- Adjust masking/cutting ratio to control difficulty
- Sample texts from the cc_news dataset (100 samples, limited to 2 sentences)
- Track and display user accuracy for both tasks
- Detailed feedback on answers
- Token-by-token prediction for NTP task with immediate feedback

## How to Use

### For MLM Task

1. Select "mlm" in the Task Type radio button
2. Adjust mask ratio as desired (higher = more difficult)
3. Click "New Sample" to get a text with `[MASK]` tokens
4. Enter your guesses for the masked words, separated by spaces or commas
5. Click "Check Answer" to see your accuracy

### For NTP Task

1. Select "ntp" in the Task Type radio button
2. Adjust cut ratio as desired (higher = more text is hidden)
3. Click "New Sample" to get a partial text
4. Type your prediction for the next token/word
5. Click "Check Answer" to see if you're correct
6. Continue predicting the next tokens one by one

## Statistics

- The app keeps track of your accuracy for both tasks
- Click "Reset Stats" to start fresh

## Technical Details

- Uses HuggingFace's mlfoundations/dclm-baseline-1.0-parquet dataset
- Employs streaming to efficiently sample 100 documents
- Uses BERT tokenizer for consistent tokenization
- Limits samples to two sentences for better user experience
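
## Example Sketch

The masking, cutting, and answer-checking steps described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code in `app.py`: it splits on whitespace instead of using the BERT tokenizer, and the function names `mask_tokens`, `cut_text`, and `score_guesses` are hypothetical.

```python
import random
import re


def mask_tokens(text, mask_ratio=0.15, seed=None):
    """Replace roughly mask_ratio of the tokens with [MASK].

    Returns the masked text and the hidden tokens in reading order.
    Whitespace splitting is an assumption; the app uses a BERT tokenizer.
    """
    rng = random.Random(seed)
    tokens = text.split()
    if not tokens:
        return "", []
    # Always mask at least one token so there is something to guess.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    answers = [tokens[i] for i in positions]
    for i in positions:
        tokens[i] = "[MASK]"
    return " ".join(tokens), answers


def cut_text(text, cut_ratio=0.5):
    """Split text into a visible prefix and the hidden tokens to predict."""
    tokens = text.split()
    if len(tokens) > 1:
        # Hide at least one token, but keep at least one visible.
        n_hidden = max(1, min(round(len(tokens) * cut_ratio), len(tokens) - 1))
    else:
        n_hidden = 0
    visible = tokens[: len(tokens) - n_hidden]
    hidden = tokens[len(tokens) - n_hidden:]
    return " ".join(visible), hidden


def score_guesses(guesses, answers):
    """Parse space- or comma-separated guesses and return accuracy in [0, 1]."""
    parsed = [g for g in re.split(r"[,\s]+", guesses.strip()) if g]
    # Compare guesses to answers position by position, ignoring case.
    correct = sum(g.lower() == a.lower() for g, a in zip(parsed, answers))
    return correct / len(answers) if answers else 0.0
```

A higher `mask_ratio` hides more tokens in the MLM task, and a higher `cut_ratio` leaves a shorter visible prefix in the NTP task. For NTP, the hidden tokens returned by `cut_text` could be revealed one at a time as the user submits each prediction.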