metadata
title: Human Mlm Clm Predictor
emoji: 📉
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.20.0
app_file: app.py
pinned: false
license: mit
short_description: See if you can predict the masked tokens / next token!

MLM and NTP Testing App

This Hugging Face Gradio space tests users on two fundamental NLP tasks:

  1. Masked Language Modeling (MLM) - Guess the masked words in a text
  2. Next Token Prediction (NTP) - Predict how a text continues

Features

  • Switch between MLM and NTP tasks with a simple radio button
  • Adjust masking/cutting ratio to control difficulty
  • Sample texts from the cc_news dataset (100 samples, limited to 2 sentences)
  • Track and display user accuracy for both tasks
  • Detailed feedback on answers
  • Token-by-token prediction for NTP task with immediate feedback

How to Use

For MLM Task

  1. Select "mlm" in the Task Type radio button
  2. Adjust mask ratio as desired (higher = more difficult)
  3. Click "New Sample" to get a text with [MASK] tokens
  4. Enter your guesses for the masked words, separated by spaces or commas
  5. Click "Check Answer" to see your accuracy
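
The MLM steps above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: it assumes whitespace-separated words stand in for tokenizer tokens, and the helper names `mask_text`, `parse_guesses`, and `score_guesses` are hypothetical.

```python
import random
import re

MASK = "[MASK]"


def mask_text(text, mask_ratio=0.15, seed=None):
    """Replace a random fraction of whitespace-separated words with [MASK].

    Assumption: words are split on whitespace; the real app may mask
    tokenizer tokens instead.
    """
    rng = random.Random(seed)
    words = text.split()
    # Mask at least one word (when any exist) so there is something to guess.
    n_mask = min(len(words), max(1, round(len(words) * mask_ratio)))
    positions = set(rng.sample(range(len(words)), n_mask))
    answers = [w for i, w in enumerate(words) if i in positions]
    masked = [MASK if i in positions else w for i, w in enumerate(words)]
    return " ".join(masked), answers


def parse_guesses(raw):
    """Split user input on commas and/or whitespace, dropping empty pieces."""
    return [g for g in re.split(r"[,\s]+", raw.strip()) if g]


def score_guesses(guesses, answers):
    """Return the fraction of masks guessed correctly, compared in order."""
    if not answers:
        return 0.0
    correct = sum(g.lower() == a.lower() for g, a in zip(guesses, answers))
    return correct / len(answers)
```

A higher `mask_ratio` hides more words, which is why the task gets harder as the ratio grows. Missing guesses simply count as wrong, because the score is divided by the number of masks rather than the number of guesses.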

For NTP Task

  1. Select "ntp" in the Task Type radio button
  2. Adjust cut ratio as desired (higher = more text is hidden)
  3. Click "New Sample" to get a partial text
  4. Type your prediction for the next token/word
  5. Click "Check Answer" to see if you're correct
  6. Continue predicting the next tokens one by one
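
One way to picture the token-by-token NTP loop is the sketch below. It is an assumption-laden illustration rather than the app's implementation: words split on whitespace stand in for tokens, and `cut_text` and `NextTokenQuiz` are hypothetical names.

```python
def cut_text(text, cut_ratio=0.3):
    """Split words into a visible prefix and a hidden continuation.

    Assumption: whitespace-separated words stand in for tokenizer tokens.
    """
    words = text.split()
    # Hide at least one word, but always leave at least one word visible.
    n_hidden = max(0, min(len(words) - 1, max(1, round(len(words) * cut_ratio))))
    split_at = len(words) - n_hidden
    return words[:split_at], words[split_at:]


class NextTokenQuiz:
    """Tracks one NTP sample: the visible text grows as tokens are revealed."""

    def __init__(self, text, cut_ratio=0.3):
        self.visible, self.hidden = cut_text(text, cut_ratio)
        self.position = 0  # index of the next hidden token to predict
        self.correct = 0

    def finished(self):
        return self.position >= len(self.hidden)

    def guess(self, prediction):
        """Check one prediction, reveal the true token, and advance."""
        target = self.hidden[self.position]
        is_correct = prediction.strip().lower() == target.lower()
        self.correct += int(is_correct)
        self.visible.append(target)  # reveal it whether or not the guess was right
        self.position += 1
        return is_correct, target
```

Revealing the true token after every guess gives the immediate feedback described above and keeps the context correct for the next prediction.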

Statistics

  • The app keeps track of your accuracy for both tasks
  • Click "Reset Stats" to start fresh
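
Per-task accuracy tracking of this kind could look like the sketch below. The class names `TaskStats` and `UserStats` and the summary format are hypothetical; the source does not show how the app stores its statistics.

```python
from dataclasses import dataclass, field


@dataclass
class TaskStats:
    correct: int = 0
    total: int = 0

    @property
    def accuracy(self):
        # Report 0.0 before any answers have been recorded.
        return self.correct / self.total if self.total else 0.0


def _fresh_tasks():
    return {"mlm": TaskStats(), "ntp": TaskStats()}


@dataclass
class UserStats:
    tasks: dict = field(default_factory=_fresh_tasks)

    def record(self, task, n_correct, n_total):
        """Add the result of one checked answer to the running totals."""
        stats = self.tasks[task]
        stats.correct += n_correct
        stats.total += n_total

    def reset(self):
        """Equivalent of clicking "Reset Stats": drop all running totals."""
        self.tasks = _fresh_tasks()

    def summary(self):
        return " | ".join(
            f"{name.upper()}: {s.correct}/{s.total} ({s.accuracy:.0%})"
            for name, s in self.tasks.items()
        )
```

Keeping raw counts instead of a running average means the accuracy stays exact no matter how many answers are recorded.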

Technical Details

  • Uses the mlfoundations/dclm-baseline-1.0-parquet dataset from the Hugging Face Hub
  • Employs streaming to efficiently sample 100 documents
  • Uses BERT tokenizer for consistent tokenization
  • Limits samples to two sentences for better user experience
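
The sampling and truncation described above might be sketched as follows. This is a simplified stand-in, not the app's code: it works on any iterable of records, assumes each record is a dict with a `"text"` field, and splits sentences with a naive punctuation regex. The names `first_sentences` and `sample_documents` are hypothetical. In practice the iterable would presumably come from something like `datasets.load_dataset(..., streaming=True)`, and tokenization would go through the BERT tokenizer rather than plain strings.

```python
import re
from itertools import islice

# Naive sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def first_sentences(text, max_sentences=2):
    """Keep only the first `max_sentences` sentences of a text."""
    sentences = _SENTENCE_END.split(text.strip())
    return " ".join(sentences[:max_sentences])


def sample_documents(records, n_docs=100, text_field="text"):
    """Take the first n_docs non-empty texts from a (possibly streaming) iterable.

    Assumption: each record is a dict with a string under `text_field`.
    """
    texts = (r[text_field] for r in records if r.get(text_field, "").strip())
    # islice stops pulling from the stream once n_docs texts are collected,
    # so an endless or very large source is never fully read.
    return [first_sentences(t) for t in islice(texts, n_docs)]
```

Because the generator and `islice` are lazy, only as many records as needed are consumed, which is the main benefit of streaming when only 100 documents are wanted.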