Spaces: orionweller/human-mlm-clm-predictor


title: Human Mlm Clm Predictor
emoji: 📉
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.20.0
app_file: app.py
pinned: false
license: mit
short_description: See if you can predict the masked tokens / next token!

MLM and NTP Testing App

This Hugging Face Gradio space tests users on two fundamental NLP tasks:

Masked Language Modeling (MLM) - Guess the masked words in a text
Next Token Prediction (NTP) - Predict how a text continues

Features

Switch between MLM and NTP tasks with a simple radio button
Adjust masking/cutting ratio to control difficulty
Sample texts from the cc_news dataset (100 samples, limited to 2 sentences)
Track and display user accuracy for both tasks
Detailed feedback on answers
Token-by-token prediction for NTP task with immediate feedback

How to Use

For MLM Task

Select "mlm" in the Task Type radio button
Adjust mask ratio as desired (higher = more difficult)
Click "New Sample" to get a text with [MASK] tokens
Enter your guesses for the masked words, separated by spaces or commas
Click "Check Answer" to see your accuracy
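The MLM flow above can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: it masks whitespace-separated words rather than BERT tokens, and the helper names (`mask_words`, `parse_guesses`, `score_guesses`) as well as the case-insensitive, in-order matching rule are assumptions made for the example.

```python
import random
import re

MASK = "[MASK]"


def mask_words(words, mask_ratio, rng):
    """Replace roughly mask_ratio of the words with [MASK].

    Returns the masked word list and the hidden answers in text order.
    (Hypothetical helper: the real app may mask tokenizer tokens instead.)
    """
    n_mask = min(len(words), max(1, round(len(words) * mask_ratio)))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    masked = list(words)
    for pos in positions:
        masked[pos] = MASK
    answers = [words[pos] for pos in positions]
    return masked, answers


def parse_guesses(raw):
    """Split user input on commas and/or whitespace, dropping empty pieces."""
    return [piece for piece in re.split(r"[,\s]+", raw.strip()) if piece]


def score_guesses(guesses, answers):
    """Fraction of masked positions guessed correctly.

    Assumption: guesses are matched to masks in order, ignoring case;
    missing guesses count as wrong.
    """
    if not answers:
        return 0.0
    correct = sum(g.lower() == a.lower() for g, a in zip(guesses, answers))
    return correct / len(answers)


if __name__ == "__main__":
    words = "the cat sat on the mat".split()
    masked, answers = mask_words(words, mask_ratio=0.5, rng=random.Random(0))
    print(" ".join(masked))
    print(score_guesses(parse_guesses("cat, sat mat"), answers))
```

Raising `mask_ratio` hides more words, which is why a higher ratio is harder: there is less context left to guess from.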

For NTP Task

Select "ntp" in the Task Type radio button
Adjust cut ratio as desired (higher = more text is hidden)
Click "New Sample" to get a partial text
Type your prediction for the next token/word
Click "Check Answer" to see if you're correct
Continue predicting the next tokens one by one
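The token-by-token NTP loop described above might look something like this sketch. It assumes the text is already split into tokens, that the cut ratio is the fraction of tokens hidden at the end, and that the true token is revealed after every guess whether or not it was right; `cut_text` and `NtpSession` are hypothetical names, not taken from app.py.

```python
def cut_text(tokens, cut_ratio):
    """Split tokens into a visible prefix and a hidden remainder.

    Assumption: cut_ratio is the fraction of tokens hidden, with at least
    one token hidden and at least one token left visible.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to cut a text")
    n_hidden = min(len(tokens) - 1, max(1, round(len(tokens) * cut_ratio)))
    split = len(tokens) - n_hidden
    return tokens[:split], tokens[split:]


class NtpSession:
    """Tracks one sample as the user predicts hidden tokens one by one."""

    def __init__(self, tokens, cut_ratio):
        self.visible, self.hidden = cut_text(tokens, cut_ratio)
        self.position = 0  # index of the next hidden token to predict
        self.correct = 0

    def finished(self):
        return self.position >= len(self.hidden)

    def guess(self, prediction):
        """Check one prediction, reveal the true token, and advance."""
        target = self.hidden[self.position]
        is_correct = prediction.strip().lower() == target.lower()
        self.correct += int(is_correct)
        self.visible = self.visible + [target]
        self.position += 1
        return is_correct, target


if __name__ == "__main__":
    session = NtpSession("the quick brown fox jumps".split(), cut_ratio=0.4)
    print(" ".join(session.visible))
    print(session.guess("fox"))
    print(session.guess("runs"))
```

Revealing the true token after each guess keeps later predictions conditioned on the real text, so one early mistake does not derail the rest of the sample.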

Statistics

The app keeps track of your accuracy for both tasks
Click "Reset Stats" to start fresh
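One simple way to keep per-task accuracy with a reset is a pair of counters per task, as in the sketch below. The `TaskStats` class and `new_stats` function are illustrative names; how the app actually stores its statistics is not stated here.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Running counts of correct and total answers for one task."""

    correct: int = 0
    total: int = 0

    def record(self, n_correct, n_total):
        """Add the result of one checked answer (or one batch of masks)."""
        self.correct += n_correct
        self.total += n_total

    @property
    def accuracy(self):
        """Accuracy so far; 0.0 before anything has been answered."""
        return self.correct / self.total if self.total else 0.0


def new_stats():
    """Fresh statistics for both tasks; calling this again acts as a reset."""
    return {"mlm": TaskStats(), "ntp": TaskStats()}


if __name__ == "__main__":
    stats = new_stats()
    stats["mlm"].record(n_correct=2, n_total=4)
    stats["ntp"].record(n_correct=1, n_total=1)
    print(stats["mlm"].accuracy, stats["ntp"].accuracy)
    stats = new_stats()  # what a "Reset Stats" click could do
```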

Technical Details

Uses the mlfoundations/dclm-baseline-1.0-parquet dataset from the Hugging Face Hub
Employs streaming to efficiently sample 100 documents
Uses a BERT tokenizer for consistent tokenization
Limits samples to two sentences for better user experience
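The sampling steps listed above could be approximated as in the following sketch. Several details are assumptions rather than facts about app.py: the `train` split, the `text` column name, the `bert-base-uncased` checkpoint, taking the first 100 streamed rows, and the naive punctuation-based sentence splitter. The `datasets` and `transformers` imports are kept inside the functions so the sentence helper can be tried without either library installed.

```python
import re
from itertools import islice


def first_two_sentences(text):
    """Keep at most the first two sentences, using a naive punctuation split."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:2])


def sample_documents(n_docs=100, text_field="text"):
    """Stream the dataset and keep n_docs documents, trimmed to two sentences.

    Streaming avoids downloading the full dataset; only the rows that are
    actually iterated over are fetched. Requires `pip install datasets`.
    """
    from datasets import load_dataset

    stream = load_dataset(
        "mlfoundations/dclm-baseline-1.0-parquet",
        split="train",  # assumed split name
        streaming=True,
    )
    return [first_two_sentences(row[text_field]) for row in islice(stream, n_docs)]


def load_tokenizer(name="bert-base-uncased"):
    """Load a BERT tokenizer; the checkpoint name is an assumption.

    Requires `pip install transformers`.
    """
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name)


if __name__ == "__main__":
    print(first_two_sentences("One sentence. A second one! A third that is dropped."))
```

A regex split like this mishandles abbreviations such as "Dr." or "e.g.", so a proper sentence segmenter would be the safer choice for real data.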