human-mlm-clm-predictor / README.md

orionweller

Update README.md

8f1d1e1 verified 22 days ago

	---
	title: Human Mlm Clm Predictor
	emoji: 📉
	colorFrom: red
	colorTo: red
	sdk: gradio
	sdk_version: 5.20.0
	app_file: app.py
	pinned: false
	license: mit
	short_description: See if you can predict the masked tokens / next token!
	---

	# MLM and NTP Testing App

	This Hugging Face Space, built with Gradio, tests users on two fundamental NLP tasks:

	1. Masked Language Modeling (MLM) - Guess the masked words in a text
	2. Next Token Prediction (NTP) - Predict how a text continues

	## Features

	- Switch between MLM and NTP tasks with a simple radio button
	- Adjust masking/cutting ratio to control difficulty
	- Sample texts from the cc_news dataset (100 samples, limited to 2 sentences)
	- Track and display user accuracy for both tasks
	- Show detailed feedback on answers
	- Predict token by token in the NTP task, with immediate feedback after each guess

	## How to Use

	### For MLM Task
	1. Select "mlm" in the Task Type radio button
	2. Adjust mask ratio as desired (higher = more difficult)
	3. Click "New Sample" to get a text with [MASK] tokens
	4. Enter your guesses for the masked words, separated by spaces or commas
	5. Click "Check Answer" to see your accuracy
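
The MLM round above can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual code: the function names (`mask_words`, `parse_guesses`, `score_guesses`) are hypothetical, it masks whitespace-separated words rather than BERT tokens, and it assumes case-insensitive exact matching, which the README does not specify.

```python
import random
import re

MASK = "[MASK]"


def mask_words(text, mask_ratio, seed=None):
    """Replace roughly `mask_ratio` of the words with [MASK].

    Returns the masked text and the hidden words in reading order.
    """
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return "", []
    # Mask at least one word so there is always something to guess.
    n_mask = min(max(1, round(len(words) * mask_ratio)), len(words))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    answers = [words[i] for i in positions]
    for i in positions:
        words[i] = MASK
    return " ".join(words), answers


def parse_guesses(raw):
    """Split user input on commas and/or whitespace, dropping empty pieces."""
    return [piece for piece in re.split(r"[,\s]+", raw.strip()) if piece]


def score_guesses(guesses, answers):
    """Fraction of masked words guessed correctly, compared position by position.

    Assumption: matching is case-insensitive; missing guesses count as wrong.
    """
    if not answers:
        return 0.0
    correct = sum(1 for g, a in zip(guesses, answers) if g.lower() == a.lower())
    return correct / len(answers)


masked, answers = mask_words("the quick brown fox jumps", mask_ratio=0.4, seed=0)
accuracy = score_guesses(parse_guesses("quick, fox"), answers)
```

Because the guesses are split on spaces and commas, each guess has to be a single word in this sketch.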

	### For NTP Task
	1. Select "ntp" in the Task Type radio button
	2. Adjust cut ratio as desired (higher = more text is hidden)
	3. Click "New Sample" to get a partial text
	4. Type your prediction for the next token/word
	5. Click "Check Answer" to see if you're correct
	6. Continue predicting the next tokens one by one
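
The token-by-token NTP loop above can be sketched as follows. Treat this as an illustration under stated assumptions: `cut_text` and `NtpRound` are hypothetical names, the cut is applied to a pre-split token list, and the sketch assumes the true token is revealed after every guess (right or wrong) so the next prediction builds on the real text.

```python
def cut_text(tokens, cut_ratio):
    """Split tokens into a visible prefix and a hidden continuation.

    A higher `cut_ratio` hides more of the text. At least one token stays
    visible and at least one stays hidden.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    n_hidden = round(len(tokens) * cut_ratio)
    n_hidden = min(max(n_hidden, 1), len(tokens) - 1)
    split = len(tokens) - n_hidden
    return tokens[:split], tokens[split:]


class NtpRound:
    """One NTP sample: the user predicts the hidden tokens one at a time."""

    def __init__(self, tokens, cut_ratio):
        self.visible, self.hidden = cut_text(list(tokens), cut_ratio)
        self.position = 0  # index of the next hidden token to guess

    @property
    def finished(self):
        return self.position >= len(self.hidden)

    def guess(self, prediction):
        """Check one prediction, reveal the true token, and advance."""
        if self.finished:
            raise RuntimeError("no hidden tokens left")
        target = self.hidden[self.position]
        # Assumption: case-insensitive exact match on a single token.
        is_correct = prediction.strip().lower() == target.lower()
        self.visible.append(target)
        self.position += 1
        return is_correct, target


round_ = NtpRound("the cat sat on the mat".split(), cut_ratio=0.5)
first_result = round_.guess("on")
```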

	## Statistics
	- The app keeps track of your accuracy for both tasks
	- Click "Reset Stats" to start fresh
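
Per-task accuracy tracking with a reset can be sketched like this. The class names `TaskStats` and `SessionStats` are hypothetical, and the sketch assumes accuracy is simply correct answers divided by total answers for each task.

```python
from dataclasses import dataclass, field


@dataclass
class TaskStats:
    """Running counts for one task."""

    correct: int = 0
    total: int = 0

    def record(self, is_correct):
        self.total += 1
        self.correct += int(is_correct)

    @property
    def accuracy(self):
        # Report 0.0 before any answer has been recorded.
        return self.correct / self.total if self.total else 0.0


@dataclass
class SessionStats:
    """Separate counters for the MLM and NTP tasks."""

    mlm: TaskStats = field(default_factory=TaskStats)
    ntp: TaskStats = field(default_factory=TaskStats)

    def reset(self):
        """What a "Reset Stats" button could call."""
        self.mlm = TaskStats()
        self.ntp = TaskStats()


stats = SessionStats()
stats.mlm.record(True)
stats.mlm.record(False)
```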

	## Technical Details
	- Uses the mlfoundations/dclm-baseline-1.0-parquet dataset from the Hugging Face Hub
	- Streams the dataset to sample 100 documents without downloading it in full
	- Uses a BERT tokenizer for consistent tokenization
	- Limits each sample to two sentences for a better user experience
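
The sampling pipeline described above might look roughly like the sketch below. Several details are assumptions rather than facts from this README: the `bert-base-uncased` checkpoint, the `text` column name, the `train` split, and the naive regex sentence splitter. The helper names are hypothetical, and `load_samples` and `tokenize` need network access plus the `datasets` and `transformers` packages.

```python
import re
from functools import lru_cache

DATASET_NAME = "mlfoundations/dclm-baseline-1.0-parquet"  # named in this README
TOKENIZER_NAME = "bert-base-uncased"  # assumption: README only says "BERT tokenizer"
TEXT_FIELD = "text"  # assumption: column that holds the document text


def first_two_sentences(text):
    """Keep at most two sentences, using a naive split after '.', '!' or '?'."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:2])


def load_samples(n_docs=100):
    """Stream `n_docs` documents and trim each one to two sentences."""
    # Imported lazily so the sentence helper works without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(DATASET_NAME, split="train", streaming=True)
    return [first_two_sentences(row[TEXT_FIELD]) for row in stream.take(n_docs)]


@lru_cache(maxsize=1)
def get_tokenizer():
    """Load the tokenizer once and reuse it for every sample."""
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(TOKENIZER_NAME)


def tokenize(text):
    """Tokenize with the shared tokenizer so masking and grading use the same units."""
    return get_tokenizer().tokenize(text)


short = first_two_sentences("One. Two! Three? Four.")
```

With `streaming=True`, rows are fetched as they are iterated, so taking the first 100 documents does not require downloading the whole dataset.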