---
license: mit
pipeline_tag: text-classification
library_name: fasttext
---
This is the fastText pretraining data filter targeting the PIQA task, discussed in the main text of the Perplexity Correlations paper: https://arxiv.org/abs/2409.05816. This filter helps select high-quality pretraining data by identifying strong correlations between LLM perplexity on a given text and its downstream performance on a target benchmark (PIQA in this case). The filter is trained using perplexity correlations from a sample of 90 LLMs from the Open LLM Leaderboard on texts from tens of thousands of web domains.
The filter is implemented using the fastText library and can be used to select or weight pretraining data samples based on their predicted likelihood of improving downstream performance.
For more information on the methodology and usage, please refer to the Perplexity Correlations paper and the project repository.
```python
import fasttext

# Load the pre-trained fastText filter model
model = fasttext.load_model('fasttext_filter.bin')

# Example usage: get the 'include' probability for a piece of text.
# fastText's predict() returns a tuple of (labels, probabilities);
# here the labels are '__label__include' and '__label__exclude'.
text = "Some text to filter."
labels, probs = model.predict(text, k=2)
include_prob = dict(zip(labels, probs)).get('__label__include', 0.0)
print(f"Prediction: {labels[0]}, P(include): {include_prob:.4f}")
```
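To use such scores for data selection, one option is to keep only the highest-scoring fraction of a document pool. The sketch below shows this pattern with a hypothetical `select_top_fraction` helper (not part of the paper's code); the `score_fn` argument stands in for a wrapper around the filter's `predict()` call.

```python
from typing import Callable, List

def select_top_fraction(
    texts: List[str],
    score_fn: Callable[[str], float],
    fraction: float = 0.5,
) -> List[str]:
    """Keep the highest-scoring fraction of documents.

    `score_fn` maps a document to its 'include' probability, e.g. a
    wrapper around the fastText filter's predict() call (hypothetical).
    """
    ranked = sorted(texts, key=score_fn, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

# Demo with a stand-in scorer (document length as a dummy quality proxy):
docs = ["a", "a longer document", "medium doc"]
print(select_top_fraction(docs, score_fn=len, fraction=0.34))
# → ['a longer document']
```

In practice, weighting samples by `include_prob` instead of hard filtering is also possible, as the paper discusses; the hard cutoff above is just the simplest variant.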