---
license: mit
pipeline_tag: text-classification
library_name: fasttext
---
This is the fastText pretraining data filter targeting the PIQA task, discussed in the main text of the Perplexity Correlations paper: https://arxiv.org/abs/2409.05816. This filter helps select high-quality pretraining data by identifying strong correlations between LLM perplexity on a given text and downstream performance on a target benchmark (PIQA in this case). The filter is trained using perplexity correlations from a sample of 90 LLMs from the Open LLM Leaderboard, evaluated on texts from tens of thousands of web domains.

The filter is implemented using the `fasttext` library and can be used to select or weight pretraining data samples based on their predicted likelihood of improving downstream performance.

For more information on the methodology and usage, please refer to the [Perplexity Correlations paper](https://arxiv.org/abs/2409.05816) and the [project repository](https://github.com/TristanThrush/perplexity-correlations).
```python
import fasttext

# Load the pretrained fastText filter
model = fasttext.load_model('fasttext_filter.bin')

# Example usage: get the probability that a piece of text should be included.
# predict returns a tuple of (labels, probabilities); with k=2 we get
# probabilities for both the '__label__include' and '__label__exclude' labels.
text = "Some text to filter."
labels, probs = model.predict(text, k=2)
include_prob = dict(zip(labels, probs))['__label__include']

print(f"Prediction: {labels[0]}, include probability: {include_prob:.4f}")
```
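Once each document has an include probability, selecting a pretraining subset reduces to thresholding or top-k selection over those scores. The sketch below uses placeholder scores and a hypothetical `keep_fraction` parameter rather than real model outputs; in practice the scores would come from `model.predict` as shown above.

```python
# Minimal sketch: keep the top fraction of documents by 'include' probability.
# The (document, score) pairs here are illustrative placeholders.
scored_docs = [
    ("Physical reasoning text about everyday objects.", 0.91),
    ("Low-quality boilerplate text.", 0.12),
    ("Instructional how-to content.", 0.78),
    ("Random character noise.", 0.05),
]

keep_fraction = 0.5  # hypothetical knob: keep the top 50% of documents
k = max(1, int(len(scored_docs) * keep_fraction))

# Sort by include probability, highest first, and keep the top k
selected = sorted(scored_docs, key=lambda d: d[1], reverse=True)[:k]
print([doc for doc, _ in selected])
```

Instead of a hard cutoff, the same scores can serve as sampling weights so that lower-scoring documents are down-weighted rather than dropped entirely.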