---
license: mit
pipeline_tag: text-classification
library_name: fasttext
---
This is the fastText pretraining data filter targeting the PIQA task, discussed in the main text of the Perplexity Correlations paper: https://arxiv.org/abs/2409.05816. The filter helps select high-quality pretraining data by identifying text on which LLM perplexity correlates strongly with the LLMs' downstream performance on a target benchmark (PIQA in this case). It is trained using perplexity correlations estimated from a sample of 90 LLMs from the Open LLM Leaderboard on texts from tens of thousands of web domains.
The filter is implemented using the `fasttext` library and can be used to select or weight pretraining data samples based on their predicted likelihood of improving downstream performance.
For more information on the methodology and usage, please refer to the [Perplexity Correlations paper](https://arxiv.org/abs/2409.05816) and the [project repository](https://github.com/TristanThrush/perplexity-correlations).
```python
import fasttext
# Load the pre-trained fastText model
model = fasttext.load_model('fasttext_filter.bin')
# Example usage: get the 'include' probability for a piece of text
text = "Some text to filter."
# fastText's predict() raises an error on newlines, so flatten the text first
text = text.replace("\n", " ")
# k=-1 returns every label with its probability, most likely label first
labels, probs = model.predict(text, k=-1)
prediction = labels[0]  # '__label__include' or '__label__exclude'
# Look up the probability assigned to 'include', whichever position it is in
probability = dict(zip(labels, probs)).get('__label__include', 0.0)
print(f"Prediction: {prediction}, Probability: {probability}")
```
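
To illustrate how the filter might be used to select samples, here is a minimal sketch that scores a list of documents and keeps the highest-scoring fraction. The helper names `include_probability` and `select_top_fraction`, and the `fraction` parameter, are hypothetical and not part of the project repository; the selection rule is an assumption for illustration rather than the exact procedure from the paper.

```python
from typing import List, Sequence, Tuple


def include_probability(model, text: str) -> float:
    """Return the probability the model assigns to '__label__include'."""
    # fastText's predict() raises an error on newlines, so flatten the text.
    labels, probs = model.predict(text.replace("\n", " "), k=-1)
    return float(dict(zip(labels, probs)).get("__label__include", 0.0))


def select_top_fraction(
    model, docs: Sequence[str], fraction: float = 0.5
) -> List[Tuple[str, float]]:
    """Keep the highest-scoring fraction of docs, in their original order.

    `fraction` is a hypothetical knob for this sketch, not a value
    recommended by the paper.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not docs:
        return []
    scores = [include_probability(model, doc) for doc in docs]
    n_keep = max(1, round(len(docs) * fraction))
    # Rank document indices by score, highest first, and keep the top n_keep.
    top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:n_keep]
    # Restore the original document order among the kept items.
    return [(docs[i], scores[i]) for i in sorted(top)]


# Usage (assumes fasttext_filter.bin has been downloaded locally):
# import fasttext
# model = fasttext.load_model("fasttext_filter.bin")
# kept = select_top_fraction(model, ["first doc", "second doc"], fraction=0.5)
```

The returned scores could also serve as sampling weights instead of a hard cutoff, if weighting rather than selecting is preferred.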