This is a fine-tuned version of Alibaba-NLP/gte-large-en-v1.5 optimized for retrieval over SEC financial documents. It accepts text input with a context length of up to 8192 tokens and maps it to a 1024-dimensional dense vector space, enabling semantic textual similarity, semantic search, and clustering. Fine-tuning was conducted on a dataset of SEC documents to improve domain-specific retrieval accuracy.
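The card doesn't include a usage snippet, so here is a minimal sketch with the sentence-transformers library. The repo id `your-namespace/gte-large-en-v1.5-sec` is a placeholder for this model's actual Hub path, and `trust_remote_code=True` is passed because the gte-large-en-v1.5 family ships custom modeling code; the cosine helper just illustrates how the 1024-dimensional embeddings are compared.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts, model_id="your-namespace/gte-large-en-v1.5-sec"):
    """Encode texts into 1024-dim vectors.

    model_id is a hypothetical placeholder; substitute the real Hub path.
    Requires: pip install sentence-transformers
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id, trust_remote_code=True)
    # normalize_embeddings=True makes dot product equal to cosine similarity
    return model.encode(texts, normalize_embeddings=True)
```

In a retrieval setting you would embed the query and each candidate passage, then rank passages by `cosine_similarity` against the query vector.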

The dataset consists of 9,400 query-context pairs for training, 1,770 pairs for validation, and 1,190 pairs for testing, generated from a total of 6 PDFs using gpt-4o-mini. Fine-tuning was run for 5 epochs (chosen after experimenting with fewer and more epochs) using LlamaIndex's fine-tuning pipeline, optimizing the model for retrieval. The dataset can be found here. For more details about the original model, please refer to its model card.

Downloads last month: 512
Model size: 434M params (Safetensors)
Tensor type: F32