This is a fine-tuned version of Alibaba-NLP/gte-large-en-v1.5 optimized for retrieval over SEC financial documents. It accepts text input with a context length of up to 8192 tokens and maps it to a 1024-dimensional dense vector space, enabling semantic textual similarity, semantic search, and clustering. Fine-tuning was conducted on a dataset of SEC documents to improve domain-specific retrieval accuracy.
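The card doesn't include a usage snippet, so here is a minimal sketch with the sentence-transformers library. The repo id `your-namespace/gte-large-en-v1.5-sec` is a placeholder for this model's actual Hub path, and `trust_remote_code=True` is passed because the gte-large-en-v1.5 family ships custom modeling code; the cosine helper just illustrates how the 1024-dimensional embeddings are compared.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts, model_id="your-namespace/gte-large-en-v1.5-sec"):
    """Encode texts into 1024-dim vectors.

    model_id is a hypothetical placeholder; substitute the real Hub path.
    Requires: pip install sentence-transformers
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id, trust_remote_code=True)
    # normalize_embeddings=True makes dot product equal to cosine similarity
    return model.encode(texts, normalize_embeddings=True)
```

In a retrieval setting you would embed the query and each candidate passage, then rank passages by `cosine_similarity` against the query vector.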

The dataset consists of 9,400 query-context pairs for training, 1,770 pairs for validation, and 1,190 pairs for testing, generated from a total of 6 PDFs using gpt-4o-mini. Fine-tuning was run for 5 epochs (chosen after experimenting with fewer and more epochs) using LlamaIndex's fine-tuning pipeline, optimizing the model for retrieval. The dataset can be found here. For more details about the original model, please refer to its model card.

Downloads last month: 512
Model size: 434M params (Safetensors)
Tensor type: F32