DataWise/gte-large-en-v1.5_SEC_docs_ft

Sentence Similarity

sentence-transformers

text-embeddings-inference

terilias committed on Nov 20, 2024

Commit 55d56a1 · verified · 1 Parent(s): 22b3e68

Update README.md

Files changed (1)

README.md +1 -1

README.md CHANGED

@@ -9,6 +9,6 @@ tags:
   - sentence-similarity
 ---
-This is a finetuned version of [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) optimized for [SEC](https://www.sec.gov/search-filings) financial documents retrieval. It supports text input with a context length of up to 8192 tokens, , mapping it to a 1024-dimensional dense vector space, enabling semantic textual similarity, semantic search, and clustering. Fine-tuning was conducted using a dataset of SEC documents to improve domain-specific retrieval accuracy.
+This is a finetuned version of [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) optimized for [SEC](https://www.sec.gov/search-filings) financial documents retrieval. It supports text input with a context length of up to 8192 tokens, mapping it to a 1024-dimensional dense vector space, enabling semantic textual similarity, semantic search, and clustering. Fine-tuning was conducted using a dataset of SEC documents to improve domain-specific retrieval accuracy.
 The dataset consists of 9400 query-context pairs for training, 1770 pairs for validation, and 1190 pairs for testing, created from a total of 6 PDFs using gpt-4o-mini. Fine-tuning was conducted over 5 epochs (after some experimentation with less and more epochs) using [LlamaIndex’s pipeline](https://docs.llamaindex.ai/en/stable/examples/finetuning/embeddings/finetune_embedding/), optimizing the model for retrieval. The dataset can be found [here](https://huggingface.co/DataWise/gte-large-en-v1.5_SEC_docs_ft/tree/main/data_for_fine_tuning). For more details about the original model, please refer to its [model card](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5).
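
The README text describes mapping text to dense vectors for semantic search but shows no usage. The following is a minimal sketch of the ranking step only, not the authors' code. The helper names `cosine` and `rank_passages` are hypothetical, and the `embed` argument is a placeholder for whatever function turns a list of strings into a list of vectors.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(
    query: str,
    passages: List[str],
    embed: Callable[[List[str]], List[Vector]],
) -> List[Tuple[float, str]]:
    """Return (score, passage) pairs sorted from most to least similar to the query."""
    query_vec = embed([query])[0]
    passage_vecs = embed(passages)
    scored = [(cosine(query_vec, vec), text) for vec, text in zip(passage_vecs, passages)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Assumed wiring to the model (not run here; it downloads weights). The
# trust_remote_code=True flag is an assumption carried over from the base
# model's card and may not be required for this checkpoint.
#
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("DataWise/gte-large-en-v1.5_SEC_docs_ft", trust_remote_code=True)
# ranked = rank_passages(
#     "What was net revenue for the fiscal year?",
#     passages,
#     lambda texts: model.encode(texts).tolist(),
# )
```

Passing the embedding function in as an argument keeps the ranking logic testable without downloading the model. With the real model, each vector would have the 1024 dimensions stated in the README.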