terilias committed
Commit 772497d · verified · 1 Parent(s): c1e2fb6

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -1,3 +1,3 @@
  This is a finetuned version of [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) optimized for SEC document retrieval. It maps text to a 1024-dimensional dense vector space, enabling semantic textual similarity, semantic search, and clustering. Fine-tuning was conducted using a dataset of SEC documents to improve domain-specific retrieval accuracy.

- The dataset consists of 9400 query-context pairs for training, 1770 pairs for validation, and 1190 pairs for testing, created from a total of 6 PDFs using gpt-4o-mini. Fine-tuning was conducted over 5 epochs (after some experimentation with fewer and more epochs) using LlamaIndex’s pipeline, optimizing the model for retrieval. The dataset, along with detailed instructions for reproducibility, will be uploaded soon.
+ The dataset consists of 9400 query-context pairs for training, 1770 pairs for validation, and 1190 pairs for testing, created from a total of 6 PDFs using gpt-4o-mini. Fine-tuning was conducted over 5 epochs (after some experimentation with fewer and more epochs) using [LlamaIndex’s pipeline](https://docs.llamaindex.ai/en/stable/examples/finetuning/embeddings/finetune_embedding/), optimizing the model for retrieval. The dataset, along with detailed instructions for reproducibility, will be uploaded soon.
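
For context, a minimal sketch of what the fine-tuning run described in the README might look like with LlamaIndex's `SentenceTransformersFinetuneEngine` (the pipeline linked in the diff). The dataset file names, output path, and other details are assumptions drawn from the README text, not the author's actual artifacts (those are not yet uploaded), and import paths vary across LlamaIndex versions.

```python
# Hedged sketch of the fine-tuning flow described in the README, not the
# author's exact script. Assumes query-context pairs were already generated
# with gpt-4o-mini and saved in LlamaIndex's EmbeddingQAFinetuneDataset JSON
# format; file names and the output path below are placeholders.
# Note: import paths differ between LlamaIndex versions.
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
from llama_index.finetuning import SentenceTransformersFinetuneEngine

# Load the (hypothetical) train/validation splits of query-context pairs.
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

# Fine-tune the gte-large-en-v1.5 base model for retrieval; 5 epochs as
# stated in the README. (gte-large-en-v1.5 may require trust_remote_code=True
# when loaded, depending on the engine/sentence-transformers versions in use.)
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="Alibaba-NLP/gte-large-en-v1.5",
    model_output_path="gte-large-en-v1.5-sec-finetuned",  # placeholder path
    val_dataset=val_dataset,
    epochs=5,
)
finetune_engine.finetune()

# The resulting model maps text to 1024-dimensional vectors for semantic search.
embed_model = finetune_engine.get_finetuned_model()
```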