---
license: apache-2.0
language:
- en
tags:
- sentence-transformers
- gte
- SEC
- sentence-similarity
---
This is a fine-tuned version of [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) optimized for SEC document retrieval. It maps text to a 1024-dimensional dense vector space, enabling semantic textual similarity, semantic search, and clustering. Fine-tuning was performed on a dataset of SEC documents to improve domain-specific retrieval accuracy.

The dataset consists of 9,400 query-context pairs for training, 1,770 pairs for validation, and 1,190 pairs for testing, generated from a total of 6 PDFs using gpt-4o-mini. Fine-tuning ran for 5 epochs (after experimentation with both fewer and more) using [LlamaIndex's pipeline](https://docs.llamaindex.ai/en/stable/examples/finetuning/embeddings/finetune_embedding/), optimizing the model for retrieval. The dataset can be found [here](https://huggingface.co/DataWise/gte-large-en-v1.5_SEC_docs_ft/tree/main/data_for_fine_tuning). For more details about the base model, please refer to its [model card](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5).