Update README.md
README.md
- bag-of-words
---

# opensearch-neural-sparse-encoding-doc-v3-gte

## Select the model

The model should be selected considering search relevance, model inference, and retrieval efficiency (FLOPS). We benchmark the models' performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | ✔️ | 67M | 0.504 | 1.8 |
| [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | ✔️ | 23M | 0.497 | 1.7 |
| [opensearch-neural-sparse-encoding-doc-v3-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) | ✔️ | 67M | 0.517 | 1.8 |
| [opensearch-neural-sparse-encoding-doc-v3-gte](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte) | ✔️ | 133M | 0.546 | 1.7 |

## Overview

- **Paper**:
  - [Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers](https://arxiv.org/abs/2504.14839)
  - [Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers](https://arxiv.org/abs/2411.04403)
- **Codes**: [opensearch-sparse-model-tuning-sample](https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample/tree/l0_enhance)

This is a learned sparse retrieval model. It encodes documents into 30522-dimensional **sparse vectors**. For queries, it uses only a tokenizer and a weight look-up table to generate sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight reflects the importance of that token. The similarity score is the inner product of the query and document sparse vectors.

```
# download the idf file from model hub. idf is used to give weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0]*tokenizer.vocab_size
    # ...
    return torch.tensor(idf_vector)

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and id_to_token transform for post-process
# ...

# get similarity score
sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
print(sim_score)  # tensor(12.5747, grad_fn=<DotBackward0>)

query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
# ...
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))

# result:
# score in query: 5.7729, score in document: 0.9703, token: ny
# score in query: 4.5684, score in document: 1.0387, token: weather
# score in query: 3.5895, score in document: 0.5861, token: now
# score in query: 0.4989, score in document: 0.2494, token: in
```
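The helper `transform_sparse_vector_to_dict` is referenced above but its definition falls outside this excerpt. A minimal sketch of what such a post-processing helper might look like follows; the signature, the toy vocabulary, and the use of plain lists instead of torch tensors are all assumptions made for illustration:

```python
# Hypothetical sketch of a sparse-vector-to-dict post-processing step:
# keep only non-zero dimensions of each dense vector in a batch, and map
# each dimension index back to its vocabulary token.
def transform_sparse_vector_to_dict(sparse_vectors, id_to_token):
    result = []
    for vector in sparse_vectors:
        result.append({id_to_token[i]: w for i, w in enumerate(vector) if w != 0})
    return result

id_to_token = ["ny", "weather", "now", "in"]    # toy vocabulary
batch = [[5.7729, 4.5684, 3.5895, 0.4989]]      # one dense query vector
print(transform_sparse_vector_to_dict(batch, id_to_token)[0])
# {'ny': 5.7729, 'weather': 4.5684, 'now': 3.5895, 'in': 0.4989}
```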

The code sample above demonstrates neural sparse search: although the original query and document share no overlapping tokens, the model still produces a strong match.
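The printed similarity score can be sanity-checked by hand. Using only the four shared tokens and the query/document weights reported in the result above, the inner product over those dimensions already comes to roughly the `tensor(12.5747)` printed earlier, showing that the expansion tokens carry essentially the entire match:

```python
# Recompute the similarity from the shared tokens reported in the result.
# An inner product of sparse vectors only gets contributions from
# dimensions (tokens) that are non-zero in both vectors.
query_weights = {"ny": 5.7729, "weather": 4.5684, "now": 3.5895, "in": 0.4989}
doc_weights = {"ny": 0.9703, "weather": 1.0387, "now": 0.5861, "in": 0.2494}

score = sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)
print(round(score, 2))  # 12.57, close to the full inner product 12.5747
```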

| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | 0.504 | 0.690 | 0.343 | 0.528 | 0.675 | 0.357 | 0.496 | 0.287 | 0.418 | 0.166 | 0.818 | 0.224 | 0.715 | 0.841 |
| [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | 0.497 | 0.709 | 0.336 | 0.510 | 0.666 | 0.338 | 0.480 | 0.285 | 0.407 | 0.164 | 0.812 | 0.216 | 0.699 | 0.837 |
| [opensearch-neural-sparse-encoding-doc-v3-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) | 0.517 | 0.724 | 0.345 | 0.544 | 0.694 | 0.356 | 0.520 | 0.294 | 0.424 | 0.163 | 0.845 | 0.239 | 0.708 | 0.863 |
| [opensearch-neural-sparse-encoding-doc-v3-gte](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte) | 0.546 | 0.734 | 0.360 | 0.582 | 0.716 | 0.407 | 0.520 | 0.389 | 0.455 | 0.167 | 0.860 | 0.312 | 0.725 | 0.873 |

</div>

## License