zhichao-geng committed on
Commit d9f3f74 · verified · 1 Parent(s): 9bdfa3d

Update README.md

Files changed (1):
  1. README.md +15 -14
README.md CHANGED
@@ -11,7 +11,7 @@ tags:
  - bag-of-words
  ---
 
- # opensearch-neural-sparse-encoding-doc-v3-distill
+ # opensearch-neural-sparse-encoding-doc-v3-gte
 
  ## Select the model
  The model should be selected considering search relevance, model inference, and retrieval efficiency (FLOPS). We benchmark the models' performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, Quora.
@@ -26,9 +26,12 @@ Overall, the v3 series of models have better search relevance, efficiency and in
  | [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | ✔️ | 67M | 0.504 | 1.8 |
  | [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | ✔️ | 23M | 0.497 | 1.7 |
  | [opensearch-neural-sparse-encoding-doc-v3-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) | ✔️ | 67M | 0.517 | 1.8 |
+ | [opensearch-neural-sparse-encoding-doc-v3-gte](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte) | ✔️ | 133M | 0.546 | 1.7 |
 
  ## Overview
- - **Paper**: [Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers](https://arxiv.org/abs/2504.14839)
+ - **Paper**:
+   - [Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers](https://arxiv.org/abs/2504.14839)
+   - [Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers](https://arxiv.org/abs/2411.04403)
  - **Codes**: [opensearch-sparse-model-tuning-sample](https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample/tree/l0_enhance)
 
  This is a learned sparse retrieval model. It encodes documents into 30522-dimensional **sparse vectors**. For queries, it uses only a tokenizer and a weight look-up table to generate sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight gives the importance of that token. The similarity score is the inner product of the query and document sparse vectors.
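To make that context paragraph concrete, here is a minimal, self-contained sketch of the scoring scheme it describes. All names and numbers below are illustrative assumptions, not code or values from the model card: queries are encoded with nothing but a tokenizer and an idf look-up table (no model forward pass), documents come from a model forward pass, and relevance is the inner product over the shared vocabulary space.

```python
# Minimal sketch of inference-free neural sparse scoring
# (illustrative values; not taken from the model card or idf.json).

# hypothetical idf look-up table: token -> query-side weight
IDF = {"weather": 4.57, "ny": 5.77, "now": 3.59, "in": 0.50}

def encode_query(tokens):
    """Query side needs no model forward pass: each token emitted by the
    tokenizer receives its idf weight; all other dimensions stay zero."""
    return {t: IDF[t] for t in tokens if t in IDF}

def score(query_vec, doc_vec):
    """Inner product of two sparse vectors: only dimensions (tokens) that
    are non-zero in both contribute to the relevance score."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# document weights would normally come from the model's forward pass
doc_vec = {"weather": 1.04, "ny": 0.97, "now": 0.59, "york": 0.80}

query_vec = encode_query(["what", "is", "the", "weather", "in", "ny", "now"])
print(score(query_vec, doc_vec))  # sums over the overlapping tokens only
```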
@@ -75,7 +78,7 @@ def transform_sparse_vector_to_dict(sparse_vector):
  # download the idf file from model hub. idf is used to give weights for query tokens
  def get_tokenizer_idf(tokenizer):
      from huggingface_hub import hf_hub_download
-     local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill", filename="idf.json")
+     local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte", filename="idf.json")
      with open(local_cached_path) as f:
          idf = json.load(f)
      idf_vector = [0]*tokenizer.vocab_size
@@ -85,8 +88,8 @@ def get_tokenizer_idf(tokenizer):
      return torch.tensor(idf_vector)
 
  # load the model
- model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill")
- tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill")
+ model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte")
  idf = get_tokenizer_idf(tokenizer)
 
  # set the special tokens and id_to_token transform for post-process
@@ -118,7 +121,7 @@ document_sparse_vector = get_sparse_vector(feature_document, output)
 
  # get similarity score
  sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
- print(sim_score)   # tensor(11.1105, grad_fn=<DotBackward0>)
+ print(sim_score)   # tensor(12.5747, grad_fn=<DotBackward0>)
 
 
  query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
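The hunk headers above reference two helpers, `get_sparse_vector` and `transform_sparse_vector_to_dict`, whose definitions sit in unchanged parts of the README and therefore do not appear in this diff. For orientation, a plausible minimal sketch of what such helpers do (the names come from the diff; the bodies and the extra `id_to_token` parameter are assumptions, not the model card's exact code):

```python
import torch

def get_sparse_vector(feature, output):
    # Assumed sketch: max-pool the MLM logits over the sequence, masking
    # padding positions, then apply log(1 + relu(x)) so weights are
    # non-negative and large activations are dampened.
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    return torch.log(1 + torch.relu(values))

def transform_sparse_vector_to_dict(sparse_vector, id_to_token):
    # Assumed sketch: map each non-zero dimension of every row back to its
    # vocabulary token, yielding one {token: weight} dict per input.
    result = []
    for row in sparse_vector:
        indices = torch.nonzero(row, as_tuple=True)[0]
        result.append({id_to_token[i.item()]: row[i].item() for i in indices})
    return result
```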
@@ -127,15 +130,12 @@ for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reve
      if token in document_query_token_weight:
          print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
 
-
-
+
  # result:
- # score in query: 5.7729, score in document: 0.8049, token: ny
- # score in query: 4.5684, score in document: 0.9710, token: weather
- # score in query: 3.5895, score in document: 0.4720, token: now
- # score in query: 3.3313, score in document: 0.0286, token: ?
- # score in query: 2.7699, score in document: 0.0787, token: what
- # score in query: 0.4989, score in document: 0.0417, token: in
+ # score in query: 5.7729, score in document: 0.9703, token: ny
+ # score in query: 4.5684, score in document: 1.0387, token: weather
+ # score in query: 3.5895, score in document: 0.5861, token: now
+ # score in query: 0.4989, score in document: 0.2494, token: in
  ```
 
  The above code sample shows an example of neural sparse search: although the original query and document share no overlapping surface tokens, the model still produces a strong match.
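As a sanity check on the updated numbers: since both vectors live in the same vocabulary space, the inner product decomposes into a sum over the overlapping tokens, so the per-token scores in the new result block should reproduce the new similarity score printed earlier. They do, up to rounding:

```python
# query weight * document weight for each overlapping token
# in the updated result block
pairs = {
    "ny":      (5.7729, 0.9703),
    "weather": (4.5684, 1.0387),
    "now":     (3.5895, 0.5861),
    "in":      (0.4989, 0.2494),
}
total = sum(q * d for q, d in pairs.values())
print(total)  # ~12.575, matching tensor(12.5747) up to rounding
```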
@@ -152,6 +152,7 @@ The above code sample shows an example of neural sparse search. Although there i
  | [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | 0.504 | 0.690 | 0.343 | 0.528 | 0.675 | 0.357 | 0.496 | 0.287 | 0.418 | 0.166 | 0.818 | 0.224 | 0.715 | 0.841 |
  | [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | 0.497 | 0.709 | 0.336 | 0.510 | 0.666 | 0.338 | 0.480 | 0.285 | 0.407 | 0.164 | 0.812 | 0.216 | 0.699 | 0.837 |
  | [opensearch-neural-sparse-encoding-doc-v3-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) | 0.517 | 0.724 | 0.345 | 0.544 | 0.694 | 0.356 | 0.520 | 0.294 | 0.424 | 0.163 | 0.845 | 0.239 | 0.708 | 0.863 |
+ | [opensearch-neural-sparse-encoding-doc-v3-gte](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte) | 0.546 | 0.734 | 0.360 | 0.582 | 0.716 | 0.407 | 0.520 | 0.389 | 0.455 | 0.167 | 0.860 | 0.312 | 0.725 | 0.873 |
  </div>
 
  ## License