spacemanidol committed · Commit 9fd85e7 · verified · 1 Parent(s): f9400d2

Update README.md

Files changed (1): README.md (+43 −22)
README.md CHANGED
@@ -89,6 +89,7 @@ language:
  <h1 align="center">Snowflake's Arctic-embed-l-v2.0</h1>
  <h4 align="center">
  <p>
  <a href=#models>Models</a> |
  <a href=#usage>Usage</a> |
  <a href="#evaluation">Evaluation</a> |
@@ -100,16 +101,24 @@ language:
  </h4>


  ## Models

- MIRACL (4) Voyage misc. (9) CLEF (5) CLEF, max context length Multilingual CLEF
- Snowflake's snowflake-arctic-embed-l-v2.0 is a multilingual text embedding model that focuses on providing
- BEIR
- 0.556 0.558 0.655 0.529 0.541 0.543
- 0.543 0.543 0.644 0.519 0.528 0.534

- Focused on

  | Model Name | # params | # non-emb params | # dimensions | BEIR (15) | MIRACL (4) | CLEF (Focused) | CLEF (Full) |
  |---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
@@ -126,7 +135,8 @@ Focused on
  | snowflake-arctic-m-v2.0 | 305M | 113M | 768 | 0.554 | 0.552 | 0.517 | 0.539 |
  | snowflake-arctic-l-v2.0 | 568M | 303M | 1024 | 0.556 | 0.558 | 0.529 | 0.543 |

- MRL

  | Model | | BEIR (15) | Relative Performance | MIRACL (4) | Relative Performance | CLEF (5) | Relative Performance | CLEF (Full) | Relative Performance |
  |---|---|:---:|:---:|:---:|:---:|:---:|---|---|---|
@@ -135,30 +145,41 @@ MRL
  | snowflake-arctic-m-v2.0 | 768 | 0.554 | N/A | 0.552 | N/A | 0.517 | N/A | 0.539 | N/A |
  | snowflake-arctic-m-v2.0 | 256 | 0.544 | -1.81% | 0.54 | -2.17% | 0.506 | -2.13% | 0.523 | -3.06% |

- The `snowflake-arctic-embedding` models achieve **state-of-the-art performance on the MTEB/BEIR leaderboard** for each of their size variants. Evaluation is performed using these [scripts](https://github.com/Snowflake-Labs/snowflake-arctic-embed/tree/main/src). As shown below, each class of model size achieves SOTA retrieval accuracy compared to other top models.
-
-
- The models are trained by leveraging existing open-source text representation models, such as bert-base-uncased, and are trained in a multi-stage pipeline to optimize their retrieval performance. First, the models are trained with large batches of query-document pairs where negatives are derived in-batch—pretraining leverages about 400m samples of a mix of public datasets and proprietary web search data. Following pretraining, models are further optimized with long training on a smaller dataset (about 1m samples) of triplets of query, positive document, and negative document derived from hard negative mining. Mining of the negatives and data curation is crucial to retrieval accuracy. A detailed technical report can be found [here](https://arxiv.org/abs/2405.05374).

- | Name | MTEB Retrieval Score (NDCG @ 10) | Parameters (Millions) | Embedding Dimension |
- | ----------------------------------------------------------------------- | -------------------------------- | --------------------- | ------------------- |
- | [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/) | 50.15 | 22 | 384 |
- | [snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-s/) | 51.98 | 33 | 384 |
- | [snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m/) | 54.90 | 110 | 768 |
- | [snowflake-arctic-embed-m-long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/) | 54.83 | 137 | 768 |
- | [snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/) | 55.98 | 335 | 1024 |

- ## Usage

- ### Using Huggingface transformers

- You can use the transformers package to use a snowflake-arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion and use the query prefix below (just on the query).

  ```python
  import torch
@@ -169,7 +190,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModel.from_pretrained(model_name, add_pooling_layer=False)
  model.eval()

- query_prefix = 'Represent this sentence for searching relevant passages: '
  queries = ['what is snowflake?', 'Where can I get the best tacos?']
  queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
  query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
 
  <h1 align="center">Snowflake's Arctic-embed-l-v2.0</h1>
  <h4 align="center">
  <p>
+ <a href=#news>News</a> |
  <a href=#models>Models</a> |
  <a href=#usage>Usage</a> |
  <a href="#evaluation">Evaluation</a> |
 
  </h4>


+ ## News
+ 12/04/2024: Release of `snowflake-arctic-embed-l-v2.0` and `snowflake-arctic-embed-m-v2.0`, our newest models, designed with multilingual workloads in mind.
+
  ## Models
+ Snowflake arctic-embed-l-v2.0 is the newest addition to the suite of embedding models Snowflake has released, optimized for retrieval performance and inference efficiency.
+ Arctic Embed 2.0 sets a new standard for multilingual embedding models, delivering high-quality multilingual text retrieval without sacrificing performance in English.
+ Released under the permissive Apache 2.0 license, Arctic Embed 2.0 is ideal for applications that demand reliable, enterprise-grade multilingual search and retrieval at scale.
 
+ Key Features:
+
+ 1. Multilingual without compromise: Excels in English and non-English retrieval, outperforming leading open-source and proprietary models on benchmarks like MTEB Retrieval, CLEF, and MIRACL.
+ 2. Inference efficiency: With roughly 300M non-embedding parameters, inference is fast and efficient at any scale.
+ 3. Compression-friendly: Achieves high-quality retrieval with embeddings as small as 128 bytes per vector using Matryoshka Representation Learning (MRL) and quantization-aware embedding training.
+ 4. Drop-In Replacement: arctic-embed-l-v2.0 builds on [XLM-R Large](https://huggingface.co/FacebookAI/xlm-roberta-large), which allows direct drop-in inference replacement with any library, kernel, or inference engine that already supports that architecture.

+ ### Quality Benchmarks
+ Unlike most other open-source models, Arctic-embed-l-v2.0 excels across English (via MTEB Retrieval) and multilingual (via MIRACL and CLEF) retrieval.
+ You no longer need to support multiple models to deliver high-quality English and multilingual retrieval.
 
  | Model Name | # params | # non-emb params | # dimensions | BEIR (15) | MIRACL (4) | CLEF (Focused) | CLEF (Full) |
  |---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|

  | snowflake-arctic-m-v2.0 | 305M | 113M | 768 | 0.554 | 0.552 | 0.517 | 0.539 |
  | snowflake-arctic-l-v2.0 | 568M | 303M | 1024 | 0.556 | 0.558 | 0.529 | 0.543 |
 
+ Aside from high-quality retrieval, Arctic delivers embeddings that are easily compressible. Leverage vector truncation via MRL to decrease vector size by 3-4x with less than 3% degradation in quality.
+ Combine MRL-truncated vectors with vector compression (Int4) to power retrieval in as little as 128 bytes per document (see the sketch after the table below).
 
  | Model | | BEIR (15) | Relative Performance | MIRACL (4) | Relative Performance | CLEF (5) | Relative Performance | CLEF (Full) | Relative Performance |
  |---|---|:---:|:---:|:---:|:---:|:---:|---|---|---|

  | snowflake-arctic-m-v2.0 | 768 | 0.554 | N/A | 0.552 | N/A | 0.517 | N/A | 0.539 | N/A |
  | snowflake-arctic-m-v2.0 | 256 | 0.544 | -1.81% | 0.54 | -2.17% | 0.506 | -2.13% | 0.523 | -3.06% |
 
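A minimal sketch (not from the README) of one way to realize the 128-bytes-per-vector figure: truncate an MRL-trained embedding to 256 dimensions, re-normalize, and pack it with a naive symmetric Int4 quantizer. The 256-dimension cut-off and the simple quantizer below are illustrative assumptions, not the quantization-aware scheme used in training.

```python
import numpy as np

def truncate_and_quantize(embedding: np.ndarray, dim: int = 256) -> np.ndarray:
    """Truncate an MRL embedding, re-normalize, and pack to Int4 (two values per byte)."""
    vec = embedding[:dim]
    vec = vec / np.linalg.norm(vec)  # re-normalize after truncation
    # Map each value into the signed 4-bit range [-8, 7] with a per-vector scale.
    q = np.clip(np.round(vec / np.abs(vec).max() * 7), -8, 7).astype(np.int8)
    nibbles = (q & 0x0F).astype(np.uint8)
    return (nibbles[0::2] << 4) | nibbles[1::2]  # 256 dims -> 128 bytes

# Stand-in vector for illustration; in practice `embedding` would come from
# snowflake-arctic-embed-l-v2.0 (1024-dim, L2-normalized), e.g. via the usage examples below.
embedding = np.random.randn(1024).astype(np.float32)
embedding /= np.linalg.norm(embedding)
packed = truncate_and_quantize(embedding)
print(packed.nbytes)  # 128
```
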
+ ## Usage

+ ### Using Sentence Transformers

+ ```python
+ from sentence_transformers import SentenceTransformer

+ # Load the model
+ model_name = 'Snowflake/snowflake-arctic-embed-l-v2.0'
+ model = SentenceTransformer(model_name)

+ # Define the queries and documents
+ queries = ['what is snowflake?', 'Where can I get the best tacos?']
+ documents = ['The Data Cloud!', 'Mexico City of Course!']

+ # Compute embeddings: use `prompt_name="query"` to encode queries!
+ query_embeddings = model.encode(queries, prompt_name="query")
+ document_embeddings = model.encode(documents)

+ # Compute cosine similarity scores
+ scores = model.similarity(query_embeddings, document_embeddings)

+ # Output the results
+ for query, query_scores in zip(queries, scores):
+     doc_score_pairs = list(zip(documents, query_scores))
+     doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+     print("Query:", query)
+     for document, score in doc_score_pairs:
+         print(score, document)

+ ```
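
To get the smaller MRL vectors discussed above from this same interface, recent versions of sentence-transformers (v2.7+) expose a `truncate_dim` option; the sketch below assumes that option is available in your installed version:

```python
from sentence_transformers import SentenceTransformer

# Truncate embeddings to 256 dimensions at encode time (MRL).
model = SentenceTransformer('Snowflake/snowflake-arctic-embed-l-v2.0', truncate_dim=256)
embeddings = model.encode(['The Data Cloud!'])
print(embeddings.shape)  # expected: (1, 256)
```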
+ ### Using Huggingface Transformers

+ You can use the Hugging Face transformers package with Snowflake's arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion and use the query prefix below (just on the query).
 
  ```python
  import torch

  model = AutoModel.from_pretrained(model_name, add_pooling_layer=False)
  model.eval()

+ query_prefix = 'Query: '
  queries = ['what is snowflake?', 'Where can I get the best tacos?']
  queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
  query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
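  # --- Hedged completion (not part of this diff): the hunk cuts off here. A typical
  # --- continuation of the CLS-pooling flow described above, with document texts assumed
  # --- from the Sentence Transformers example, might look like the following sketch.
  documents = ['The Data Cloud!', 'Mexico City of Course!']
  document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

  # Run the model and keep the CLS token (position 0) as the text embedding.
  with torch.no_grad():
      query_embeddings = model(**query_tokens)[0][:, 0]
      document_embeddings = model(**document_tokens)[0][:, 0]

  # L2-normalize so that a dot product equals cosine similarity.
  query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
  document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

  scores = query_embeddings @ document_embeddings.T
  print(scores)
  ```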