spacemanidol committed · Commit 11a03be · verified · Parent(s): 9fd85e7

Update README.md

Files changed (1): README.md (+27 -23)

README.md CHANGED
@@ -60,7 +60,6 @@ language:
  - my
  - ne
  - nl
- - 'no'
  - pa
  - pl
  - pt
@@ -110,46 +109,50 @@ Arctic Embed 2.0 introduces a new standard for multilingual embedding models, co
  Released under the permissive Apache 2.0 license, Arctic Embed 2.0 is ideal for applications that demand reliable, enterprise-grade multilingual search and retrieval at scale.
 
  Key Features:
- Multilingual without compromise: Excels in English and non-English retrieval, outperforming leading open-source and proprietary models on benchmarks like MTEB Retrieval, CLEF, and MIRACL.
- Inference efficiency: With its 300m non-embedding parameters inference is fast and efficient for any scale.
- Compression-friendly: Achieves high-quality retrieval with embeddings as small as 128 bytes/vector using Matryoshka Representation Learning (MRL) and quantization-aware embedding training.
- Drop-In Replacement: arctic-embed-l-v2.0 builds on [XMLR-Large](https://huggingface.co/FacebookAI/xlm-roberta-large) which allows direct drop-in inference replacement with any form of new libraries, kernels, inferene engines etc.
+
+ 1. Multilingual without compromise: Excels in English and non-English retrieval, outperforming leading open-source and proprietary models on benchmarks like MTEB Retrieval, CLEF, and MIRACL.
+
+ 2. Inference efficiency: With only ~300M non-embedding parameters, inference is fast and efficient at any scale.
+
+ 3. Compression-friendly: Achieves high-quality retrieval with embeddings as small as 128 bytes per vector using Matryoshka Representation Learning (MRL) and quantization-aware embedding training (see the compression sketch below).
+
+ 4. Drop-In Replacement: arctic-embed-l-v2.0 builds on [XLM-R Large](https://huggingface.co/FacebookAI/xlm-roberta-large), allowing direct drop-in replacement with existing libraries, kernels, inference engines, etc.
 
 
  ### Quality Benchmarks
  Unlike most other open-source models, Arctic-embed-l-v2.0 excels across English (via MTEB Retrieval) and multilingual (via MIRACL and CLEF).
- You no longer need to support models to empower high-quality English and multilingual retrieval.
+ You no longer need to maintain separate models for high-quality English and multilingual retrieval. All numbers below are the average NDCG@10 across each benchmark's datasets.
 
  | Model Name | # params | # non-emb params | # dimensions | BEIR (15) | MIRACL (4) | CLEF (Focused) | CLEF (Full) |
  |---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
- | me5 base | 560M | 303M | 1024 | 0.514 | 0.540 | 0.430 | 0.346 |
- | bge-m3 (BAAI) | 568M | 303M | 1024 | 0.488 | 0.568 | 0.408 | 0.413 |
- | gte (Alibaba) | 305M | 113M | 768 | 0.511 | 0.523 | 0.477 | 0.531 |
- | snowflake-arctic-m | 109M | 86M | 768 | 0.549 | 0.249 | 0.344 | 0.291 |
- | snowflake-arctic-l | 335M | 303M | 1024 | 0.560 | 0.348 | 0.382 | 0.337 |
- | snowflake-arctic-m-v2.0 | 305M | 113M | 768 | 0.554 | 0.552 | 0.517 | 0.539 |
- | snowflake-arctic-l-v2.0 | 568M | 303M | 1024 | 0.556 | 0.558 | 0.529 | 0.543 |
+ | me5 base | 560M | 303M | 1024 | 51.4 | 54.0 | 43.0 | 34.6 |
+ | bge-m3 (BAAI) | 568M | 303M | 1024 | 48.8 | 56.8 | 40.8 | 41.3 |
+ | gte (Alibaba) | 305M | 113M | 768 | 51.1 | 52.3 | 47.7 | 53.1 |
+ | snowflake-arctic-m (v1.0) | 109M | 86M | 768 | 54.9 | 24.9 | 34.4 | 29.1 |
+ | snowflake-arctic-l (v1.0) | 335M | 303M | 1024 | 56.0 | 34.8 | 38.2 | 33.7 |
+ | snowflake-arctic-m-v2.0 | 305M | 113M | 768 | 55.4 | 55.2 | 51.7 | 53.9 |
+ | **snowflake-arctic-l-v2.0** | 568M | 303M | 1024 | 55.6 | 55.8 | 52.9 | **54.3** |
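For reference, the NDCG@10 figures above are the standard rank-discounted retrieval metric. The card itself does not spell out the formula; one common formulation (some implementations use plain rel_i in place of the exponential gain) is:

```latex
\mathrm{DCG@10}  = \sum_{i=1}^{10} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
```

where rel_i is the graded relevance of the document at rank i and IDCG@10 is the DCG@10 of the ideal ordering.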
 
  Aside from high-quality retrieval, Arctic delivers embeddings that are easily compressible. Leverage vector truncation via MRL to decrease vector size by 3-4x with less than 3% degradation in quality.
  Combine MRL-truncated vectors with vector compression (Int4) to power retrieval at 128 bytes per doc.
 
  | Model | Dimensions | BEIR (15) | Relative Performance | MIRACL (4) | Relative Performance | CLEF (5) | Relative Performance | CLEF (Full) | Relative Performance |
  |---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
- | snowflake-arctic-l-v2.0 | 1024 | 0.556 | N/A | 0.558 | N/A | 0.529 | N/A | 0.543 | N/A |
- | snowflake-arctic-l-v2.0 | 256 | 0.543 | -0.18% | 0.543 | -2.70% | 0.519 | -1.81% | 0.534 | -1.53% |
- | snowflake-arctic-m-v2.0 | 768 | 0.554 | N/A | 0.552 | N/A | 0.517 | N/A | 0.539 | N/A |
- | snowflake-arctic-m-v2.0 | 256 | 0.544 | -1.81% | 0.54 | -2.17% | 0.506 | -2.13% | 0.523 | -3.06% |
+ | snowflake-arctic-l-v2.0 | 1024 | 55.6 | N/A | 55.8 | N/A | 52.9 | N/A | 54.3 | N/A |
+ | snowflake-arctic-l-v2.0 | 256 | 54.3 | -2.34% | 54.3 | -2.70% | 51.9 | -1.81% | 53.4 | -1.53% |
+ | snowflake-arctic-m-v2.0 | 768 | 55.4 | N/A | 55.2 | N/A | 51.7 | N/A | 53.9 | N/A |
+ | snowflake-arctic-m-v2.0 | 256 | 54.4 | -1.81% | 54.0 | -2.17% | 50.6 | -2.13% | 52.3 | -3.06% |
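To make the 128-bytes-per-document arithmetic concrete: truncating to 256 dimensions and storing each value in 4 bits gives 256 × 4 bits = 128 bytes. Below is a minimal sketch of that storage math; the symmetric quantization scheme is an illustrative assumption, not the card's recipe:

```python
import numpy as np

def compress_mrl_int4(embeddings: np.ndarray, dims: int = 256) -> np.ndarray:
    """Truncate MRL embeddings to `dims` and pack them as Int4 codes."""
    # 1) MRL truncation: keep the leading components, then re-normalize.
    truncated = embeddings[:, :dims].astype(np.float32)
    truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)
    # 2) Symmetric Int4 quantization (illustrative): map values to [-8, 7].
    scale = np.abs(truncated).max()
    q = np.clip(np.round(truncated / scale * 7), -8, 7).astype(np.int8)
    # 3) Pack two 4-bit codes per byte: 256 dims * 4 bits = 128 bytes/vector.
    nibbles = (q + 8).astype(np.uint8)           # shift into [0, 15]
    return (nibbles[:, 0::2] << 4) | nibbles[:, 1::2]

# Example with random stand-ins for 1024-dim arctic-l-v2.0 vectors.
vectors = np.random.randn(4, 1024).astype(np.float32)
codes = compress_mrl_int4(vectors)
print(codes.shape)  # (4, 128) -> 128 bytes per document
```

A production setup would calibrate the scale offline and score with asymmetric distance against the packed codes; the point here is only the byte count.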
 
  ## Usage
 
  ### Using Sentence Transformers
 
- ``
+ ```python
  from sentence_transformers import SentenceTransformer
 
  # Load the model
@@ -176,6 +179,7 @@ for query, query_scores in zip(queries, scores):
  print(score, document)
 
  ```
+
  ### Using Huggingface Transformers
 
 
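The diff hunk ends before the Transformers snippet itself. As a hedged illustration of the drop-in claim above (the model is architecturally XLM-R Large, so vanilla AutoModel loading should suffice), here is a minimal sketch; the "query: " prefix and CLS-token pooling are assumptions based on the arctic-embed family, not lines from this commit:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Snowflake/snowflake-arctic-embed-l-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Query prefix and pooling are assumed here; check the model card's
# official snippet before relying on them.
queries = ["query: what is snowflake?"]
documents = ["The Data Cloud!", "Mexico City of Course!"]

def embed(texts):
    tokens = tokenizer(texts, padding=True, truncation=True,
                       max_length=8192, return_tensors="pt")
    with torch.no_grad():
        out = model(**tokens)[0][:, 0]          # CLS-token embedding
    return torch.nn.functional.normalize(out, p=2, dim=1)

scores = embed(queries) @ embed(documents).T    # cosine similarity
print(scores)
```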