Update README.md
Browse files
README.md
CHANGED
@@ -7,175 +7,133 @@ tags:
license: mit
---

-For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding

-# BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))

-In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
-- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
-- Multi-Linguality: It can support more than 100 working languages.
-- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.

-**Some suggestions for retrieval pipeline in RAG:**

-We recommend to use the following pipeline: hybrid retrieval + re-ranking.
-- Hybrid retrieval leverages the strengths of various methods, offering higher accuracy and stronger generalization capabilities.
-A classic example: using both embedding retrieval and the BM25 algorithm.
-Now, you can try to use BGE-M3, which supports both embedding and sparse retrieval.
-This allows you to obtain token weights (similar to BM25) without any additional cost when generating dense embeddings.
-To use hybrid retrieval, you can refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
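
A minimal sketch of this dense + sparse fusion, using only calls that appear elsewhere in this README (the 0.6/0.4 weights are illustrative, not tuned values):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

query = ["What is BGE M3?"]
doc = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."]

q = model.encode(query, return_dense=True, return_sparse=True)
d = model.encode(doc, return_dense=True, return_sparse=True)

# dense score: dot product of the dense vectors
dense_score = float(q['dense_vecs'][0] @ d['dense_vecs'][0])
# sparse score: overlap of the lexical (token) weights, BM25-style
sparse_score = model.compute_lexical_matching_score(q['lexical_weights'][0], d['lexical_weights'][0])

# simple weighted fusion of the two signals
hybrid_score = 0.6 * dense_score + 0.4 * sparse_score
print(hybrid_score)
```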

-

-## News

-- 2024/7/1: We update the MIRACL evaluation results of BGE-M3; to reproduce the new results, refer to [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr).
-- 2024/3/20: **Thanks Milvus team!** Now you can use hybrid retrieval of bge-m3 in Milvus: [pymilvus/examples/hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
-- 2024/3/8: **Thanks for the [experimental results](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) from @[Yannael](https://huggingface.co/Yannael). In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as OpenAI.**
-- 2024/3/2: Release unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data).
-- 2024/2/6: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
-- 2024/2/1: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb).

-## Specs

-- Model

-| Model Name | Dimension | Sequence Length | Introduction |
|:----:|:---:|:---:|:---:|
-| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | Multilingual; unified fine-tuning (dense, sparse, and colbert) |
-| [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised) | 1024 | 8192 | Multilingual; contrastive learning |
-| [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae) | -- | 8192 | Multilingual; extended max_length of xlm-roberta to 8192 |
-| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1024 | 512 | English model |
-| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 768 | 512 | English model |
-| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | 384 | 512 | English model |

-- Data

-| Dataset | Introduction |
-|:----------------------------------------------------------:|:-------------------------------------------------:|
-| [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages |
-| [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) | Fine-tuning data used by bge-m3 |
## FAQ

-**1. Introduction for different retrieval methods**

-- Dense retrieval: map the text into a single embedding, e.g., [DPR](https://arxiv.org/abs/2004.04906), [BGE-v1.5](https://github.com/FlagOpen/FlagEmbedding)
-- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
-- Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).

-**2. How to use BGE-M3 in other projects?**

-For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE.
-The only difference is that the BGE-M3 model no longer requires adding instructions to the queries.

-For hybrid retrieval, you can use [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).

-**3. How to fine-tune bge-M3 model?**

-You can follow this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to fine-tune the dense embedding.

-If you want to fine-tune all embedding functions of m3 (dense, sparse and colbert), you can refer to the [unified fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune).

## Usage

-Install:
-```
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```
-or:
-```
pip install -U FlagEmbedding
```

-### Generate Embedding for text

-- Dense Embedding
```python
from FlagEmbedding import BGEM3FlagModel

-model = BGEM3FlagModel('BAAI/bge-m3',
-                       use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

-sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

-embeddings_1 = model.encode(sentences_1,
-                            batch_size=12,
-                            max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
-                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
-# [[0.6265, 0.3477], [0.3499, 0.678 ]]
```
-You can also use sentence-transformers and huggingface transformers to generate dense embeddings.
-Refer to [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding#usage) for details.
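
For the sentence-transformers route, a minimal sketch (assuming a recent sentence-transformers release that can load this checkpoint; it yields dense vectors only, while sparse and ColBERT outputs still require `BGEM3FlagModel`):

```python
from sentence_transformers import SentenceTransformer

# Dense-only embeddings via sentence-transformers.
model = SentenceTransformer("BAAI/bge-m3")
embeddings = model.encode(["What is BGE M3?", "Definition of BM25"], normalize_embeddings=True)
print(embeddings.shape)  # expected: (2, 1024)
```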

-- Sparse Embedding (Lexical Weight)
```python
from FlagEmbedding import BGEM3FlagModel

-model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

-sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

-# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
-# [{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04092},
-#  {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633, 'BM': 0.2515, '25': 0.3335}]

-# compute the scores via lexical matching
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
-# 0.19554901123046875

-print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
-# 0.0
```

-- Multi-Vector (ColBERT)
```python
from FlagEmbedding import BGEM3FlagModel

-model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

-sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
@@ -184,111 +142,37 @@ output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, retu

print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
-# 0.7797
-# 0.4620
```

-### Compute score for text pairs
-Input a list of text pairs, and you can get scores computed by different methods.
```python
from FlagEmbedding import BGEM3FlagModel

-model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

-sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

-sentence_pairs = [[i,j] for i in sentences_1 for j in sentences_2]

-print(model.compute_score(sentence_pairs,
-                          max_passage_length=128, # a smaller max length leads to a lower latency
-                          weights_for_different_modes=[0.4, 0.2, 0.4])) # weights_for_different_modes(w) is used to do weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score

-# {
-#   'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142],
-#   'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625],
-#   'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625],
-#   'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
-#   'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
-# }
```

-## Evaluation

-We provide the evaluation scripts for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).

-### Benchmarks from the open-source community
-
-The BGE-M3 model emerged as the top performer on this benchmark (OAI is short for OpenAI).
-For more details, please refer to the [article](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) and [Github Repo](https://github.com/Yannael/multilingual-embeddings).

-### Our results

-- Multilingual (Miracl dataset)
-

-- Cross-lingual (MKQA dataset)
-

-- Long Document Retrieval
-  - MLDR:
-  
-Please note that [MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a document retrieval dataset we constructed via LLM,
-covering 13 languages, including test set, validation set, and training set.
-We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
-Therefore, comparing baselines with `Dense w.o.long` (fine-tuning without the long document dataset) is more equitable.
-Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
-We believe that this data will be helpful for the open-source community in training document retrieval models.

-  - NarrativeQA:
-  

-- Comparison with BM25

-We utilized Pyserini to implement BM25, and the test results can be reproduced by this [script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#bm25-baseline).
-We tested BM25 using two different tokenizers: one using the Lucene Analyzer and the other using the same tokenizer as M3 (i.e., the tokenizer of xlm-roberta).
-The results indicate that BM25 remains a competitive baseline, especially in long document retrieval.

-

-## Training
-- Self-knowledge Distillation: combining multiple outputs from different retrieval modes as a reward signal to enhance the performance of a single mode (sketched below; especially helpful for sparse retrieval and multi-vector (colbert) retrieval).
-- Efficient Batching: improves efficiency when fine-tuning on long text. The small-batch strategy is simple but effective and can also be used to fine-tune large embedding models.
-- MCLS: a simple method to improve performance on long text without fine-tuning. If you lack the resources to fine-tune the model with long text, this method is useful.

-Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.
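
To make the self-knowledge distillation idea concrete, here is a simplified sketch of the mechanism (our illustration, not the exact training loss from the report; the weighted sum mirrors `weights_for_different_modes` above):

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(dense_s, sparse_s, colbert_s, w=(0.4, 0.2, 0.4)):
    """Each *_s tensor holds query-candidate scores of shape [batch, num_candidates]."""
    # the integrated (ensemble) score of the three modes acts as the teacher signal
    teacher = w[0] * dense_s + w[1] * sparse_s + w[2] * colbert_s
    teacher_p = F.softmax(teacher.detach(), dim=-1)  # soft labels from the ensemble
    loss = 0.0
    for student in (dense_s, sparse_s, colbert_s):
        # cross-entropy of each single mode against the ensemble distribution
        loss = loss - (teacher_p * F.log_softmax(student, dim=-1)).sum(dim=-1).mean()
    return loss

# toy check with random scores
scores = [torch.randn(2, 4) for _ in range(3)]
print(self_distillation_loss(*scores))
```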
## Acknowledgement

-Thanks to the authors of open-sourced datasets.
-Thanks to the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron), [Pyserini](https://github.com/castorini/pyserini).

## Citation

-If you find this repository useful, please consider giving a star :star: and citation.

-```
@misc{bge-m3,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},

license: mit
---

+# Experimental Sparse Vector Repository

+This repository is a fork of the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) repository, aimed at creating sparse vectors. It is an experimental project based on the BGE-M3 model, which is known for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

+For more details, please refer to the original [github repo](https://github.com/FlagOpen/FlagEmbedding).

+## BGE-M3 Overview ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))

+BGE-M3 is a highly versatile embedding model that supports:
+- **Multi-Functionality**: Capable of dense retrieval, multi-vector retrieval, and sparse retrieval.
+- **Multi-Linguality**: Supports over 100 languages.
+- **Multi-Granularity**: Handles inputs from short sentences to long documents of up to 8192 tokens.

+## Retrieval Pipeline Recommendations

+We recommend using a hybrid retrieval + re-ranking pipeline:
+- **Hybrid Retrieval**: Combines embedding retrieval and the BM25 algorithm for higher accuracy and generalization. BGE-M3 supports both embedding and sparse retrieval, allowing token weights similar to BM25 without additional cost.
+  - Refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py) for hybrid retrieval examples.
+- **Re-Ranking**: Use cross-encoder models like [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker) or [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) for higher accuracy after retrieval; a re-ranking sketch follows this list.
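
As a minimal sketch of the re-ranking step (assuming the `FlagReranker` class from the same FlagEmbedding package; the checkpoint name is an example to swap for whatever re-ranker you deploy):

```python
from FlagEmbedding import FlagReranker

# Cross-encoder re-ranking of retrieved passages.
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

pairs = [['What is BGE M3?',
          'BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.'],
         ['What is BGE M3?',
          'BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document']]
scores = reranker.compute_score(pairs)
print(scores)  # higher score = more relevant; keep the top-ranked passages
```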

+## News

+- **2024/7/1**: Updated MIRACL evaluation results for BGE-M3. Refer to [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr) for details.
+- **2024/3/20**: Milvus now supports hybrid retrieval with BGE-M3. See [hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
+- **2024/3/8**: BGE-M3 achieves top performance in multilingual benchmarks. See this [article](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05).
+- **2024/3/2**: Released unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data).
+- **2024/2/6**: Released the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) dataset and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
+- **2024/2/1**: Vespa now supports multiple modes of BGE-M3. See this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb).

+## Model Specifications

+| Model Name | Dimension | Sequence Length | Introduction |
|:----:|:---:|:---:|:---:|
+| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | Multilingual; unified fine-tuning (dense, sparse, and colbert) |
+| [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised) | 1024 | 8192 | Multilingual; contrastive learning |
+| [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae) | -- | 8192 | Multilingual; extended max_length of xlm-roberta to 8192 |
+| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1024 | 512 | English model |
+| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 768 | 512 | English model |
+| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | 384 | 512 | English model |

+## Data

+| Dataset | Introduction |
+|:-------:|:------------:|
+| [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset covering 13 languages |
+| [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) | Fine-tuning data used by bge-m3 |
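
As a sketch, one language split of MLDR can be loaded with the `datasets` library (the config name `en` and the split are assumptions to adapt to the dataset card):

```python
from datasets import load_dataset

# Load one language configuration of MLDR (config/split names are examples).
mldr = load_dataset("Shitao/MLDR", "en", split="test")
print(mldr[0].keys())
```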
## FAQ

+### 1. Introduction to Different Retrieval Methods

+- **Dense Retrieval**: Maps text into a single embedding, e.g., [DPR](https://arxiv.org/abs/2004.04906) and [BGE-v1.5](https://github.com/FlagOpen/FlagEmbedding).
+- **Sparse Retrieval**: A vocabulary-sized vector that keeps non-zero weights only for tokens present in the text, e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720).
+- **Multi-Vector Retrieval**: Uses multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832). A sketch showing all three outputs from one call follows this list.
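
A small sketch showing all three representations from a single `encode` call (output shapes in the comments are what we expect for this model):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
out = model.encode(["What is BGE M3?"],
                   return_dense=True, return_sparse=True, return_colbert_vecs=True)

print(out['dense_vecs'].shape)       # dense retrieval: one 1024-dim vector per input
print(out['lexical_weights'][0])     # sparse retrieval: token-id -> weight map
print(out['colbert_vecs'][0].shape)  # multi-vector retrieval: one vector per token
```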

+### 2. How to Use BGE-M3 in Other Projects?

+For embedding retrieval, use the BGE-M3 model the same way as BGE; the only difference is that BGE-M3 no longer requires adding instructions to queries. For hybrid retrieval, refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).

+### 3. How to Fine-Tune the BGE-M3 Model?

+Follow this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) for dense embedding fine-tuning. For unified fine-tuning (dense, sparse, and colbert), refer to the [unified fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune).

## Usage

+### Installation

+```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```

+or

+```bash
pip install -U FlagEmbedding
```

+### Generate Embedding for Text

+#### Dense Embedding

```python
from FlagEmbedding import BGEM3FlagModel

+model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # use_fp16=True speeds up computation with a slight performance degradation

+sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

+embeddings_1 = model.encode(sentences_1, batch_size=12, max_length=8192)['dense_vecs']  # a smaller max_length speeds up encoding
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# [[0.6265, 0.3477], [0.3499, 0.678 ]]
```

+#### Sparse Embedding (Lexical Weight)

```python
from FlagEmbedding import BGEM3FlagModel

+model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

+sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# see the weight of each token
print(model.convert_id_to_token(output_1['lexical_weights']))

# compute the score via lexical matching
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.19554901123046875
```

+#### Multi-Vector (ColBERT)

```python
from FlagEmbedding import BGEM3FlagModel

+model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

+sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
```

+### Compute Score for Text Pairs

+Input a list of text pairs to get scores computed by different methods.

```python
from FlagEmbedding import BGEM3FlagModel

+model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

+sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

+sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]

# weights_for_different_modes does the weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score
+print(model.compute_score(sentence_pairs, max_passage_length=128, weights_for_different_modes=[0.4, 0.2, 0.4]))
```

+## Evaluation

+Evaluation scripts are provided for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).

## Acknowledgement

+Thanks to the authors of open-sourced datasets like Miracl, MKQA, NarrativeQA, and libraries like [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini).

## Citation

+If you find this repository useful, please consider giving a star :star: and citation:

+```bibtex
@misc{bge-m3,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},