p0x0q committed 09f2f11 (verified) · 1 Parent(s): a971cda

Update README.md

Files changed (1):
  1. README.md +70 -186

README.md CHANGED
@@ -7,175 +7,133 @@ tags:
  license: mit
  ---

- For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding

- # BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))

- In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
- - Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
- - Multi-Linguality: It can support more than 100 working languages.
- - Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.

- **Some suggestions for retrieval pipeline in RAG**

- We recommend to use the following pipeline: hybrid retrieval + re-ranking.
- - Hybrid retrieval leverages the strengths of various methods, offering higher accuracy and stronger generalization capabilities.
- A classic example: using both embedding retrieval and the BM25 algorithm.
- Now, you can try to use BGE-M3, which supports both embedding and sparse retrieval.
- This allows you to obtain token weights (similar to the BM25) without any additional cost when generate dense embeddings.
- To use hybrid retrieval, you can refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).

- - As cross-encoder models, re-ranker demonstrates higher accuracy than bi-encoder embedding model.
- Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker), [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)) after retrieval can further filter the selected text.

- ## News:
- - 2024/7/1: **We update the MIRACL evaluation results of BGE-M3**. To reproduce the new results, you can refer to: [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr). We have also updated our [paper](https://arxiv.org/pdf/2402.03216) on arXiv.
- <details>
- <summary> Details </summary>

- The previous test results were lower because we mistakenly removed the passages that have the same id as the query from the search results. After correcting this mistake, the overall performance of BGE-M3 on MIRACL is higher than the previous results, but the experimental conclusion remains unchanged. The other results are not affected by this mistake. To reproduce the previous lower results, you need to add the `--remove-query` parameter when using `pyserini.search.faiss` or `pyserini.search.lucene` to search the passages.

- </details>
- - 2024/3/20: **Thanks Milvus team!** Now you can use hybrid retrieval of bge-m3 in Milvus: [pymilvus/examples/hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
- - 2024/3/8: **Thanks for the [experimental results](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) from @[Yannael](https://huggingface.co/Yannael). In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as OpenAI.**
- - 2024/3/2: Release unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data)
- - 2024/2/6: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
- - 2024/2/1: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
-
- ## Specs
-
- - Model
-
- | Model Name | Dimension | Sequence Length | Introduction |
  |:----:|:---:|:---:|:---:|
- | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised|
- | [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised) | 1024 | 8192 | multilingual; contrastive learning from bge-m3-retromae |
- | [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae) | -- | 8192 | multilingual; extend the max_length of [xlm-roberta](https://huggingface.co/FacebookAI/xlm-roberta-large) to 8192 and further pretrained via [retromae](https://github.com/staoxiao/RetroMAE)|
- | [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1024 | 512 | English model |
- | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 768 | 512 | English model |
- | [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | 384 | 512 | English model |
-
- - Data
-
- | Dataset | Introduction |
- |:----------------------------------------------------------:|:-------------------------------------------------:|
- | [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages |
- | [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) | Fine-tuning data used by bge-m3 |

  ## FAQ

- **1. Introduction for different retrieval methods**
-
- - Dense retrieval: map the text into a single embedding, e.g., [DPR](https://arxiv.org/abs/2004.04906), [BGE-v1.5](https://github.com/FlagOpen/FlagEmbedding)
- - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
- - Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
-
- **2. How to use BGE-M3 in other projects?**
-
- For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE.
- The only difference is that the BGE-M3 model no longer requires adding instructions to the queries.
-
- For hybrid retrieval, you can use [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
-
- **3. How to fine-tune bge-M3 model?**
-
- You can follow the common in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
- to fine-tune the dense embedding.
-
- If you want to fine-tune all embedding function of m3 (dense, sparse and colbert), you can refer to the [unified_fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune)

  ## Usage

- Install:
- ```
  git clone https://github.com/FlagOpen/FlagEmbedding.git
  cd FlagEmbedding
  pip install -e .
  ```
- or:
- ```
  pip install -U FlagEmbedding
  ```

- ### Generate Embedding for text
-
- - Dense Embedding
  ```python
  from FlagEmbedding import BGEM3FlagModel

- model = BGEM3FlagModel('BAAI/bge-m3',
-                        use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

- sentences_1 = ["What is BGE M3?", "Defination of BM25"]
  sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
                 "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

- embeddings_1 = model.encode(sentences_1,
-                             batch_size=12,
-                             max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
-                             )['dense_vecs']
  embeddings_2 = model.encode(sentences_2)['dense_vecs']
  similarity = embeddings_1 @ embeddings_2.T
  print(similarity)
- # [[0.6265, 0.3477], [0.3499, 0.678 ]]
  ```
- You also can use sentence-transformers and huggingface transformers to generate dense embeddings.
- Refer to [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding#usage) for details.

- - Sparse Embedding (Lexical Weight)
  ```python
  from FlagEmbedding import BGEM3FlagModel

- model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

- sentences_1 = ["What is BGE M3?", "Defination of BM25"]
  sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
                 "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

  output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
  output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

- # you can see the weight for each token:
  print(model.convert_id_to_token(output_1['lexical_weights']))
- # [{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04092},
- #  {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633, 'BM': 0.2515, '25': 0.3335}]

- # compute the scores via lexical matching
  lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
  print(lexical_scores)
- # 0.19554901123046875
-
- print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
- # 0.0
  ```

- - Multi-Vector (ColBERT)
  ```python
  from FlagEmbedding import BGEM3FlagModel

- model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

- sentences_1 = ["What is BGE M3?", "Defination of BM25"]
  sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
                 "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

@@ -184,111 +142,37 @@ output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, retu
  print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
  print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
- # 0.7797
- # 0.4620
  ```

- ### Compute score for text pairs
- Input a list of text pairs, you can get the scores computed by different methods.
  ```python
  from FlagEmbedding import BGEM3FlagModel

- model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

- sentences_1 = ["What is BGE M3?", "Defination of BM25"]
  sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
                 "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

- sentence_pairs = [[i,j] for i in sentences_1 for j in sentences_2]
-
- print(model.compute_score(sentence_pairs,
-                           max_passage_length=128, # a smaller max length leads to a lower latency
-                           weights_for_different_modes=[0.4, 0.2, 0.4])) # weights_for_different_modes(w) is used to do weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score

- # {
- #   'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142],
- #   'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625],
- #   'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625],
- #   'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
- #   'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
- # }
  ```

- ## Evaluation
-
- We provide the evaluation script for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR)
-
- ### Benchmarks from the open-source community
- ![avatar](./imgs/others.webp)
- The BGE-M3 model emerged as the top performer on this benchmark (OAI is short for OpenAI).
- For more details, please refer to the [article](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) and [Github Repo](https://github.com/Yannael/multilingual-embeddings)
-
- ### Our results
- - Multilingual (Miracl dataset)
-
- ![avatar](./imgs/miracl.jpg)
-
- - Cross-lingual (MKQA dataset)
-
- ![avatar](./imgs/mkqa.jpg)
-
- - Long Document Retrieval
- - MLDR:
- ![avatar](./imgs/long.jpg)
- Please note that [MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a document retrieval dataset we constructed via LLM,
- covering 13 languages, including test set, validation set, and training set.
- We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
- Therefore, comparing baselines with `Dense w.o.long`(fine-tuning without long document dataset) is more equitable.
- Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
- We believe that this data will be helpful for the open-source community in training document retrieval models.
-
- - NarritiveQA:
- ![avatar](./imgs/nqa.jpg)
-
- - Comparison with BM25
-
- We utilized Pyserini to implement BM25, and the test results can be reproduced by this [script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#bm25-baseline).
- We tested BM25 using two different tokenizers:
- one using Lucene Analyzer and the other using the same tokenizer as M3 (i.e., the tokenizer of xlm-roberta).
- The results indicate that BM25 remains a competitive baseline,
- especially in long document retrieval.
-
- ![avatar](./imgs/bm25.jpg)

- ## Training
- - Self-knowledge Distillation: combining multiple outputs from different
- retrieval modes as reward signal to enhance the performance of single mode(especially for sparse retrieval and multi-vec(colbert) retrival)
- - Efficient Batching: Improve the efficiency when fine-tuning on long text.
- The small-batch strategy is simple but effective, which also can used to fine-tune large embedding model.
- - MCLS: A simple method to improve the performance on long text without fine-tuning.
- If you have no enough resource to fine-tuning model with long text, the method is useful.
-
- Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.

  ## Acknowledgement

- Thanks to the authors of open-sourced datasets, including Miracl, MKQA, NarritiveQA, etc.
- Thanks to the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron), [Pyserini](https://github.com/castorini/pyserini).

  ## Citation

- If you find this repository useful, please consider giving a star :star: and citation
-
- ```
  @misc{bge-m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},

  license: mit
  ---

+ # Experimental Sparse Vector Repository

+ This repository is a fork of the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) repository, aimed at creating sparse vectors. It is an experimental project based on the BGE-M3 model, which is known for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

+ For more details, please refer to the original [GitHub repo](https://github.com/FlagOpen/FlagEmbedding). A sketch of the sparse-vector workflow follows.
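
To make the fork's sparse-vector goal concrete, here is a minimal sketch (an editorial illustration, not part of the upstream README) of reading BGE-M3's lexical weights out as an explicit sparse vector; it assumes the FlagEmbedding package installed as described under Usage below:

```python
from FlagEmbedding import BGEM3FlagModel

# Minimal sketch: export BGE-M3 lexical weights as a sparse vector.
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

docs = ["BGE M3 supports dense, sparse and multi-vector retrieval."]
out = model.encode(docs, return_dense=False, return_sparse=True, return_colbert_vecs=False)

# 'lexical_weights' maps token ids to weights; tokens absent from the text
# implicitly have weight zero, so each entry is a sparse vector that an
# inverted index can consume directly.
sparse_vec = out['lexical_weights'][0]
print(model.convert_id_to_token([sparse_vec]))
```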

+ ## BGE-M3 Overview ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))

+ BGE-M3 is a highly versatile embedding model that supports:
+ - **Multi-Functionality**: Capable of dense retrieval, multi-vector retrieval, and sparse retrieval.
+ - **Multi-Linguality**: Supports over 100 languages.
+ - **Multi-Granularity**: Handles inputs from short sentences to long documents up to 8192 tokens.

+ ## Retrieval Pipeline Recommendations

+ We recommend using a hybrid retrieval + re-ranking pipeline:
+ - **Hybrid Retrieval**: Combines embedding retrieval and the BM25 algorithm for higher accuracy and generalization. BGE-M3 supports both embedding and sparse retrieval, providing BM25-like token weights at no additional cost.
+   - Refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py) for hybrid retrieval examples.
+ - **Re-Ranking**: Use cross-encoder models like [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker) or [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) for higher accuracy after retrieval; a sketch follows this list.
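
A hedged sketch of the re-ranking stage (our illustration: the `FlagReranker` class and the `BAAI/bge-reranker-large` checkpoint come from the FlagEmbedding reranker docs and should be verified there):

```python
from FlagEmbedding import FlagReranker

# Score (query, passage) pairs with a cross-encoder and keep the best passages.
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

query = "What is BGE M3?"
candidates = [
    "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
    "BM25 is a bag-of-words retrieval function.",
]
scores = reranker.compute_score([[query, c] for c in candidates])

# Sort passages by cross-encoder score, highest first.
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```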

+ ## News

+ - **2024/7/1**: Updated MIRACL evaluation results for BGE-M3. Refer to [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr) for details.
+ - **2024/3/20**: Milvus now supports hybrid retrieval with BGE-M3. See [hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
+ - **2024/3/8**: BGE-M3 achieves top performance in multilingual benchmarks. See [article](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05).
+ - **2024/3/2**: Released unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data).
+ - **2024/2/6**: Released [MLDR](https://huggingface.co/datasets/Shitao/MLDR) dataset and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
+ - **2024/2/1**: Vespa now supports multiple modes of BGE-M3. See [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb).

+ ## Model Specifications

+ | Model Name | Dimension | Sequence Length | Introduction |
  |:----:|:---:|:---:|:---:|
+ | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | Multilingual; unified fine-tuning (dense, sparse, and colbert) |
+ | [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised) | 1024 | 8192 | Multilingual; contrastive learning |
+ | [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae) | -- | 8192 | Multilingual; extended max_length of xlm-roberta to 8192 |
+ | [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1024 | 512 | English model |
+ | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 768 | 512 | English model |
+ | [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | 384 | 512 | English model |

+ ## Data

+ | Dataset | Introduction |
+ |:-------:|:------------:|
+ | [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset covering 13 languages |
+ | [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) | Fine-tuning data used by bge-m3 |

  ## FAQ

+ ### 1. Introduction to Different Retrieval Methods

+ - **Dense Retrieval**: Maps text into a single embedding.
+ - **Sparse Retrieval**: A vocabulary-sized vector with non-zero weights only for tokens present in the text (see the toy scoring function below).
+ - **Multi-Vector Retrieval**: Uses multiple vectors to represent a text.
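
To illustrate what sparse (lexical) matching computes, here is a toy scoring function (an editorial sketch, not library code): the score sums, over tokens shared by query and document, the product of their weights.

```python
def lexical_matching_score(query_weights: dict, doc_weights: dict) -> float:
    """Toy sparse-retrieval score: sum of weight products over shared tokens.

    Tokens missing from either side contribute zero, which is what makes
    the representation sparse.
    """
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

# Hand-made weights, illustrative values only:
q = {"bge": 0.4, "m3": 0.5}
d = {"bge": 0.3, "m3": 0.6, "embedding": 0.2}
print(lexical_matching_score(q, d))  # 0.4*0.3 + 0.5*0.6 = 0.42
```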

+ ### 2. How to Use BGE-M3 in Other Projects?

+ For embedding retrieval, use the BGE-M3 model similarly to BGE. For hybrid retrieval, refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).

+ ### 3. How to Fine-Tune the BGE-M3 Model?

+ Follow the [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) for dense embedding fine-tuning. For unified fine-tuning (dense, sparse, and colbert), refer to the [unified_fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune).

  ## Usage

+ ### Installation

+ ```bash
  git clone https://github.com/FlagOpen/FlagEmbedding.git
  cd FlagEmbedding
  pip install -e .
  ```
+
+ or
+
+ ```bash
  pip install -U FlagEmbedding
  ```

+ ### Generate Embedding for Text

+ #### Dense Embedding

  ```python
  from FlagEmbedding import BGEM3FlagModel

+ model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # use_fp16=True speeds up computation with a slight performance degradation

+ sentences_1 = ["What is BGE M3?", "Definition of BM25"]
  sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
                 "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

+ embeddings_1 = model.encode(sentences_1, batch_size=12, max_length=8192)['dense_vecs']  # a smaller max_length speeds up encoding
  embeddings_2 = model.encode(sentences_2)['dense_vecs']
  similarity = embeddings_1 @ embeddings_2.T
  print(similarity)
  ```
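
The original README also notes that dense embeddings can be generated with sentence-transformers or Hugging Face transformers. A minimal sentence-transformers sketch (assuming the model's sentence-transformers export works as for other BGE models; verify against the upstream docs):

```python
from sentence_transformers import SentenceTransformer

# Hedged alternative: load bge-m3's dense encoder via sentence-transformers.
model = SentenceTransformer("BAAI/bge-m3")

embeddings = model.encode(["What is BGE M3?", "Definition of BM25"])
print(embeddings.shape)  # expected (2, 1024): 1024-d vectors per the spec table
```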
 
 
+ #### Sparse Embedding (Lexical Weight)

  ```python
  from FlagEmbedding import BGEM3FlagModel

+ model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

+ sentences_1 = ["What is BGE M3?", "Definition of BM25"]
  sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
                 "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

  output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
  output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

  print(model.convert_id_to_token(output_1['lexical_weights']))

  lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
  print(lexical_scores)
  ```
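
For reference, the original README shows `convert_id_to_token` returning per-token weights such as `{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04092}` for the first query, and a lexical matching score of about 0.1955 for the related pair versus 0.0 between the two unrelated queries.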

+ #### Multi-Vector (ColBERT)

  ```python
  from FlagEmbedding import BGEM3FlagModel

+ model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

+ sentences_1 = ["What is BGE M3?", "Definition of BM25"]
  sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
                 "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

  output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
  output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

  print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
  print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
  ```
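
ColBERT-style scoring matches each query token vector against all document token vectors. A NumPy sketch of the idea (our illustration; `colbert_score` in FlagEmbedding is the authoritative implementation and may differ in normalization):

```python
import numpy as np

def late_interaction_score(q_vecs: np.ndarray, d_vecs: np.ndarray) -> float:
    """ColBERT-style MaxSim sketch.

    For each query token vector, take its best similarity over all document
    token vectors, then average over query tokens.
    """
    sim = q_vecs @ d_vecs.T  # (num_query_tokens, num_doc_tokens) similarity matrix
    return float(sim.max(axis=1).mean())

# Usage with BGE-M3 outputs (arrays of shape num_tokens x dim):
# score = late_interaction_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0])
```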

+ ### Compute Score for Text Pairs

  ```python
  from FlagEmbedding import BGEM3FlagModel

+ model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

+ sentences_1 = ["What is BGE M3?", "Definition of BM25"]
  sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
                 "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

+ sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]

+ print(model.compute_score(sentence_pairs, max_passage_length=128, weights_for_different_modes=[0.4, 0.2, 0.4]))
  ```
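
Per the comments in the original README, `max_passage_length` trades accuracy for latency (smaller is faster), and `weights_for_different_modes` fuses the three scores as a weighted sum: `w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score`. A worked example using the sample scores from the original README's output:

```python
# Weighted fusion of the three retrieval modes, as documented upstream.
w = [0.4, 0.2, 0.4]
dense_score, sparse_score, colbert_score = 0.6260, 0.1956, 0.7796  # sample values

hybrid = w[0] * dense_score + w[1] * sparse_score + w[2] * colbert_score
print(round(hybrid, 4))  # 0.6014, matching the 'colbert+sparse+dense' entry
```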

+ ## Evaluation

+ Evaluation scripts are provided for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).

  ## Acknowledgement

+ Thanks to the authors of open-sourced datasets like MIRACL, MKQA, and NarrativeQA, and libraries like [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini).

  ## Citation

+ If you find this repository useful, please consider giving it a star :star: and a citation:

+ ```bibtex
  @misc{bge-m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},